As mentioned in another comment, you really have to test different hashing algorithms to find one that suits your needs best. In general though, I think it is in most cases not necessary to develop an algorithm from scratch :)
For us, the much more challenging part was/is to develop a system that can find similar images quickly. If you are interested in things like that have a look at VP/MVP data structures (e.g. http://pnylab.com/pny/papers/vptree/vptree/, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7...).
It could be a great extension to Redis DSL.
I'm currently researching MVP's and reading on VP-trees, BK-trees [1], GNAT [2] and HEngine [3]. Do you have any advice?
[1] http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...
[2] http://www.vldb.org/conf/1995/P574.PDF
[3] https://www.cse.msu.edu/~alexliu/publications/HammingQuery/H...
The thing is though, you won't have difficulties finding papers on those topics. However, you will probably not have any luck finding many concrete and practical implementations that you could look at.
So it's a far way from reading the papers to having something working.
If you find something, please let me know.
To give a concrete example, I noticed recently that a lot of app store icons within a category look pretty similar. See for example this:
https://twitter.com/acslater00/status/450127865682489344/pho...
All of those checkmarks certainly seem to my naked eye like they the binary diff transformation would result in a very identical hash. It seems like the rounded corners in 'Finish' would blur out, and the texture in 'OmniFocus 2' would blur out, and the gradient in 'Clear' would look identical to a flat gradient on the right side of the checkmark.
Anyway, clever algorithm but curious how it works in practice on small icons?
Idealy we would like to move to a content-based image retrieval system where we would be able to search based on certain features that can be derived from the image itself (color, shape, texture for example) so we could fine tune our results.
Yes, the presented example is a curious case, if we take the first three icons there and compare them based on share and color we can see that their shape is identical but the background is different. Based on this, should we consider different or identical? You can't have too many variations on a simple shape like a checkmark or a facebook logo so what variations should you allow and which ones would you consider as copying previous work?
OpenCV has awesome Python bindings (cv2), btw.
[1] http://stackoverflow.com/questions/2146542/opencv-surf-how-t...
[1]http://www.wisdom.weizmann.ac.il/~bagon/CVspring07/files/sca... [2]http://www.cc.gatech.edu/~phlosoft/files/schindler07cvpr2.pd...
[1] http://docs.opencv.org/modules/features2d/doc/object_categor...
2) It's not too hard to program a Haar wavelet[1] transform[2] (basically iterative scaled thesholding). This has worked well over at IQDB[3], where they do reverse image lookups on databases of over several million images via a modified Haar wavelet.
You can't beat this algorithm for simplicity, though. Have you guys done any false positive checks with this algorithm? The saving grace might be that icons are fairly small and limited in detail/color.
[1] http://en.wikipedia.org/wiki/Haar_wavelet
[2] http://stackoverflow.com/questions/1034900/near-duplicate-im...
[3] http://iqdb.org/
[1] http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...
[2] http://en.wikipedia.org/wiki/SURF
[3] http://www-db.deis.unibo.it/research/papers/IWOSS99.ABP.pdf
[4] http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...
[5] http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...
[1] http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...
SELECT pk, hash, file_path FROM image_hashes
WHERE hash = '4c8e3366c275650f';
This is screaming to be stored in a bloom filter.Also, we will be integrating this into the reviewing process for an iconset, where we also do a manual quality check, showing possible matches to something currently uploaded so skimming over one or two false positives isn't such a big deal and we where more interested in the speed of the algorithm.
You could also hash forwards, backwards and starting at several random midpoints to prevent someone from simply changing the first block to throw off the hashing algorithm.
Plus, you still have to handle other formats like png, and get the same output from the same image.
Edit:
From the last paragraph. > we will be using the algorithm in the future on Iconfinder to warn us if a submitted icon already exists in our collection but it can also have other practical uses. For example, because images with similar features have the hamming distance of their hashes close, it can also be used as a basic recommendation system where the recommended images are within a certain hashing distance of the current image.
No need to say, in Java, or in Python. It can be done in any lang.