I developed this after I saw how poorly Google (and every other reverse image search engine I tested, like Bing/TinEye) performed on rotated images. For example, Google doesn't match this image [0] to its original [1] (Google's neural network will find that it is a cat but won't match it to the original). After playing around with trying to make my algorithm (uniform/nonuniform) scale- and rotation-invariant, I found that I was able to make it fully 2D affine invariant (pretty much by luck).
I developed all of this in my spare time, actively for about a year. I also have a C++ implementation that lets you reproduce all of this, but it's just a proof of concept and so it's quite slow. You can check that out on my GitHub here [2]. There are loads of ways this can be improved, but I wanted to get the idea out quickly.
[0] https://github.com/pippy360/transformationInvariantImageSear...
[1] https://github.com/pippy360/transformationInvariantImageSear...
[2] https://github.com/pippy360/transformationInvariantImageSear...
Great work there!
One question: why are we transforming the triangles to equilateral triangles?
Just throwing out a few ideas; I wanted to discuss what potential flaws they might have.
1. For keypoint detection, what if you use ASIFT (Affine-SIFT), which is friendly to affine transformations? In that case, you'd probably save time on the rotations. Given the huge number of proposals we get, we might still need to filter out some keypoint proposals, maybe using a metric like: if too many keypoints fall within some 2D window, choose the one that is farthest from all edges of the window (very rough; I don't know if there are other ways).
2. With the final set of keypoints, I propose we do a Delaunay triangulation, in the hope of getting a collection of triangles that covers the complete surface area of the image, making it a spatially equidistant breakdown of the image pixels.
3. Hash those triangles (maybe? how?)
4. Now, given a query image, perform steps 1-3 and find the triangles that match the triangles from a database image. If the fraction of matches is above a given threshold, then this is a potential candidate for the search result.
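Steps 2-4 could be sketched roughly like this. This is only my own illustration, assuming numpy/scipy are available and using a crude side-ratio signature in place of a real triangle hash (so it's similarity-invariant, not affine-invariant):

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def triangle_signature(pts, decimals=3):
    # Crude stand-in for a real triangle hash: the two side ratios
    # a/b and b/c (sides sorted a >= b >= c), rounded for bucketing.
    # Invariant to rotation/translation/uniform scale, NOT to shear.
    a, b, c = sorted((np.linalg.norm(u - v) for u, v in combinations(pts, 2)),
                     reverse=True)
    return (round(a / b, decimals), round(b / c, decimals))

def triangle_signatures(keypoints):
    # Step 2: Delaunay triangulation of the keypoints;
    # step 3: one signature per resulting triangle.
    tri = Delaunay(keypoints)
    return {triangle_signature(keypoints[s]) for s in tri.simplices}

def match_fraction(query_kps, db_kps):
    # Step 4: fraction of query triangles also present in the database image.
    q, d = triangle_signatures(query_kps), triangle_signatures(db_kps)
    return len(q & d) / max(len(q), 1)
```

A rotated copy of the same keypoint set should give a match fraction of 1.0, since both the Delaunay triangulation and the side ratios are rotation-invariant.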
This is so that the extracted image fragments/triangles are always very similar and so have extremely similar hashes. If I didn't transform the triangles then, for example, a fragment of a query image might be stretched compared with the matching fragment in the database. Intuitively, it's easier to match image fragments when they have the same scale and aren't stretched versions of each other. The transformation to equilateral triangles is what makes the algorithm 2D affine-invariant.
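To make the "transform to equilateral" step concrete, here's a minimal sketch (my own illustration, not the author's code) that solves for the 2x3 affine matrix sending an arbitrary triangle onto a fixed equilateral target. Any two affine-related fragments land in the same canonical frame, which is exactly why their hashes come out nearly identical:

```python
import numpy as np

# Fixed equilateral target triangle (unit base, apex at height sqrt(3)/2).
TARGET = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def affine_to_equilateral(tri):
    # tri: 3x2 array of triangle vertices.
    # Solve src @ X = TARGET for X, where src holds [x, y, 1] rows;
    # A = X.T is then the 2x3 affine matrix with A @ [x, y, 1]^T = target.
    src = np.hstack([tri, np.ones((3, 1))])
    return np.linalg.solve(src, TARGET).T
```

In a real implementation you'd feed this matrix to an image-warping routine (e.g. an OpenCV-style warpAffine) to resample the fragment's pixels into the canonical triangle.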
>what if you use ASIFT (Affine SIFT) which is Affine transformation friendly
I only discovered ASIFT while reading this thread; it looks very interesting and useful, but I'll need to look into it more. I did test SIFT as a keypoint finder and it performed really poorly for what I wanted to do, though I didn't do a lot of testing. My current method of finding keypoints is a load of crap that I threw together for this demo; a better one would seriously improve the accuracy and speed.
>like if too many keypoints are within some 2D window/I propose that let us do Delaunay Triangulation
I looked into both Delaunay triangulation and a 2D window as ways to decrease the number of proposals/triangles generated, but had problems with both. I still plan to try implementing a 2D window later. As for Delaunay triangulation, if you play around with it on paper you can see that it's not 2D affine invariant; it might still work well for small transformations, but there are affine transformations where it fails. Still, I think it would be worth looking into more.
The big problem is, a moderately complex image can yield n³ triangles, so you have to limit your feature space in a way that would have consistent behavior across transformed versions of the image.
This is a big problem I have been facing. You will notice that all the images I use in my C++ demos are small for this reason. I'm trying to solve this scaling issue, but I wasn't able to find a solution quickly, so I decided to release this and continue working on it, hopefully releasing an improved version at some point (or letting other people work on it and hopefully find solutions).
For example, if you have 50 overlapping triangles, you have to decide which 49 to remove, and you have to remove the same 49 in both the query image and the matching image. But because we want to do our searches in sublinear time, we can't compare the two images to decide which triangles should stay or be removed; you have to do all of that in preprocessing, before inserting into the database/querying. Delaunay triangulation looks almost perfect, but it isn't invariant to 2D affine transformations.
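The non-invariance is easy to demonstrate: a nonuniform scale (which is affine) can flip which diagonal the Delaunay triangulation picks for a quadrilateral. A small sketch, assuming scipy is available (the specific coordinates are just an example I picked so the flip happens):

```python
import numpy as np
from itertools import combinations
from scipy.spatial import Delaunay

def edges(points):
    # Set of undirected edges in the Delaunay triangulation of the points.
    tri = Delaunay(points)
    return {frozenset(e) for s in tri.simplices for e in combinations(s, 2)}

flat = np.array([[0, 0], [10, 0], [5, 1], [5, -1]], dtype=float)
tall = flat * [1, 10]  # nonuniform (affine) scale in y only

# The flat quad is triangulated using the short diagonal (points 2-3);
# after the stretch, Delaunay switches to the long diagonal (points 0-1),
# so the two triangulations are combinatorially different.
```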
Reminds me of nova.astrometry.net and PixInsight (and others), which are able to determine which stars are in an image, regardless of transforms.
Relevant paper (outlines an approach using triangle space): https://hal.inria.fr/inria-00548539/document
Relevant section: "Provided the reduced object lists, our next step is to construct all possible triangles. From the list of n objects we can construct n(n − 1)(n − 2)/6 triangles. Those will be represented in so called "triangle space". We need to choose a triangle representation that will let us find similar triangles, that is to find corresponding object triplets being insensitive to translation, rotation, scaling and flipping. A triangle in the triangle space could be represented as a two-dimensional point (x, y) where x = a/b, y = b/c.
a, b and c are the lengths of triangle sides in decreasing order. Similar triangles will be located close to each other in the triangle space."
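The triangle-space representation from that quote is a few lines of code. A sketch of my own (not from the paper's implementation), showing that the (a/b, b/c) point is unchanged by translation, rotation, uniform scaling and flipping:

```python
import numpy as np
from itertools import combinations

def triangle_space(p0, p1, p2):
    # Map a triangle to the 2D point (a/b, b/c), where a >= b >= c are
    # its side lengths, as in the quoted astrometry approach. Similar
    # triangles land close together in this space.
    a, b, c = sorted((np.linalg.norm(np.asarray(u) - np.asarray(v))
                      for u, v in combinations((p0, p1, p2), 2)),
                     reverse=True)
    return (a / b, b / c)
```

Note this is similarity-invariant, not affine-invariant: a shear changes the side ratios, which is the same limitation as Delaunay-based filtering above.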
Have you considered doing other common transformations like various Instagram filters and compression loss? If you had those abilities combined with the current algorithm you could have the kernel of a photo copyright protection system . . .
There are various ways to deal with this in ML (I'm not really an expert in ML), but AFAIU, in most cases you get candidate transforms via model-based methods and then make the decision using ML-based methods trained to allow for limited transformations (convolutional nets are good at that).
OkCupid talked about this in an article about image hashing [0], and it has a nice quote:
"The end-to-end approach of training a Convolutional Net to embed images such that the positive / negative case distances separate nicely is probably the cool new hypebeast way forward. This would have the added advantage of being able to handle whatever transformations you can think of at training time."
[0] https://tech.okcupid.com/evaluating-perceptual-image-hashes-...
What research is out there on general Affine invariant vision algorithms and techniques?
For context, my practical experience ended in the mid-2000s, when "jittering" was a bit of a thing, along with the occasional closed-form estimator based on basic linearization approximations.
Yeah, the query image is broken into a number of fragments, and then the hash of each fragment/triangle is checked against the hashes stored in the database using a nearest neighbor search (well, you find any neighbors within a certain distance/threshold; I found a Hamming distance of 8 to be a pretty good threshold).
Currently only one fragment of the query image needs to match a database image for the images to be considered a match. You could also put a threshold on the number of fragments that need to match, but this is only a proof of concept and I haven't had time to find the perfect combination yet.
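The matching step described above can be sketched in a few lines (hash values and the `db` layout here are hypothetical; the distance-8 threshold is the one mentioned above). A real system would use an approximate nearest-neighbor index rather than this linear scan:

```python
def hamming(h1: int, h2: int) -> int:
    # Number of differing bits between two integer fragment hashes.
    return bin(h1 ^ h2).count("1")

def matches(query_hash, db, threshold=8):
    # db: iterable of (image_id, fragment_hash) pairs.
    # An image matches if at least one of its fragment hashes is
    # within the Hamming-distance threshold of the query fragment.
    return {img_id for img_id, h in db if hamming(query_hash, h) <= threshold}
```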
>could this be extended to colour transformations
Yeah, actually, after you have re-transformed the triangles to be equilateral you can apply a lot of other techniques to further improve the search. You can even use non-affine partial image matching on the triangles, to match fragments that only partially match fragments stored in the database.
I'm currently working on an improved method for getting 2D affine invariant keypoints because the current one just barely works. After that I will be working on handling the scaling issue.
When I've done it in the past, I've created a 5x5 greyscale thumbnail of each image and then done bitwise comparisons. You can do more complex stuff like normalisation of brightness. Really, the main thing to focus on is the vectorisation, so you can quickly compare truckloads of images.
If I were to do it now, I'd probably use the width:height ratio to reduce the search space and then quickly check the 4 rotations of my sample hash against all known hashes with the same ratio. And I'd probably start by finding all images under a certain size, assume they were thumbnails, and try to find their matching originals so I could remove them from the data set.
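A toy version of that scheme, as a sketch of my own (operating on an already-downscaled 5x5 greyscale array rather than doing the actual image resizing):

```python
import numpy as np

def thumb_hash(thumb):
    # 25-bit hash of a 5x5 greyscale thumbnail: bit set where the pixel
    # is brighter than the mean (a crude brightness normalisation).
    bits = (thumb > thumb.mean()).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)

def rotation_hashes(thumb):
    # Hashes of the four 90-degree rotations, so a rotated copy of an
    # image still matches one of the stored hashes.
    return {thumb_hash(np.rot90(thumb, k)) for k in range(4)}
```

Comparing the query's hash against each stored `rotation_hashes` set covers the four-rotations check; arbitrary-angle rotations and crops would defeat it, which is where the triangle approach above comes in.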
It depends on what exactly has happened to your library in the first place but I'd be surprised if you had lots of images that had undergone arbitrary transformations. Though obviously, only you know what that data looks like to start with! :-)