The next version can see the image and read the metadata.
A bit more context: we include everything in the latent space (embeddings) rather than trying to maintain multiple indexes and hack around things. There is still a huge mountain to climb, but this one seems really promising.
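
To make the "one latent space, one index" idea concrete, here is a rough sketch, not the actual implementation. It assumes image and metadata embeddings already live in the same vector space; the `SingleIndex` class, the item ids, and the 3-dimensional vectors are all made up for illustration, and a real system would get its vectors from an encoder rather than hand-written lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SingleIndex:
    """Hypothetical index: every modality goes into the same list of vectors."""

    def __init__(self):
        self.items = []  # (item_id, kind, vector)

    def add(self, item_id, kind, vector):
        self.items.append((item_id, kind, vector))

    def search(self, query, k=3):
        # One similarity search over everything, regardless of modality.
        scored = [(cosine(query, vec), item_id, kind)
                  for item_id, kind, vec in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:k]


# Toy vectors standing in for real embeddings (assumed, not measured).
index = SingleIndex()
index.add("photo-1", "image", [0.9, 0.1, 0.0])
index.add("photo-1", "metadata", [0.8, 0.2, 0.1])
index.add("photo-2", "image", [0.0, 0.2, 0.9])

results = index.search([1.0, 0.0, 0.0], k=2)
for score, item_id, kind in results:
    print(f"{item_id} ({kind}): {score:.3f}")
```

The point of the sketch is only that a single query hits both the image and its metadata without a second index or any merging logic in between.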