This is colossal. It can create embeddings for pretty much any type of format: video, audio, documents. The context window is still a bit small compared to what we are used to in text, but this seems major.
How does it compare with Qwen's open-weight multimodal embedding model? Anyone know? This seems lesser from what I read, with the added drawback of being behind an API/model I don't have control over. Qwen gives great embeddings out of the gate while also being steerable, i.e. you can supply a prompt to focus the embedding on specific tasks with higher resolution, which in my tests has been mind-blowingly good. Not seeing the value add here.
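For context on the steering point: instruction-aware embedding models typically take a task description prepended to the query text, while documents are embedded as-is. A minimal sketch of that prompt format (based on the convention Qwen's embedding models document; the task and query strings here are made-up examples):

```python
def build_query(task: str, text: str) -> str:
    # The task instruction is prepended to the *query* only;
    # documents/passages are embedded without any instruction.
    return f"Instruct: {task}\nQuery: {text}"

q = build_query(
    "Given a caption, retrieve matching microscopy images",  # example task
    "mitochondria under fluorescence staining",              # example query
)
print(q)
```

Swapping the task string changes what the resulting embedding emphasizes, which is the "steering" being described.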
The steerability point is interesting. Have you tried using task-specific prompts for cross-modal retrieval, though? Like searching images with text queries. Curious whether Qwen's prompt-based steering actually helps there or if it mainly improves same-modality tasks. The 3072-dim space seems tight for encoding all those modalities well.
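For what it's worth, once everything lives in one shared space, cross-modal retrieval is just nearest-neighbor search over unit vectors, whichever model produced them. A toy sketch with random stand-in embeddings (not real model outputs; `DIM = 3072` only mirrors the dimension mentioned above, and the query is deliberately nudged toward one image so the example has a known answer):

```python
import numpy as np

DIM = 3072  # shared embedding dimension discussed above
rng = np.random.default_rng(0)

def normalize(v):
    # Unit-normalize so a dot product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for image embeddings from a multimodal encoder.
image_embs = normalize(rng.normal(size=(5, DIM)))

# Stand-in for a text-query embedding, nudged toward image 3
# so the toy retrieval has a predictable result.
noise = normalize(rng.normal(size=DIM))
query = normalize(0.9 * image_embs[3] + 0.1 * noise)

scores = image_embs @ query    # cosine similarities, shape (5,)
best = int(np.argmax(scores))  # index of the best-matching image
print(best)                    # -> 3
```

Whether a steering prompt helps here comes down to whether it moves the *text* embedding closer to the right region of the image side of the space, which is exactly what's hard to predict without testing.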
It does well in my tests, limited as they were. In particular it held up on zero-shot tasks in niche domains that have historically been (and possibly still are) underrepresented in training data, e.g. microscopy.