I'm using small and medium.
Also, the code for using it is very short and simple. You can also use ChatGPT to generate small experiments to see what fits your case better.
They are super fast.
It's just an alternative I'm mentioning. I'm assuming the person knows a little about that domain.
Otherwise, the first option would be CLIP, I assume. llm-vl is just super slow and compute-intensive.