At Avy.ai we're running small (2B-7B, quantized) vision models as part of a Mac desktop application that understands what someone is working on in the moment so it can offer them related information and actions.
We found that a light LoRA fine-tune doesn't substantially change the raw image-understanding results -- but it greatly improves the model's ability to follow instructions: outputting structured data in response to the image, at the level of verbosity and detail we need. Without fine-tuning, the models at the smaller end of that scale would be much harder to use, since they don't reliably produce output matching what the consuming application expects.
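To make that failure mode concrete, here's a minimal sketch of the kind of validation gate a consuming application ends up needing; the schema, field names, and fallback behavior are hypothetical illustrations, not Avy.ai's actual format:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema -- stands in for whatever structure the desktop
# app actually expects from the vision model.
class ScreenContext(BaseModel):
    app_name: str
    task_summary: str
    entities: list[str]

def parse_model_output(raw: str) -> ScreenContext | None:
    """Validate the model's raw text against the expected schema.

    Without fine-tuning, small models often wrap the JSON in extra
    prose, truncate it, or invent keys, so every response has to be
    checked before the application consumes it.
    """
    try:
        return ScreenContext.model_validate_json(raw)
    except ValidationError:
        return None  # caller can retry, re-prompt, or fall back
```

Fine-tuning doesn't eliminate the need for a gate like this, but it moves the failure rate from "constant retries" to "rare edge case", which is what makes the 2B-7B range practical.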