> It’s unclear which image data Mistral might have used to develop Pixtral 12B.
The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (API restrictions) and legal (copyright) measures building deep moats. I also wonder what they trained it on. They're not Meta or Google with endless supplies of user content, or exclusive contracts with the Reddits of the internet.
My hunch is that most AI labs are already sitting on a pretty sizable collection of scraped image data - and that data from two years ago will be almost as effective as data scraped today, at least as far as image training goes.
I have multiple ad-blockers running, how am I different from a bot scouring the “free” web? I get the idea of copyright and creators wanting to be paid for their content. However, I think there are plenty of human users out there not “paying” for “free” content either. Which one is a greater loss of revenue? A collection of over a million humans? Or 100 or so corporate bots?
I would say the opposite, it has never been easier to collect a huge amount of data, in particular if you have a target, also you don't even need to write a line of code if you are good at explaining Claude 3.5 Sonnet what you want to achieve and the details.
img2dataset also exists
1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back.
2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for understanding images, probably use Qwen2. If you want a large open-source model, Qwen2-VL-72B is most likely the best option.
Only the 2&7B have been "open sourced". From your link:
>We opensource Qwen2-VL-2B and Qwen2-VL-7B with Apache 2.0 license, and we release the API of Qwen2-VL-72B!
New Mistral AI Weights
Also, can your model of choice understand your requests to include/omit particular nuances of an image?
For example, I have a couple way-too-wordy captions made with another captioner, which I'd like to cut down to the essentials while correcting any mistakes. Qwen2 is completely ignoring images with this approach, and decides to only focus on the given caption, which makes it unable to even remotely fix issues in said caption.
I am really hoping Pixtral will be better for instruction following. But I haven't been able to run it because they didn't prioritize transformers support, which in turn has hindered the release of any quantized versions to make it fit on consumer hardware.
I don’t believe you can really prompt it though, but the other models where I could also didn’t work well on that front anyways.
TagGui is an easy way to try out a bunch of models.
finetune/make_captions.py ... \
--num_beams=12 \
--top_p=0.9 \
--max_length=75 \
--min_length=24 \
--beam_search \
...
With this, it's very often that I just take its caption as is, or add little.TagGui
Oh, interesting, thanks!
Like writing on an ePaper tablet, exporting the PDF and feed this into this model to extract todos from notes for example.
Or what would be the SotA for this application?
Probably not on the device itself but I would love that use case as well. At least going to my own server. I’d want to protect notes in particular, which is why I don’t do any cloud backup on my RM2. But some self hosted, AI assisted OCR workflows could be really nice.
For a general knowledge chatbot it doesn't know much of course, but its a good worker bee.