Mistral releases Pixtral 12B, its first multimodal model (opens in new tab)

(techcrunch.com)

163 pointsjerbear43281y ago40 comments

40 comments

The "Mistral Pixtral multimodal model" really rolls off the tongue.

> It’s unclear which image data Mistral might have used to develop Pixtral 12B.

The days of free web scraping especially for the richer sources of material are almost gone, with anything between technical (API restrictions) and legal (copyright) measures building deep moats. I also wonder what they trained it on. They're not Meta or Google with endless supplies of user content, or exclusive contracts with the Reddits of the internet.

simonw1y ago

What do you mean by copyright measures? Has anything changed on that front in the last two years?

My hunch is that most AI labs are already sitting on a pretty sizable collection of scraped image data - and that data from two years ago will be almost as effective as data scraped today, at least as far as image training goes.

dartos1y ago

The issue with image models is that their style becomes identifiable and stale quite quickly, so you’ll need a fresh intake of different, newer, styles every so often and that’s going to be harder and harder to get.

4 more replies

bronco210161y ago

At what point does an agent sitting at a browser collecting information differ from a human?

I have multiple ad-blockers running, how am I different from a bot scouring the “free” web? I get the idea of copyright and creators wanting to be paid for their content. However, I think there are plenty of human users out there not “paying” for “free” content either. Which one is a greater loss of revenue? A collection of over a million humans? Or 100 or so corporate bots?

a21281y ago

Humans use Google Chrome from their home IP address that isn't on any blacklists, and they're always happy to make an account and download an app instead of accessing a website. Or at least that's what companies think humans are

GaggiX1y ago

>The days of free web scraping especially for the richer sources of material are almost gone

I would say the opposite, it has never been easier to collect a huge amount of data, in particular if you have a target, also you don't even need to write a line of code if you are good at explaining Claude 3.5 Sonnet what you want to achieve and the details.

jazzyjackson1y ago

You don't need a contract with reddit to scrape it, you can just add `.json` to any url and you'll get the entire thread as one object.

8n4vidtmkvmk1y ago

They have very heavy rate limits on their 1st party api now. I can't even delete my own content, nevermind scrape.

1 more reply

htrp1y ago

there are torrents all over the internet of AI training data for images and video....

img2dataset also exists

reissbaker1y ago

Couple notes for newcomers:

1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back.

2. It seems like Pixtral 12B benchmarks significantly below Qwen2-VL-7B [1], so if you want the best local model for understanding images, probably use Qwen2. If you want a large open-source model, Qwen2-VL-72B is most likely the best option.

1: https://qwenlm.github.io/blog/qwen2-vl/

Jackson__1y ago

>If you want a large open-source model, Qwen2-VL-72B is most likely the best option.

Only the 2&7B have been "open sourced". From your link:

>We opensource Qwen2-VL-2B and Qwen2-VL-7B with Apache 2.0 license, and we release the API of Qwen2-VL-72B!

aucisson_masque1y ago

Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it.

seydor1y ago

Meta too. Openai is an ironic name now

ChrisArchitect1y ago

Related earlier:

New Mistral AI Weights

https://news.ycombinator.com/item?id=41508695

azinman21y ago

I’d love to know how much money Mistral is taking in versus spending. I’m very happy for all these open weights models, but they don’t have Instagram to help pay for it. These models are expensive to build.

candiddevmike1y ago

No license with this one yet, though you can probably assume it's Apache like the others.

mdasen1y ago

The article says they confirmed it's Apache via email

wruza1y ago

A question for sd lora trainers, is this usable for making captions and what are you using, apart from BLIP?

Also, can your model of choice understand your requests to include/omit particular nuances of an image?

Jackson__1y ago

I like Qwen2-VL 7B because it outputs shorter captions with less fluff. But if you need to do anything advanced that relies on reasoning and instruction following the model completely falls flat on it's face.

For example, I have a couple way-too-wordy captions made with another captioner, which I'd like to cut down to the essentials while correcting any mistakes. Qwen2 is completely ignoring images with this approach, and decides to only focus on the given caption, which makes it unable to even remotely fix issues in said caption.

I am really hoping Pixtral will be better for instruction following. But I haven't been able to run it because they didn't prioritize transformers support, which in turn has hindered the release of any quantized versions to make it fit on consumer hardware.

AuryGlenz1y ago

I’m no expert but Florence2 has been my go-to. It’s pretty great at picking up art styles and IP stuff - “The image depicts Goku from the anime series Dragonball Z…”

I don’t believe you can really prompt it though, but the other models where I could also didn’t work well on that front anyways.

TagGui is an easy way to try out a bunch of models.

wruza1y ago

Yeah, blip mostly ignores prompt too. I tried to disassemble it and feed my prompts, to no avail. Although I found that default kohya gui arguments are not even remotely the best. Here's my args:

  finetune/make_captions.py ... \
    --num_beams=12 \
    --top_p=0.9 \
    --max_length=75 \
    --min_length=24 \
    --beam_search \
    ...

With this, it's very often that I just take its caption as is, or add little.

TagGui

Oh, interesting, thanks!

Flockster1y ago

Could this be used for a selfhosted handwritten text recognition instance?

Like writing on an ePaper tablet, exporting the PDF and feed this into this model to extract todos from notes for example.

Or what would be the SotA for this application?

tonygiorgio1y ago

> the 12-billion-parameter model is about 24GB in size

Probably not on the device itself but I would love that use case as well. At least going to my own server. I’d want to protect notes in particular, which is why I don’t do any cloud backup on my RM2. But some self hosted, AI assisted OCR workflows could be really nice.

jhgg1y ago

Try out https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

whimsicalism1y ago

if you have a 3090, you could self host

edude031y ago

12B is pretty small, so I’m doubting it’ll be anywhere close to internvl2 however mistral does great work and likely this model is still useful for on device tasks

Jackson__1y ago

It appears to be slightly worse than Qwen2VL 7B, a model almost half it's size, if you look at the Qwen's official benchmarks instead of Mistral's.

https://xcancel.com/_philschmid/status/1833954941624615151

kaoD1y ago

But Qwen is not multimodal, or is it?

1 more reply

jazzyjackson1y ago

I've found llama 3.1 8B to be effective at transforming unstructured text into structured data, now that LM Studio accepts a json schema parameter.

For a general knowledge chatbot it doesn't know much of course, but its a good worker bee.

j / k navigate · click thread line to collapse

40 comments

buran771y ago

The "Mistral Pixtral multimodal model" really rolls off the tongue.

> It’s unclear which image data Mistral might have used to develop Pixtral 12B.

simonw1y ago

What do you mean by copyright measures? Has anything changed on that front in the last two years?

dartos1y ago

4 more replies

bronco210161y ago

At what point does an agent sitting at a browser collecting information differ from a human?

a21281y ago

GaggiX1y ago

>The days of free web scraping especially for the richer sources of material are almost gone

jazzyjackson1y ago

You don't need a contract with reddit to scrape it, you can just add `.json` to any url and you'll get the entire thread as one object.

8n4vidtmkvmk1y ago

They have very heavy rate limits on their 1st party api now. I can't even delete my own content, nevermind scrape.

1 more reply

htrp1y ago

there are torrents all over the internet of AI training data for images and video....

img2dataset also exists

reissbaker1y ago

Couple notes for newcomers:

1. This is a VLM, not a text-to-image model. You can give it images, and it can understand them. It doesn't generate images back.

1: https://qwenlm.github.io/blog/qwen2-vl/

Jackson__1y ago

>If you want a large open-source model, Qwen2-VL-72B is most likely the best option.

Only the 2&7B have been "open sourced". From your link:

>We opensource Qwen2-VL-2B and Qwen2-VL-7B with Apache 2.0 license, and we release the API of Qwen2-VL-72B!

aucisson_masque1y ago

Mistral being more open than 'openai' is kind of a meme. How can a company call itself open while it refuses to openly distribute it's product and when competitor are actually doing it.

seydor1y ago

Meta too. Openai is an ironic name now

ChrisArchitect1y ago

Related earlier:

New Mistral AI Weights

https://news.ycombinator.com/item?id=41508695

azinman21y ago

candiddevmike1y ago

No license with this one yet, though you can probably assume it's Apache like the others.

mdasen1y ago

The article says they confirmed it's Apache via email

wruza1y ago

A question for sd lora trainers, is this usable for making captions and what are you using, apart from BLIP?

Also, can your model of choice understand your requests to include/omit particular nuances of an image?

Jackson__1y ago

AuryGlenz1y ago

I’m no expert but Florence2 has been my go-to. It’s pretty great at picking up art styles and IP stuff - “The image depicts Goku from the anime series Dragonball Z…”

I don’t believe you can really prompt it though, but the other models where I could also didn’t work well on that front anyways.

TagGui is an easy way to try out a bunch of models.

wruza1y ago

Yeah, blip mostly ignores prompt too. I tried to disassemble it and feed my prompts, to no avail. Although I found that default kohya gui arguments are not even remotely the best. Here's my args:

  finetune/make_captions.py ... \
    --num_beams=12 \
    --top_p=0.9 \
    --max_length=75 \
    --min_length=24 \
    --beam_search \
    ...

With this, it's very often that I just take its caption as is, or add little.

TagGui

Oh, interesting, thanks!

Flockster1y ago

Could this be used for a selfhosted handwritten text recognition instance?

Like writing on an ePaper tablet, exporting the PDF and feed this into this model to extract todos from notes for example.

Or what would be the SotA for this application?

tonygiorgio1y ago

> the 12-billion-parameter model is about 24GB in size

jhgg1y ago

Try out https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct

whimsicalism1y ago

if you have a 3090, you could self host

edude031y ago

12B is pretty small, so I’m doubting it’ll be anywhere close to internvl2 however mistral does great work and likely this model is still useful for on device tasks

Jackson__1y ago

It appears to be slightly worse than Qwen2VL 7B, a model almost half it's size, if you look at the Qwen's official benchmarks instead of Mistral's.

https://xcancel.com/_philschmid/status/1833954941624615151

kaoD1y ago

But Qwen is not multimodal, or is it?

1 more reply

jazzyjackson1y ago

I've found llama 3.1 8B to be effective at transforming unstructured text into structured data, now that LM Studio accepts a json schema parameter.

For a general knowledge chatbot it doesn't know much of course, but its a good worker bee.

j / k navigate · click thread line to collapse