"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task's training data as described in the corresponding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"
So this is comparing a smaller model finetuned per benchmark to larger models that I presume are not, though I have not read the PaLI-X paper.
Edit - No, I was wrong, PaLI-X is also fine-tuned before each task/set of tasks.
Impressive improvement!!!
It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!". No model, no reasonable comparisons, just generic bragging.
I'm astounded as an ML researcher at how Google can be doing so incredibly badly. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open-source community. They constantly brag about how good their models are (even in private), and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).
You can see why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.
Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
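For what it's worth, there is already a common trick for "vision output as text": quantizing coordinates into discrete tokens, in the spirit of Pix2Seq-style detection. A minimal sketch (the bin count and `<loc_…>` token format here are my own illustrative assumptions, not any particular model's vocabulary):

```python
def box_to_tokens(box, num_bins=1000):
    """Map a normalized (x0, y0, x1, y1) box in [0, 1] to coordinate tokens.

    Each coordinate is quantized into one of num_bins discrete bins and
    rendered as a pseudo-token a language model could emit.
    """
    return [f"<loc_{min(int(c * num_bins), num_bins - 1)}>" for c in box]

tokens = box_to_tokens((0.1, 0.25, 0.5, 0.75))
# -> ['<loc_100>', '<loc_250>', '<loc_500>', '<loc_750>']
```

A dense per-pixel mask is a much longer sequence than four box tokens, which is exactly where the skepticism above seems warranted.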
"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."
I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooling layer involves taking the first token output from the layer norm and running it through a dense layer. So, it looks like in PaLI-3 the tokens are the hidden states output by the layer norm after the ViT encoder blocks.
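A minimal sketch of both steps, to make the tensor shapes concrete. This is not the actual PaLI-3 or HuggingFace code; the module names and the 768/1024 widths are illustrative assumptions. It shows (1) which tensor "before pooling" refers to, and (2) the linear projection and prepending from the quote above:

```python
import torch
import torch.nn as nn

vit_hidden, text_hidden = 768, 1024
num_patches, num_text = 197, 12  # 196 patch tokens + [CLS]; dummy text length

layernorm = nn.LayerNorm(vit_hidden)
pooler_dense = nn.Linear(vit_hidden, vit_hidden)

encoder_output = torch.randn(1, num_patches, vit_hidden)

# Post-layernorm hidden states: the "outputs before pooling" that
# become PaLI-3's visual tokens.
visual_tokens = layernorm(encoder_output)               # (1, 197, 768)

# HuggingFace-style ViT pooling: first ([CLS]) token -> dense -> tanh.
pooled = torch.tanh(pooler_dense(visual_tokens[:, 0]))  # (1, 768)

# PaLI-3 instead keeps ALL tokens, linearly projects them to the text
# model's width, and prepends them to the embedded text tokens.
project = nn.Linear(vit_hidden, text_hidden)
text_embeds = torch.randn(1, num_text, text_hidden)
prefix = torch.cat([project(visual_tokens), text_embeds], dim=1)  # (1, 209, 1024)
```

So the multimodal sequence is 197 projected visual tokens followed by the text tokens, and the dense-plus-tanh pooler is simply skipped.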
[1] https://github.com/huggingface/transformers/blob/main/src/tr...
Even undigitized materials aren't safe any more.