"Finally, for each individual task (benchmark), we fine-tune the PaLI-3 model with frozen ViT image encoder on the task's training data as described in the corresponding section. For most tasks, we fine-tune the 812×812 resolution checkpoint, but for two document understanding tasks, we go up to 1064×1064 resolution"
So this is comparing a smaller model finetuned per benchmark to larger models that I presume are not, though I have not read the PaLI-X paper.
Edit - No, I was wrong, PaLI-X is also fine-tuned before each task/set of tasks.
Impressive improvement!!!
It's getting really awkward seeing these papers from Google. "We're here too! We're totally not woefully behind everyone else in the field!". No model, no reasonable comparisons, just generic bragging.
I'm astounded as an ML researcher at how Google can be doing so incredibly badly. They have access to unlimited compute, good people, and great infrastructure. Yet something about their internal culture means they are unable to compete with OpenAI, Facebook, and even the open-source community. They constantly brag about how good their models are (even in private), and then every time they deploy anything its performance is pathetic (like Bard and Bard with vision).
You can see why Google recently totally overhauled the leadership of Google Research/DeepMind and shut down Google Brain.
Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
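For what it's worth, there is already a common trick for "vision output as text": quantizing coordinates into discrete tokens, in the spirit of Pix2Seq-style detection. A minimal sketch (the bin count and `<loc_…>` token format here are my own illustrative assumptions, not any particular model's vocabulary):

```python
def box_to_tokens(box, num_bins=1000):
    """Map a normalized (x0, y0, x1, y1) box in [0, 1] to coordinate tokens.

    Each coordinate is quantized into one of num_bins discrete bins and
    rendered as a pseudo-token a language model could emit.
    """
    return [f"<loc_{min(int(c * num_bins), num_bins - 1)}>" for c in box]

tokens = box_to_tokens((0.1, 0.25, 0.5, 0.75))
# -> ['<loc_100>', '<loc_250>', '<loc_500>', '<loc_750>']
```

A dense per-pixel mask is a much longer sequence than four box tokens, which is exactly where the skepticism above seems warranted.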
"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."
I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooling layer involves taking the first token output from the layer norm and running it through a dense layer. So, it looks like in PaLI-3 the tokens are the hidden states output by the layer norm after the ViT encoder blocks.
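A minimal sketch of both steps, to make the tensor shapes concrete. This is not the actual PaLI-3 or HuggingFace code; the module names and the 768/1024 widths are illustrative assumptions. It shows (1) which tensor "before pooling" refers to, and (2) the linear projection and prepending from the quote above:

```python
import torch
import torch.nn as nn

vit_hidden, text_hidden = 768, 1024
num_patches, num_text = 197, 12  # 196 patch tokens + [CLS]; dummy text length

layernorm = nn.LayerNorm(vit_hidden)
pooler_dense = nn.Linear(vit_hidden, vit_hidden)

encoder_output = torch.randn(1, num_patches, vit_hidden)

# Post-layernorm hidden states: the "outputs before pooling" that
# become PaLI-3's visual tokens.
visual_tokens = layernorm(encoder_output)               # (1, 197, 768)

# HuggingFace-style ViT pooling: first ([CLS]) token -> dense -> tanh.
pooled = torch.tanh(pooler_dense(visual_tokens[:, 0]))  # (1, 768)

# PaLI-3 instead keeps ALL tokens, linearly projects them to the text
# model's width, and prepends them to the embedded text tokens.
project = nn.Linear(vit_hidden, text_hidden)
text_embeds = torch.randn(1, num_text, text_hidden)
prefix = torch.cat([project(visual_tokens), text_embeds], dim=1)  # (1, 209, 1024)
```

So the multimodal sequence is 197 projected visual tokens followed by the text tokens, and the dense-plus-tanh pooler is simply skipped.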
[1] https://github.com/huggingface/transformers/blob/main/src/tr...
Even undigitized materials aren't safe any more.