
0 points · rcme · 3y ago · 0 comments

I bet they use CLIP to caption the image and feed the text of the caption into GPT, but that's just a guess.

Did you check all of the samples provided? It can read an entire research paper and understand the figures just from images of the paper's pages. This seems to be a much deeper connection than extracting captions.

ionwake 3y ago

Are you sure? Sounds too epic

EMM_386 3y ago

See the real examples for yourself, starting on page 34 ... mind-blowing.

https://cdn.openai.com/papers/gpt-4.pdf

wpnbos 3y ago

It's SOTA on DocVQA [1], so yeah, it is able to read text, graphs, and tables from images.

[1] https://www.docvqa.org/

gwern 3y ago

CLIP doesn't do captioning, it just generates embeddings. And it's contrastive, so it would work poorly for this kind of task: anything 'relational' falls apart immediately. (See for example the DALL-E 2 results for these kinds of captions/tasks.)
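To illustrate the point about embeddings: a contrastive model can score how well a given caption matches an image, but it does not write a caption. The sketch below ranks a fixed list of candidate captions by cosine similarity. The vectors are made-up toy values standing in for encoder outputs, and `rank_captions` is a hypothetical helper, not part of any CLIP library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_captions(image_vec, caption_vecs):
    """Return the candidate captions sorted by similarity to the image, best first."""
    scores = {text: cosine(image_vec, vec) for text, vec in caption_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-d vectors (assumed values; real embeddings have hundreds of dimensions).
image_vec = [0.9, 0.1, 0.2]
caption_vecs = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.2],
    "a bar chart": [0.0, 0.1, 0.9],
}

ranking = rank_captions(image_vec, caption_vecs)
print(ranking[0])
```

The model can only choose among captions someone else supplied, which is why it is a poor fit for open-ended description on its own.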

It's almost certainly a VQ-VAE-style encoding of the image itself into a sequence of tokens, as was done by DALL-E 1, CM3, Gato and a whole bunch of more recent models. It's the very obvious thing to do, and their context window is more than large enough now.
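As a rough sketch of what "encoding the image into a sequence of tokens" means in the VQ-VAE style: each patch of the image is mapped to a vector, and each vector is replaced by the index of its nearest entry in a codebook. Those indices are the image tokens. Everything below is a toy assumption: the codebook and patch vectors are hand-written, whereas in a real model both come from a trained encoder, and nothing here is known to match what GPT-4 actually does.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def image_to_tokens(patch_vectors, codebook):
    """Quantize each patch vector to a discrete token id."""
    return [nearest_code(p, codebook) for p in patch_vectors]

# Hypothetical 4-entry codebook and 4 patch vectors (a real codebook has thousands of entries).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.95, 0.9]]

tokens = image_to_tokens(patches, codebook)
print(tokens)
```

Once the image is a list of integer ids, it can be placed in the same context window as text tokens, which is the property the comment relies on.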

GaggiX 3y ago

This way the model would also be able to generate images. I would also be curious how they handle images with different aspect ratios (and maybe resolutions, so it can read the text in papers well).

_hl_ 3y ago

There's no need to round-trip through text, you "just" need to train an embedding space that captures both domains.

joshvm 3y ago

You can look at Google's recent PaLM-E model for a possible approach. They use a vision transformer to tokenise the image (or to generate embeddings and then tokenise those?) and they also tokenise detected objects so the model can reason at a semantic level. Either way, it's been shown that these massive LLMs can handle images in tokenised form if you pretend it's text. In Google's case, the model is trained to look for sentinel values in the prompt (e.g. <img>) that mark where images/objects are being sent.
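The sentinel idea can be sketched as follows: walk the prompt, embed ordinary tokens as text, and wherever a sentinel appears, splice in the vectors produced for the corresponding image. This is only an illustration of the general pattern described above. `build_sequence` and `toy_text_embed` are hypothetical names, the vectors are made up, and the details of PaLM-E's real pipeline are not confirmed here.

```python
IMG = "<img>"  # sentinel marking where an image belongs in the prompt

def toy_text_embed(token):
    """Stand-in for a learned text embedding table (assumed, for illustration only)."""
    return [float(len(token)), 0.0]

def build_sequence(prompt_tokens, text_embed, image_embeddings):
    """Replace each sentinel with that image's vectors; embed everything else as text."""
    images = iter(image_embeddings)
    sequence = []
    for tok in prompt_tokens:
        if tok == IMG:
            sequence.extend(next(images))  # one image contributes several vectors
        else:
            sequence.append(text_embed(tok))
    return sequence

prompt = ["What", "is", "in", IMG, "?"]
# One image, already encoded into two vectors by some vision encoder (toy values).
image_embeddings = [[[0.5, 0.5], [0.25, 0.75]]]

sequence = build_sequence(prompt, toy_text_embed, image_embeddings)
print(len(sequence))
```

The language model then sees one uniform sequence of vectors and does not need a separate text description of the image.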

sebzim4500 3y ago

They almost certainly generate tokens directly from the image. It would be extremely hard to generate short English descriptions that describe the images well enough to pass some of those benchmarks.
