Yeah, my comment didn't really do a good job of making that distinction clear. Obviously the details are pretty technical, but maybe I can give a high-level explanation.
The previous systems I was talking about work something like this: "Try to find the image that most matches 'a picture of a sunset'. Do this by repeatedly updating your image to make it look more and more like a sunset." Well, what looks more like a sunset? Two sunsets! Three sunsets! But this is not normally the way images are produced - if you hire an artist to make you a picture of a bear, they don't endeavor to create the most "bear" image possible.
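To make that "repeatedly updating" loop concrete, here's a toy sketch - with the huge caveat that I've replaced the image and its encoder with a bare vector, so this is just the shape of the idea, not anything the real systems do. We do gradient ascent on cosine similarity against a fixed text embedding, which is exactly the "make it look more and more like a sunset" dynamic:

```python
import numpy as np

# Toy stand-ins: the "image" is just a vector and the "encoder" is the
# identity. Real systems differentiate through an actual image encoder.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)   # pretend this is the CLIP text encoding
image = rng.normal(size=8)      # the "image" being optimized

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def grad_cos_sim(x, t):
    # analytic gradient of cos_sim(x, t) with respect to x
    nx, nt = np.linalg.norm(x), np.linalg.norm(t)
    return t / (nx * nt) - (x @ t) * x / (nx ** 3 * nt)

before = cos_sim(image, text_emb)
for _ in range(500):
    image += 0.2 * grad_cos_sim(image, text_emb)  # "look more like a sunset"
after = cos_sim(image, text_emb)
```

The loop happily drives similarity as high as it can - and that's the problem: nothing here says "stay a realistic image," so the maximizer of the score needn't look like anything natural.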
Instead, what an artist might do is envision a bear in their head (this is loosely the job of the 'prior' - a name I agree is confusing), and then draw that particular bear image.
But why is this any different? Who cares if the vector I'm trying to draw is a 'text encoding' or an 'image encoding'? Like you say, it's all just vectors.
Take this answer with a big grain of salt, because this is just my personal intuitive understanding, but here's what I think: These encodings are produced by CLIP. CLIP has a text encoder and an image encoder. During training, you give it a text caption and a corresponding image, it encodes both, and tries to make the two encodings close. But there are many images which might accompany the caption "a picture of a bear". And conversely there are many captions which might accompany any given picture.
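For concreteness, the training objective I'm describing looks roughly like this (a minimal numpy sketch with random vectors standing in for the two encoders' outputs; batch pairing on the diagonal is the real structure, the rest is simplified):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 16
texts = rng.normal(size=(N, d))    # stand-in text encodings
images = rng.normal(size=(N, d))   # stand-in image encodings; row i pairs with row i

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(text_emb, image_emb, temperature=0.07):
    t, i = normalize(text_emb), normalize(image_emb)
    logits = (t @ i.T) / temperature       # N x N cosine similarities
    labels = np.arange(N)                  # correct match is the diagonal
    def xent(l):
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    # symmetric: match each caption to its image and each image to its caption
    return (xent(logits) + xent(logits.T)) / 2

loss_random = clip_loss(texts, images)     # unpaired random encodings
loss_matched = clip_loss(texts, texts)     # paired encodings made identical
```

Making the paired encodings close drives the loss down (`loss_matched` is far below `loss_random`) - but notice the objective only ever constrains pairs it sees, so "a picture of a bear" gets pulled toward lots of different bear images at once.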
So the text encoding of "a picture of a bear" isn't really a good target - it sort of represents an amalgamation of all the possible bear pictures. It's better to pick one bear picture (i.e. generate one image embedding that we think matches the text embedding), and then just try to draw that. Doing it this way, we aren't trying to find the maximum bear picture - which probably doesn't even look like a realistic natural image.
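Here's a toy illustration of that "amalgamation" point - again, everything here is a made-up stand-in (the "prior" is literally just "pick one concrete example"), not the real models. The text embedding ends up near the mean of all the bear-image embeddings, and the mean of a cloud of points generally isn't itself near any point of the cloud:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
# pretend these are CLIP image embeddings of 100 different bear photos
bear_images = rng.normal(size=(100, d)) + np.array([3.0] + [0.0] * (d - 1))
# a text encoding pulled toward ALL of them lands near their mean -
# an amalgamation, not any particular picture
text_emb = bear_images.mean(axis=0)

# strategy A: draw toward the text embedding directly (the "maximum bear")
target_a = text_emb
# strategy B: a "prior" (here: just pick one real example) commits to one
# concrete image embedding, and we draw toward that instead
target_b = bear_images[rng.integers(len(bear_images))]

# how far is each target from the nearest actual bear picture?
dist_a = np.linalg.norm(bear_images - target_a, axis=1).min()
dist_b = np.linalg.norm(bear_images - target_b, axis=1).min()
```

Strategy B's target sits on the "manifold" of real bear pictures by construction, while strategy A's target sits off it - which is my loose mental model for why the prior-then-decode pipeline produces more natural images.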
Like I said, this is just my personal intuition, and may very well be a load of crap.