Yeah, my comment didn't really do a good job of making that distinction clear. Obviously the details are pretty technical, but maybe I can give a high-level explanation.
The previous systems I was talking about work something like this: "Try to find the image that most matches 'a picture of a sunset'. Do this by repeatedly updating your image to make it look more and more like a sunset." Well, what looks more like a sunset? Two sunsets! Three sunsets! But this is not normally the way images are produced - if you hire an artist to make you a picture of a bear, they don't endeavor to create the most "bear" image possible.
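To make that "repeatedly updating" loop concrete, here's a toy sketch - with the huge caveat that I've replaced the image and its encoder with a bare vector, so this is just the shape of the idea, not anything the real systems do. We do gradient ascent on cosine similarity against a fixed text embedding, which is exactly the "make it look more and more like a sunset" dynamic:

```python
import numpy as np

# Toy stand-ins: the "image" is just a vector and the "encoder" is the
# identity. Real systems differentiate through an actual image encoder.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)   # pretend this is the CLIP text encoding
image = rng.normal(size=8)      # the "image" being optimized

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def grad_cos_sim(x, t):
    # analytic gradient of cos_sim(x, t) with respect to x
    nx, nt = np.linalg.norm(x), np.linalg.norm(t)
    return t / (nx * nt) - (x @ t) * x / (nx ** 3 * nt)

before = cos_sim(image, text_emb)
for _ in range(500):
    image += 0.2 * grad_cos_sim(image, text_emb)  # "look more like a sunset"
after = cos_sim(image, text_emb)
```

The loop happily drives similarity as high as it can - and that's the problem: nothing here says "stay a realistic image," so the maximizer of the score needn't look like anything natural.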
Instead, what an artist might do is envision a bear in their head (this is loosely the job of the 'prior' - a name I agree is confusing), and then draw that particular bear image.
But why is this any different? Who cares if the vector I'm trying to draw is a 'text encoding' or an 'image encoding'? Like you say, it's all just vectors.
Take this answer with a big grain of salt, because this is just my personal intuitive understanding, but here's what I think: These encodings are produced by CLIP. CLIP has a text encoder and an image encoder. During training, you give it a text caption and a corresponding image, it encodes both, and tries to make the two encodings close. But there are many images which might accompany the caption "a picture of a bear". And conversely there are many captions which might accompany any given picture.
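For concreteness, the training objective I'm describing looks roughly like this (a minimal numpy sketch with random vectors standing in for the two encoders' outputs; batch pairing on the diagonal is the real structure, the rest is simplified):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 16
texts = rng.normal(size=(N, d))    # stand-in text encodings
images = rng.normal(size=(N, d))   # stand-in image encodings; row i pairs with row i

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(text_emb, image_emb, temperature=0.07):
    t, i = normalize(text_emb), normalize(image_emb)
    logits = (t @ i.T) / temperature       # N x N cosine similarities
    labels = np.arange(N)                  # correct match is the diagonal
    def xent(l):
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    # symmetric: match each caption to its image and each image to its caption
    return (xent(logits) + xent(logits.T)) / 2

loss_random = clip_loss(texts, images)     # unpaired random encodings
loss_matched = clip_loss(texts, texts)     # paired encodings made identical
```

Making the paired encodings close drives the loss down (`loss_matched` is far below `loss_random`) - but notice the objective only ever constrains pairs it sees, so "a picture of a bear" gets pulled toward lots of different bear images at once.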
So the text encoding of "a picture of a bear" isn't really a good target - it sort of represents an amalgamation of all the possible bear pictures. It's better to pick one bear picture (i.e. generate one image embedding that we think matches the text embedding), and then just try to draw that. Doing it this way, we aren't trying to find the maximum bear picture - which probably doesn't even look like a realistic natural image.
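Here's a toy illustration of that "amalgamation" point - again, everything here is a made-up stand-in (the "prior" is literally just "pick one concrete example"), not the real models. The text embedding ends up near the mean of all the bear-image embeddings, and the mean of a cloud of points generally isn't itself near any point of the cloud:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
# pretend these are CLIP image embeddings of 100 different bear photos
bear_images = rng.normal(size=(100, d)) + np.array([3.0] + [0.0] * (d - 1))
# a text encoding pulled toward ALL of them lands near their mean -
# an amalgamation, not any particular picture
text_emb = bear_images.mean(axis=0)

# strategy A: draw toward the text embedding directly (the "maximum bear")
target_a = text_emb
# strategy B: a "prior" (here: just pick one real example) commits to one
# concrete image embedding, and we draw toward that instead
target_b = bear_images[rng.integers(len(bear_images))]

# how far is each target from the nearest actual bear picture?
dist_a = np.linalg.norm(bear_images - target_a, axis=1).min()
dist_b = np.linalg.norm(bear_images - target_b, axis=1).min()
```

Strategy B's target sits on the "manifold" of real bear pictures by construction, while strategy A's target sits off it - which is my loose mental model for why the prior-then-decode pipeline produces more natural images.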
Like I said, this is just my personal intuition, and may very well be a load of crap.