More specifically, from that link:
> [...] the image is represented using 1024 tokens with a vocabulary size of 8192.
> The images are preprocessed to 256x256 resolution during training. Similar to VQVAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation.
OpenAI also provides the encoder and decoder models and their weights.
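To make the quoted numbers concrete, here's the back-of-the-envelope arithmetic (my own calculation, derived only from the quote above, not from OpenAI's code):

```python
# Shape arithmetic for the quoted DALL-E dVAE setup.
image_side = 256        # preprocessed image resolution
grid_side = 32          # side of the latent grid
vocab_size = 8192       # size of the discrete codebook

num_tokens = grid_side ** 2                   # 32*32 = 1024 tokens per image
bits_per_token = vocab_size.bit_length() - 1  # log2(8192) = 13 bits
latent_bits = num_tokens * bits_per_token     # bits in the latent code
pixel_bits = image_side ** 2 * 3 * 8          # bits in the raw 8-bit RGB image

print(num_tokens, latent_bits, pixel_bits // latent_bits)
```

So each image gets squeezed from ~1.5 Mbit of raw pixels down to ~13 kbit of latent codes, roughly a 100x compression.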
However, with the decoder model available, it's now possible to, say, train a text-encoding model that feeds into that decoder (training on an annotated image dataset, for instance) to get something close to the DALL-E demo OpenAI posted. Or something even better!
Some brilliant folks (Ryan Murdock [@advadnoun], Phil Wang [@lucidrains]) have tried to replicate OpenAI's results with projects like big-sleep [0], with decent output, but even with this improved VAE we're still a ways from DALL-E-quality results.
If anyone would like to play with the model check out either the Google Colab [1] (if you wanna run it on Google's cloud) or my site [2] (if you want a simplified UI).
[0]: https://github.com/lucidrains/big-sleep/
[1]: https://colab.research.google.com/drive/1MEWKbm-driRNF8PrU7o...
[2]: https://dank.xyz
What's DALL·E? What's a VAE?
A post to Hacker News doesn't target a specific niche community.
One of the examples shown was a chair that looks like an avocado; the model produced a number of such images.
It seems rather impressive, imo.
_____
A VAE is a "variational autoencoder". As I understand it, a VAE is a neural net with an encoder part and a decoder part: the encoder takes as input the full thing (in this context, the picture, or maybe a small square of a picture?), and the decoder's output lives in that same space. The encoder's output space is much smaller than the other side.
Err, let me rephrase that.
There's a high-dimensional space, like the space of possible pictures, but you want to reduce it to a low-dimensional space corresponding to the sort of pictures in your dataset. So you have the neural net take in a picture, map it into some low-dimensional space (this is the encoder), and then map it back to the original high-dimensional space (the decoder), training it so that the final output is as close as possible to the input that went into the encoder. Once you've gotten that working well, you can just grab random locations in the low-dimensional space, decode them, and get something that looks like a picture of the same type as your dataset?
uh, I'm simplifying as a result of not knowing all the details myself.
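To make the encode/decode idea above concrete, here's a toy plain autoencoder in numpy. It's linear and skips the "variational" part entirely, and all the sizes are made up, but it shows the same trick: squeeze data through a small bottleneck and train the round trip to reconstruct the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 64, 4  # 200 samples, 64-dim data, 4-dim latent space

# Fake data that genuinely lives near a 4-dim subspace of the 64-dim space
basis = rng.normal(size=(k, d)) / np.sqrt(k)
X = rng.normal(size=(n, k)) @ basis

W_enc = rng.normal(scale=0.1, size=(d, k))  # encoder: 64 -> 4
W_dec = rng.normal(scale=0.1, size=(k, d))  # decoder: 4 -> 64
lr = 0.01

for step in range(3000):
    Z = X @ W_enc              # encode into the small space
    X_hat = Z @ W_dec          # decode back to data space
    err = X_hat - X            # reconstruction error
    # Gradient descent on mean squared reconstruction error
    W_dec -= lr * (Z.T @ err) / n
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / n

final_loss = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

After training, `final_loss` should be far below the initial error (roughly `np.mean(X ** 2)`), meaning the 4-dim codes are enough to reconstruct the 64-dim inputs. A real VAE adds nonlinear layers plus a probabilistic latent, which is what makes sampling random latent points produce plausible outputs.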
That said, there is a complete implementation by lucidrains [1] with some results; the only missing component now is the dataset.
Note that these just demonstrate that arbitrary input images, once encoded, closely match their decoded reconstructions, which is what you'd expect from a VAE.