I'd say DALL-E 2 is a little more unified - it does have multiple networks, but they're trained to work together. The previous approaches I was talking about are a lot more like the microservices analogy. Someone published a model (called CLIP) that can answer "how much does this image look like a sunset?". Someone else published a totally different model (e.g. VQGAN) that can generate images, but with no way to provide text prompts. A third person figured out a clever way to link the two up: have VQGAN make an image, ask CLIP how much it looks like a sunset, use backpropagation to adjust the image a little, and repeat until you have a sunset. Each component is its own thing, and VQGAN and CLIP don't know anything about one another.
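To make the "link the two up" loop concrete, here is a rough toy sketch. Nothing in it is the real VQGAN or CLIP API: `toy_generator` and `toy_scorer` are made-up stand-ins, the "image" is a single RGB pixel, and the "prompt" is just a target colour. I've also swapped backpropagation for a finite-difference gradient so it runs with only the standard library. The point is only the shape of the loop: generate, score, nudge, repeat.

```python
import math


def toy_generator(latent):
    """Hypothetical stand-in for VQGAN: latent vector -> 'image' (one RGB pixel).

    It knows nothing about prompts or about the scorer.
    """
    return [1.0 / (1.0 + math.exp(-z)) for z in latent]


def toy_scorer(image, target):
    """Hypothetical stand-in for CLIP: higher means 'looks more like the target'.

    It knows nothing about how the image was produced.
    """
    return -sum((p - t) ** 2 for p, t in zip(image, target))


def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient; used here in place of backpropagation."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimize(target, steps=200, lr=1.0):
    """The glue code: the only place the two components meet."""
    latent = [0.0, 0.0, 0.0]

    def objective(z):
        return toy_scorer(toy_generator(z), target)

    history = [objective(latent)]
    for _ in range(steps):
        grad = numerical_gradient(objective, latent)
        # Gradient ascent: nudge the latent so the score goes up a little.
        latent = [z + lr * g for z, g in zip(latent, grad)]
        history.append(objective(latent))
    return toy_generator(latent), history


SUNSET = [0.9, 0.4, 0.2]  # made-up "sunset orange" target
image, history = optimize(SUNSET)
print("start score:", history[0])
print("final score:", history[-1])
print("final pixel:", image)
```

Note that the generator and the scorer never reference each other; only `optimize` touches both. In this sketch (as in VQGAN+CLIP setups, as far as I know) the thing being adjusted is the generator's latent input rather than the pixels directly.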