All of the image-gen models have this problem - look at the hands and faces in the generated images of people and there are often bizarre deformations.
It's fascinating because it's the opposite of how children learn to draw. They tend to think about the pieces that make up a thing and then try to put all the pieces on paper, and they end up making a drawing that (for instance) looks nothing like a person but has two eyes, a nose, a mouth, etc. in roughly the right relation to each other. (They rarely draw ears though!) The child is thinking about "what makes a face a face" and then trying to represent it. The ML model is sort of distributing pixels in a probabilistic way that comes up with something very similar to the pixels in a sample image in its training set - superficially much better than a kid's drawing, and yet in some ways much worse upon close inspection.
Nothing really wrong with that, and it's neat anyway, but it seems to represent its own sort of strange bias.
E.g. in Stable Diffusion, from a single prompt like "modern residential interior, concrete, glass, wood", you can generate a vast number of images, using different seed values.
The images will have something in common due to the prompt.
The Tudor mansions are almost certainly there in the training data, but not being selected for.
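The prompt-vs-seed split described above can be sketched in a toy way. This is not the real Stable Diffusion pipeline (in the diffusers library you'd pass a `torch.Generator` with a `manual_seed` to the pipeline call); it's just a numpy stand-in illustrating the mechanism: the prompt conditioning is shared across runs, the seed only changes the initial noise, so the outputs differ but all get pulled toward the same prompt.

```python
import numpy as np

def generate(prompt_embedding, seed, steps=10):
    """Toy stand-in for a diffusion sampler (NOT the real model):
    the seed determines the initial latent noise, and each
    "denoising" step pulls the latent toward the prompt embedding."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(prompt_embedding.shape)  # seed-dependent start
    for _ in range(steps):
        latent = 0.8 * latent + 0.2 * prompt_embedding
    return latent

# Fixed "prompt" shared across every generation.
prompt = np.random.default_rng(0).standard_normal(512)

a = generate(prompt, seed=1)
b = generate(prompt, seed=2)
c = generate(prompt, seed=1)

print(np.allclose(a, c))  # True: same seed reproduces the same image
print(np.allclose(a, b))  # False: different seed, different image
# ...yet a and b remain strongly correlated, because after enough
# steps both are dominated by the shared prompt term:
print(np.corrcoef(a, b)[0, 1])
```

The Tudor mansions, in this picture, are regions of the model's distribution that the prompt conditioning simply never steers the noise toward.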
Are all the _this x doesn't exist_ generators based on similar scripts, only the training datasets differ, or are they somewhat tuned for the specific domain?
I'm waiting for Zoning Boards, Planning Commissions, etc. to get their hands on that tech - and (BIG ASK!) outright ban construction of anything which looks like it might have been generated by it.
It's less convincing than the other projects like this; I wonder whether that's because it's a uniquely hard problem or a weakness of this particular implementation.