You also need consistently labelled data so the model has a chance to learn the distinctions properly -- if the same feature is described differently across the training captions, the model has no stable signal to associate with it.
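As a rough illustration of what "consistently labelled" means in practice, here is a minimal sketch of a caption audit that flags attributes described with more than one phrasing. The synonym table and captions are hypothetical, just to show the idea; a real dataset would need its own vocabulary.

```python
from collections import defaultdict

# Hypothetical synonym groups -- adapt to your own tagging vocabulary.
SYNONYMS = {
    "green eyes": "green_eyes",
    "emerald eyes": "green_eyes",
    "arms crossed": "arms_crossed",
    "crossed arms": "arms_crossed",
}

def audit_captions(captions):
    """Map each canonical attribute to the raw phrasings used for it,
    keeping only attributes tagged inconsistently (more than one phrasing)."""
    seen = defaultdict(set)
    for caption in captions:
        lowered = caption.lower()
        for phrase, canonical in SYNONYMS.items():
            if phrase in lowered:
                seen[canonical].add(phrase)
    return {attr: phrasings for attr, phrasings in seen.items()
            if len(phrasings) > 1}

captions = [
    "a portrait with green eyes",
    "close-up, emerald eyes, soft light",
    "figure with arms crossed",
]
print(audit_captions(captions))
```

Here "green eyes" and "emerald eyes" get flagged as the same attribute labelled two different ways, while "arms crossed" passes because only one phrasing appears.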
I've also seen image models fail to understand context: ask for e.g. "green eyes" and the model will often place the subject on grass or a green background, pick green clothes, etc. -- i.e. it is only learning the association with the colour, not the association with that particular facial feature.
Image models are also very bad at feature shifting and don't understand how features combine -- resulting in things like multiple arms, because two of the images it is splicing together have the arms in different positions.