For a point of reference, I run a pretty comprehensive image model comparison site that is heavily weighted toward prompt adherence.
https://genai-showdown.specr.net
EDIT: FWIW, I agree with your assessment. OpenAI's models have always been very strong at prompt adherence but visually weak (gpt-image-1 had the famous "piss filter" until they finally pushed out gpt-image-1.5).