I agree, that could explain a lot of it. I also suspect that the length of the generated code plays a role. In my experience, LLMs sometimes peter out and give up if the generated program gets too long, even when it's well within their context limit. (Here, giving up means writing a comment that says "the rest of the implementation goes here" or starting to have consistency issues.) Python and JavaScript tend to be more succinct, so that issue probably comes up less often.
Yes, you have figured it out. LLMs are terrible at graphics programming. They are much better at web development. Sonnet 3.5 is the only good model around for now. GPT-4o is very poor.