Claude 4.5: not overfitted too much -- does the right thing 6/10 times.
Claude 4.6: overfitted -- does the right thing 2/10 times.
OpenAI 5.3: overfitted -- does the right thing 3/10 times.
These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.
My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.