I recently compared how a few models reviewed a class I wrote for a side project: a data processor with quite horrible temporal coupling.
Gemini - ends up rating it a 7/10, with some small bits of feedback, etc.
Claude - brutal dismemberment of how awful the naming conventions, structure, coupling, etc. are; provides examples of how this will mess me up in the future and gives a few citations to Python documentation I should re-read.
ChatGPT - you're a beautiful developer who can never do anything wrong, you're the best developer that's ever existed, and this class is the most perfect class I've ever seen.
I haven't looked back. I just use Claude at home and ChatGPT at work (no Claude). ChatGPT at work is much worse than Claude in my experience.
I've noticed ChatGPT is rather high in its praise regardless of how valuable the input is, Gemini is less placating but still largely influenced by the perspective of the prompter, and Claude feels the most "honest" - but humans are rather poor at judging this sort of thing.
Does anyone know if "sycophancy" has documented benchmarks the models are compared against? Maybe it's subjective and hard to measure, but given the issues with GPT-4o, this seems like a good thing to track: both to compare an individual company's changes over time and to compare across companies.
Claude also seems a lot better at picking up on what's going on. If you're focused on tasks, then yeah, it's going to know you want quick answers rather than detailed essays. Could be part of it.
Here's what I use:
WE ARE PROFESSIONALS. DO NOT FLATTER ME. BE BLUNT AND FORTHRIGHT.