>The most striking row is user prompts: 5,608 in February vs 5,701 in March. The human put in the same effort. But the model consumed 80x more API requests and 64x more output tokens to produce demonstrably worse results.
Unfortunately, LLM performance isn't an exact science, and some observations are going to be subjective, like ChatGPT being "lazy" in the winter. Wanting to form opinions based on hard data (aka science) rather than vibes is entirely reasonable, but that doesn't make the vibes a figment of the imagination. Or as Jeff Bezos put it, "When the data and the anecdotes disagree, the anecdotes are usually right." And while he's not a scientist, his success does put some weight behind that quote (as does digging deeper into what he meant by it).