0 points · jessermejia · 1mo ago · 0 comments

There's an interesting analysis here: https://github.com/anthropics/claude-code/issues/42796

> The most striking row is user prompts: 5,608 in February vs 5,701 in March. The human put in the same effort. But the model consumed 80x more API requests and 64x more output tokens to produce demonstrably worse results.

2 comments · 1 top-level

mh- · 1mo ago · 1 in thread

Sorry, "this" referred to the parent comment's claim.

> models starting becoming "moody" due to their proprietors arbitrarily modifying their performance capabilities

The tokenizer changes are measurable; the above is quite difficult to quantify.

A few sites floating around purport to quantify it, but all of them have fatal flaws in their methodology.

fragmede · 1mo ago

Unfortunately, LLM performance isn't an exact science, and some observations are going to be subjective, like ChatGPT being "lazy" in the winter. Wanting to form opinions based on hard data, aka science, and not vibes is entirely reasonable, but that doesn't make the vibes a figment of the imagination. Or as Jeff Bezos put it, "When the data and the anecdotes disagree, the anecdotes are usually right." And while he's not a scientist, his success does put some weight behind that quote. (As does digging deeper into what he meant by that.)
