undefined | Better HN

0 pointswithinboredom27d ago0 comments

But the ideas are not 'new'. A benchmark that I use to tell me if an AI is overfitted is to present the AI with a recent paper (especially one like a paxos variant) and have it build that. If it writes general paxos instead of what the paper specified, its overfitted.

Claude 4.5: not overfitted too much -- does the right thing 6/10 times.

Claude 4.6: overfitted -- does the right thing 2/10 times.

OpenAI 5.3: overfitted -- does the right thing 3/10 times.

These aren't perfect benchmarks, but it lets me know how much babysitting I need to do.

My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.

0 comments

Kim_Bruning27d ago

Could also be that the model has stronger priors wrt Paxos (and thus has Opinions on what good Paxos should look like)

At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.

withinboredomOP27d ago

I was just using paxos as an example. Any paper will do.

j / k navigate · click thread line to collapse