dekhn 4mo ago

Using a single custom benchmark as a metric seems pretty unreliable to me.

Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.

prodigycorp 4mo ago

After taking a walk for a bit, I decided you’re right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I’ve run.

This probably means my test is a little too niche. The fact that it didn’t pass one of my tests doesn’t speak to the broader intelligence of the model per se.

While I still believe in the importance of a personalized suite of benchmarks, my python one needs to be down weighted or supplanted.

My bad to the Google team for the cursory brush-off.

chermi 4mo ago

Walks are magical. But also this reads partially like you got sent to a reeducation camp lol.

nomel 4mo ago

> This probably means my test is a little too niche.

> my python one needs to be down weighted or supplanted.

To me, this just proves your original statement. You can't know if an AI can do your specific task based on benchmarks. They are relatively meaningless. You must just try.

I have AI fail spectacularly, often, because I'm in a niche field. To me, in the context of AI, "niche" is "most of the code for this is proprietary/not in public repos, so statistically sparse".

relaytheurgency 4mo ago

I feel similarly. If you're working with some relatively niche APIs on services that don't get seen by the public, the AI isn't one-shotting anything. But I still find it helpful to generate some crap that I can then feel good about fixing.

agentcoops 4mo ago

I definitely agree on the importance of personalized benchmarks for really feeling when, where and how much progress is occurring. The standard benchmarks are important, but it’s hard to really feel what a 5% improvement in X exam means beyond hype. I have a few projects across domains that I’ve been working on since ChatGPT 3 launched and I quickly give them a try on each new model release. Despite popular opinion, I could really tell a huge difference between GPT 4 and 5, but nothing compared to the current delta between 5.1 and Gemini 3 Pro…

TL;DR: I don’t think personal benchmarks should replace the official ones, of course, but I think the former are invaluable for building your intuition about the rate of AI progress beyond hype.

lofaszvanitt 4mo ago

No, do not share it. The bigger the black hole these models are in, the better.