0 points · column · 4mo ago · 0 comments

"[a photoshopped picture of a dog with 5 legs]...please count the legs"

Meanwhile you could benchmark something actually useful. If you're about to say "But that means it won't work for my use case of identifying a person on a live feed" or whatever, then why don't you test that? I really don't understand the kick people get out of successfully tricking LLMs on unproductive tasks with no real-world application. Just like the "how many r in strawberry?", "uh uh uh it says two urh urh".. ok, but so what? What good is a benchmark that is so far from a real use case?

tngranados · 4mo ago

The point of benchmarking that is to check for hallucinations and overfitting. Does the model actually look at the picture and count the legs, or does it just see that it's a dog and answer four because it knows dogs usually have four legs?

It's a perfectly valid benchmark and very telling.

column (OP) · 4mo ago

Very telling of what?

nsingh2 · 4mo ago

Telling of where the boundary of competence is for these models. It also shows that these models aren't doing what most people expect them to be doing: they aren't counting legs, and may instead be inferring the answer from the overall image (dogs usually have 4 legs), to the detriment of fine-grained or out-of-distribution tasks.
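
To make the distinction above concrete, here is a minimal sketch of how such a test could be scored. It separates "the model counted correctly" from "the model fell back on the usual answer". Everything in it is an assumption for illustration: the `Item` fields, the `score` function, and the sample answers are made up, not taken from any real benchmark or model run.

```python
from dataclasses import dataclass


@dataclass
class Item:
    true_count: int    # legs actually visible in the edited image
    prior_count: int   # what the animal usually has (4 for a dog)
    model_answer: int  # what the model replied (hypothetical values below)


def score(items):
    """Return accuracy and the rate of wrong answers that match the prior."""
    if not items:
        raise ValueError("need at least one item")
    n = len(items)
    correct = sum(i.model_answer == i.true_count for i in items)
    # A "prior hit" is a wrong answer that equals the typical value,
    # which suggests the model answered from memory, not from the image.
    prior_hits = sum(
        i.model_answer == i.prior_count and i.model_answer != i.true_count
        for i in items
    )
    return {"accuracy": correct / n, "prior_rate": prior_hits / n}


# Made-up answers for four edited dog pictures.
items = [
    Item(true_count=5, prior_count=4, model_answer=4),
    Item(true_count=5, prior_count=4, model_answer=5),
    Item(true_count=3, prior_count=4, model_answer=4),
    Item(true_count=5, prior_count=4, model_answer=6),
]
result = score(items)
print(result)
```

A high `prior_rate` next to a low `accuracy` would point to the failure mode described in this thread: most of the errors are the "usual" answer rather than random miscounts.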
