pertymcpert 4mo ago

The models are non-deterministic. You can't assume that because it did better before, it was better on average back then. And the variance is quite large.

zsoltkacsandi 4mo ago

No one talked about determinism. The first time it was able to do the task, the second time it was not. It’s not that the implementation details changed.

baq 4mo ago

This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
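A minimal sketch of that kind of repeated-trial measurement, assuming each attempt can be reduced to pass/fail. `fake_attempt` is a hypothetical stand-in for "give the model the task and check the result", and the 60% success rate is made up for illustration. The Wilson score interval is one common way to show how uncertain a success rate is after n runs.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a success rate."""
    if n == 0:
        raise ValueError("need at least one trial")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def run_trials(attempt, n):
    """Run a pass/fail attempt n times and return the success count."""
    return sum(1 for _ in range(n) if attempt())

# Hypothetical stand-in for a real model run; the 0.6 is an assumption.
rng = random.Random(0)

def fake_attempt(p_success=0.6):
    return rng.random() < p_success

n = 20
successes = run_trials(fake_attempt, n)
low, high = wilson_interval(successes, n)
print(f"{successes}/{n} succeeded, 95% interval ~ [{low:.2f}, {high:.2f}]")
```

For comparison, `wilson_interval(1, 1)` gives roughly 0.21 to 1.0, so a single success is compatible with a true success rate anywhere in that range.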

zsoltkacsandi 4mo ago

I did not say that I ran the prompt only once per attempt. When I say it failed the second time, I mean that I spent hours restarting, clearing context, and giving hints, everything I could to help the model produce something that works.

ewoodrich 4mo ago

I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long-duration tasks. Instead, they use the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs, where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.

  On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

  For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.

[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
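To make the quoted procedure concrete, here is a rough sketch of fitting a logistic curve to success versus human task length and reading off a time horizon. This is not METR's code: the task data is invented, the use of log2(minutes) as the predictor is my assumption, and plain gradient ascent stands in for whatever fitting method they actually use.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_logistic(groups, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent.

    groups: list of (human_minutes, successes, attempts).
    """
    data = [(math.log2(m), s, n) for m, s, n in groups]
    total = sum(n for _, _, n in data)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, s, n in data:
            resid = s - n * sigmoid(a + b * x)  # observed minus expected successes
            grad_a += resid
            grad_b += resid * x
        a += lr * grad_a / total
        b += lr * grad_b / total
    return a, b

def time_horizon(a, b, p=0.5):
    """Human task length (minutes) where the fitted curve crosses probability p."""
    logit = math.log(p / (1 - p))
    return 2 ** ((logit - a) / b)

# Invented data: (minutes a human expert needs, model successes, model attempts).
tasks = [(2, 10, 10), (8, 9, 10), (30, 7, 10), (60, 5, 10), (240, 1, 10), (480, 0, 10)]

a, b = fit_logistic(tasks)
print(f"50% horizon ~ {time_horizon(a, b, 0.5):.0f} human-minutes")
print(f"80% horizon ~ {time_horizon(a, b, 0.8):.0f} human-minutes")
```

Note that the horizon is expressed in human time throughout. Nothing in the fit measures how long the model itself ran, which is the distinction the comment above is drawing.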

ACCount37 4mo ago

There are many, many tasks that a given LLM can successfully do 5% of the time.

Feeling lucky?
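A quick bit of arithmetic on that 5% figure, assuming independent attempts with the same success probability (a simplification): the chance of seeing at least one success in k tries is 1 - (1 - p)^k.

```python
def p_at_least_one(p, k):
    """Probability of at least one success in k independent tries."""
    return 1 - (1 - p) ** k

for k in (1, 10, 50):
    print(k, round(p_at_least_one(0.05, k), 3))
# prints:
# 1 0.05
# 10 0.401
# 50 0.923
```

So a task the model gets right 5% of the time will still be seen working fairly often across enough retries, which is why a single success says little about the underlying rate.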
