
lmeyerov 4mo ago

Curious what kinds of evals you focus on?

We're finding investigation to be same-but-different from coding. Probably the closest area to ours that has a bigger evals community is AI SRE tasks.

Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists and, as the talk gives examples of, which kinds of strategies to use with them.


veselin 4mo ago

I am talking about SWE-bench style problems, where a todo list doesn't help, except for more parallelism.

lmeyerov (OP) 4mo ago

Was guessing that; coding tasks are a valuable but myopic lens :)

I'm guessing a self-updating plan is sufficient there. I'm not actually convinced today's plan <> todolist flow makes sense - in the linked PLAN.md, it gets unified, and that's how we do AI coding. I don't have evals on this, but from a year of vibe coding/engineering, that's what we arrived at through experience across frontier coding models & tools. Nowadays we're mixing in evals too, but that's a more complicated story.
