undefined | Better HN

0 pointsvidarh22d ago0 comments

Yes, it's often faster if you sit around waiting. What I will do instead is prompt the AI to create various plans, do other stuff while they do, review and approve the plans, do other stuff while multiple plans are being implemented, and then review and revise the output.

And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.

0 comments

irthomasthomas22d ago

I do this too, but then you need some method to handle it, because now you have to read and test and verify multiple work streams. It can become overwhelming. In the past week I had the following problems from parallel agents:

Gemini running an benchmark- everything ran smoothly for an hour. But on verification it had hallucinated the model used for judging, invalidating the whole run.

Another task used Opus and I manually specified the model to use. It still used the wrong model.

This type of hallucination has happened to me at least 4-5 times in the past fortnight using opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate so much.

So if you are not actively monitoring your agent and making the corrections, you need something else that is.

vidarhOP22d ago

You need a harness, yes, and you need quality gates the agent can't mess with, and that just kicks the work back with a stern message to fix the problems. Otherwise you're wasting your time reviewing incomplete work.

irthomasthomas22d ago

Here is an example where the prompt was only a few hundred tokens and the output reasoning chain was correct, but the actual function call was wrong https://x.com/xundecidability/status/2005647216741105962?s=2...

1 more reply

jmalicki22d ago

Glancing at what it's doing is part of your multitasking rounds.

Also instead of just prompting, having it write a quick summary of exactly what it will do where the AI writes a plan including class names branch names file locations specific tests etc. is helpful before I hit go, since the code outline is smaller and quicker to correct.

That takes more wall clock time per agent, but gets better results, so fewer redo steps.

irthomasthomas22d ago

1 more reply

jplusequalt22d ago

>And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it

This is exactly the sort of future I'm afraid of. Where the people who are ostensibly hired to know how stuff works, out source that understanding to their LLMs. If you don't know how the system works while building, what are you going to when it breaks? Continue to throw your LLM at it? At what point do you just outsource your entire brain?

vidarhOP20d ago

There are many layers to "knowing how stuff works". What does your manager do when your code breaks?

> Continue to throw your LLM at it?

Increasingly, yes. If you have objective acceptance criteria, just putting the LLM in a loop with a quality gate tends to have it converge on a fix itself, the same way a human would. Not always, and not always optimally, but more and more often, and with cheaper and cheaper models.

I also tend to throw in an analysis stage where it will look at what went wrong and use that to add additional criteria for the next run.

jplusequalt16d ago

Do you feel no shame shipping code without understanding how any of it works?

port1122d ago

This sounds like one recipe for burnout, much like Aderal was making everyone code faster until their brain couldn’t keep up with its own backlog.

vidarhOP20d ago

If anything, it's the opposite. With a proper harness you stop having to spend so much energy reviewing every little intermediate step, and can focus on the higher level.

I'm actually working on a project now where the biggest problem I need to solve is that the verifier that reviews the test harness is too strict.

port1119d ago

I keep being told that a proper harness makes agents better, but no one has shown me exactly what is it that gives them such amazing results.

Yesterday Gemini burned 40 minutes trying to diagnose a failed Expo build and going into loops of changing the Podfile and re-running the build, when the issue was that Xcode needed updating (quick Google search for it).

But my comment on burnout stands. The lack of downtime and dynamic thinking modes (admin, planning, review, actual coding) seems like it would conspire to make you either cram out more work or disconnect from it. Both of these become dangerous, after a while.

(Information workers were productive 4–6 hours a day, and the economy did just fine.)

j / k navigate · click thread line to collapse

0 comments

irthomasthomas22d ago

Gemini running an benchmark- everything ran smoothly for an hour. But on verification it had hallucinated the model used for judging, invalidating the whole run.

Another task used Opus and I manually specified the model to use. It still used the wrong model.

This type of hallucination has happened to me at least 4-5 times in the past fortnight using opus 4.6 and gemini-3.1-pro. GLM-5 does not seem to hallucinate so much.

So if you are not actively monitoring your agent and making the corrections, you need something else that is.

vidarhOP22d ago

irthomasthomas22d ago

1 more reply

jmalicki22d ago

Glancing at what it's doing is part of your multitasking rounds.

That takes more wall clock time per agent, but gets better results, so fewer redo steps.

irthomasthomas22d ago

1 more reply

jplusequalt22d ago

>And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it

vidarhOP20d ago

There are many layers to "knowing how stuff works". What does your manager do when your code breaks?

> Continue to throw your LLM at it?

I also tend to throw in an analysis stage where it will look at what went wrong and use that to add additional criteria for the next run.

jplusequalt16d ago

Do you feel no shame shipping code without understanding how any of it works?

port1122d ago

This sounds like one recipe for burnout, much like Aderal was making everyone code faster until their brain couldn’t keep up with its own backlog.

vidarhOP20d ago

If anything, it's the opposite. With a proper harness you stop having to spend so much energy reviewing every little intermediate step, and can focus on the higher level.

I'm actually working on a project now where the biggest problem I need to solve is that the verifier that reviews the test harness is too strict.

port1119d ago

I keep being told that a proper harness makes agents better, but no one has shown me exactly what is it that gives them such amazing results.

(Information workers were productive 4–6 hours a day, and the economy did just fine.)

j / k navigate · click thread line to collapse