Better HN

jillesvangurp 5h ago

Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread.

Many AI models seem biased toward cutting corners by default when generating code, even when you ask them not to. But a few simple follow-up prompts can address that. Simply ask it to cover corner cases with tests, test all the known non-happy paths, look for weaknesses, verify adherence to SOLID principles, do security audits, etc. It will find issues. With bigger projects, you can actually make it file those issues in gh with labels and priorities. And then you can make it iterate on fixing issues with separate PRs.
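As a rough illustration of the "file those issues in gh with labels and priorities" step, here is a minimal sketch that turns one finding into a `gh issue create` command. The finding itself and the label names (`security`, `priority:high`) are made up for the example and are assumed to already exist in the repository. The command is only built and printed, not executed.

```python
import shlex


def gh_issue_command(title, body, labels):
    """Build the argument list for `gh issue create` with one --label per label."""
    cmd = ["gh", "issue", "create", "--title", title, "--body", body]
    for label in labels:
        cmd += ["--label", label]
    return cmd


# Hypothetical finding, as an agent might report it after an audit pass.
finding = {
    "title": "Unvalidated input in upload handler",
    "body": "The upload handler trusts the client-supplied file name.",
    "labels": ["security", "priority:high"],  # assumed labels, not from the thread
}

cmd = gh_issue_command(finding["title"], finding["body"], finding["labels"])
print(shlex.join(cmd))  # shell-quoted form, handy for reviewing before running
```

Printing first and running later keeps a human in the loop on what actually gets filed, which matters if the agent's priority assessment is off.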

On a recent project, I made it implement a simple benchmark test for measuring throughput. I had a hunch it was doing very suboptimal things. I then asked it to look for potential performance bottlenecks and use the benchmark to verify improvements. At that point I already had a lot of end-to-end tests to verify correctness. So, these performance tweaks were relatively low risk. I got about two orders of magnitude improvement and a lot more graceful behavior when pushed to the limit.
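A throughput benchmark of this kind can be very small. The sketch below is not the benchmark from that project; the workload (list lookup versus set lookup) and all names are invented to show the shape: measure items per second before a change, measure again after, and compare.

```python
import time


def measure_throughput(fn, items, repeats=3):
    """Return the best observed throughput in items per second."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            fn(item)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(items) / elapsed)
    return best


# Hypothetical workload: the same membership check backed by a list and by a set.
haystack_list = list(range(2000))
haystack_set = set(haystack_list)


def slow_lookup(x):
    return x in haystack_list  # linear scan on every call


def fast_lookup(x):
    return x in haystack_set  # hash lookup on every call


items = list(range(2000))
before = measure_throughput(slow_lookup, items)
after = measure_throughput(fast_lookup, items)
print(f"before: {before:,.0f} items/s, after: {after:,.0f} items/s")
print(f"speedup: {after / before:.1f}x")
```

Taking the best of a few repeats reduces noise from whatever else the machine is doing. The numbers only mean something if the end-to-end tests still pass after each tweak.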

If you have a bit of experience engineering systems, just treat these tools like they are junior developers. Competent but likely to skip some essential steps. So, just double check with a lot of pointed questions: "did you do X? If not, do it now". Anything that needs repeated asking, turn it into a guard rail / skill.

There's a bit of effort and skill involved with this. I imagine a lot of less experienced developers might struggle to get good results because they aren't asking for the right things.

bonoboTP 3h ago

My problem is that it "finds issues" all the time and it never really ends. You go through the list, decide how to handle each item, and give it back to the AI. It makes the changes, you ask for issues again, and now there are new issues, in part due to the previous fixes. Then you assess each issue again. It's often valid, but you have to ask yourself whether it's worth fixing right now and whether the fix is worth the complexity for a super rare edge case, depending on the type of program you're making. And often the AI's assessment of what's high or low priority is not great.

So to me this loop really never properly ends so it never feels like I'm done. Which is not great from a psychological point of view.

KronisLV 5h ago

> Regardless of what model you use, agentic coding tools are indeed pretty good at finding issues if you target them a bit. And they have no respect for their own code or any sense of shame. So, you can just point them at their own code with a new thread. Many AI models seem biased toward cutting corners by default when generating code, even when you ask them not to. But a few simple follow-up prompts can address that.

That's more or less all of them: they just generate likely combinations of tokens, and there is no critical thought involved. If you want to approximate that, review iterations are probably the right way to go about it, and without the full conversation context, so there's no model output like "I'm doing X because it seems like the correct way to go about Y." but rather a fresh context which allows for more critical predictions.

Here's what works for me, can be made into a skill in whatever you use:

  I would like you to do a review loop!
  
  How this works:
  * once implementation is done, all tools must be run and pass: whatever is configured in the project like Ruff, Oxlint and Oxfmt, depending on the tech stack (also don't run such tools directly, look at package.json or similar project files/configurations/run scripts first; like if it's a stack that has compilation, compile the app, if there are tests, then run those; just know that you DO NOT generally need to stand up the whole app); if there is a projectlint-rules folder then that means you probably should run ProjectLint as well (local tool, use projectlint --help or projectlint --docs, or better yet, look at whether package.json or README.md have any instructions on how to run it)
  * once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code (not each having a different sub-section) and looking for CRITICAL/SERIOUS issues (not nitpicks), with the goal of not missing anything and building consensus
  * whatever CRITICAL/SERIOUS issues are found, if you can confirm that they're real and not false positives, you will then fix and remember to run the tools after, after which you will do another review iteration, followed by a fix iteration if needed and so on
  * remember that the review and fix loop must END with an iteration of the review agents returning that there are no CRITICAL/SERIOUS issues - you cannot just do fixes and say that there is nothing remaining yourself (and also remember that the reviews are done when all of the tools pass, like when the code is linted and formatted etc.)
  * at the end, produce a summary post that has a table, the rows being iterations, the columns for each of the agents (A, B, C) showing FIX/OK and then a column called Iteration summary; the goal for this is to show a summary how many iterations it took and what was fixed, you can also include text alongside the table as normally

The ProjectLint references might need to be removed (replace with whatever higher level linting/architecture tools you have, if any), but that's the overall idea. It does use a LOT of tokens, but there's almost always something to fix. Of course, the problem is that sometimes there will be nitpicks or the fixes themselves won't be fully okay, though in general this trends towards slightly better code, even with something like Opus 4.7.
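To make the control flow of that prompt concrete, here is a toy sketch of the loop: tools must pass, three reviewers look at the same code, and the loop only ends on a clean review. The reviewers, the fixer and the tool runner are plain stand-in functions, not real agent calls. The `max_iterations` cap is an added assumption that the prompt itself does not have.

```python
def review_loop(code, reviewers, fix, run_tools, max_iterations=5):
    """Alternate review and fix passes until every reviewer reports no issues."""
    history = []  # one row per iteration: "FIX" or "OK" for each reviewer
    for _ in range(max_iterations):
        if not run_tools(code):
            raise RuntimeError("tools must pass before a review iteration")
        findings = [reviewer(code) for reviewer in reviewers]
        history.append(["FIX" if found else "OK" for found in findings])
        issues = sorted({issue for found in findings for issue in found})
        if not issues:
            return code, history  # ends on a clean review, never on a fix
        code = fix(code, issues)
    raise RuntimeError("no clean review within max_iterations")


# Stand-in reviewers: each returns a list of serious issues for the given code.
def reviewer_a(code):
    return ["unfinished TODO"] if "TODO" in code else []


def reviewer_b(code):
    return ["uses eval"] if "eval(" in code else []


def reviewer_c(code):
    return reviewer_a(code) + reviewer_b(code)


def toy_fix(code, issues):
    # Stand-in for the fixing agent; a real one would act on `issues`.
    return code.replace("TODO", "done").replace("eval(", "int(")


final_code, history = review_loop(
    "x = eval(s)  # TODO",
    [reviewer_a, reviewer_b, reviewer_c],
    toy_fix,
    run_tools=lambda code: True,  # pretend linters and tests pass
)
for number, row in enumerate(history, start=1):
    print(number, row)
```

The `history` rows map directly onto the summary table the prompt asks for: one row per iteration, one column per reviewer.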

jillesvangurp (OP) 4h ago

This can backfire a bit on token usage, where it gets a bit too trigger-happy running expensive things for trivial changes. I tend not to use sub-agents for this reason. I actually manage to cover most of my needs on the $20/month Codex subscription. I might switch to the $200 plan at some point. But right now I just need to be economical, as our company is fairly resource constrained. That's also why I prefer Codex over Claude Code. It seems to get the job done for less money. Another advantage is that it seems to have less need to have things like this spelled out in this level of detail.

Another thing is that unless you are doing really complicated stuff, you probably don't need the latest models running on high. I'm still on 5.4 medium with codex. I see very little reason to change that.

Part of agentic engineering is figuring out how to be economical with tokens and time. You can sacrifice one for the other of course. But there are diminishing returns as well where spending 10x more doesn't actually get you 10x more quality/results.

KronisLV 2h ago

I just have the Anthropic 100 USD Max plan and it's enough for daily work. I sometimes do hit the 5 hour limits by midday, but the weekly ones usually cap out at around 80% or thereabouts, even with this approach. I usually use xhigh, sometimes max. Both still result in situations where I need to intervene plenty, even on use cases that aren't that complex (some LLM stuff, mostly web based CRUD, some light data processing, integrations with Jira and GitLab, processing PDFs and so on, sometimes ML training and geospatial work, like the Sentinel-2 satellite data, nothing crazy).

If I had to pay per token, I'd probably look at DeepSeek. In general it feels like it's a bit early for the technology: either our software methods are wasteful, or the hardware hasn't caught up. To me, it appears that we often need to throw more tokens at these problems, not fewer, since otherwise it's just one-shot slop.

esperent 5h ago

> once all the code seems okay, you will run THREE parallel sub-agents for code review: each looking at ALL changed code

I did some evals with a prompt like this when I had some subscription tokens to burn, a few months ago. I think using Opus 4.5. What I found was:

1. Running two subagents was somewhat useful

2. Running three started to get redundant

3. Any more than three was pointless (at least when using the same model)

However, even two were getting like 60% the same results.

Much, much more effective was splitting out into audits through different lenses:

* One looking for security issues

* One looking for whether the task was completed successfully

* One looking for performance issues

* One looking for contract/maintainability issues

* One looking at test coverage

Etc.
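A minimal sketch of the "different lenses" split, assuming each sub-agent simply gets its own prompt. The prompt wording, the dictionary keys and the sample diff are invented for illustration; only the five lenses come from the list above.

```python
# One review focus per sub-agent, taken from the lenses listed above.
LENSES = {
    "security": "security issues",
    "completeness": "whether the task was completed successfully",
    "performance": "performance issues",
    "maintainability": "contract/maintainability issues",
    "tests": "test coverage",
}


def build_review_prompts(diff, lenses=LENSES):
    """Return one review prompt per lens, all looking at the same diff."""
    return {
        name: (
            f"You are reviewing a code change. Review focus: {focus}. "
            "Report only CRITICAL/SERIOUS issues for this focus.\n\n"
            f"{diff}"
        )
        for name, focus in lenses.items()
    }


# Hypothetical one-line diff standing in for the real changed code.
prompts = build_review_prompts("+ average = sum(values) / len(values)")
for name, prompt in prompts.items():
    print(f"[{name}] {prompt.splitlines()[0]}")
```

Every reviewer still sees all of the changed code; only the question it is asked differs, which is what should cut down the overlap between results.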

KronisLV 2h ago

You can get reasonably close with fewer; however, more agents give a better signal: e.g. if 3/3 flag something as an issue, the outer agent that orchestrates them can treat it as something to give more attention to, whereas if it's just 1/3, it probably needs more scrutiny before acting on it. Ofc more votes doesn't always imply right.
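The agreement signal can be sketched as a simple tally. This assumes the orchestrator has already normalized each finding to a shared identifier (real agents will phrase the same issue differently, so that step is the hard part). The issue names below are invented.

```python
from collections import Counter


def tally_findings(findings_per_agent):
    """Count how many agents flagged each issue, most-agreed issues first."""
    total = len(findings_per_agent)
    counts = Counter(
        issue for findings in findings_per_agent for issue in set(findings)
    )
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(issue, f"{votes}/{total}") for issue, votes in ranked]


# Hypothetical findings from three review agents looking at the same change.
agent_findings = [
    ["sql-injection", "missing-timeout"],
    ["sql-injection"],
    ["sql-injection", "unbounded-cache"],
]

for issue, votes in tally_findings(agent_findings):
    print(votes, issue)
```

An issue at 3/3 goes to the top of the list, while 1/3 items are candidates for a false-positive check before any fix is attempted.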
