One of the biggest struggles I have on my team is coworkers straight up vibing parts of the code and not understanding or guiding the architecture of subsystems. Or at least, not writing code in a way that is meant to be understood by others.
Then when I go through the code and provide extensive feedback (mostly architectural, highlighting odd inconsistencies in the additions) I'm met with a lot of pushback because "it works, why change it?" Not to mention the sheer size of PRs ballooning in recent months.
The end result is me being the bottleneck because I can't keep up with the "pace" of code being generated, and feeling a lot of discomfort and pressure to lower my standards.
I've thought about using a code review agent to review and act as my proxy, but not being able to control the exact output worries me. And I don't like the lack of human touch it provides. Maybe someone has advice on a humane way to handle this problem.
If you accelerate the pace of code creation it inevitably creates bottlenecks elsewhere. Code review is by far the biggest of those right now.
There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?
But I don't want to lower my standards! I want the code I'm producing with coding agents to be better than the code I would produce without them.
There are some aspects of code review that you cannot skimp on. Things like coding standards may not matter as much, but security review will never be optional.
I've recently been wondering what we can learn from security teams at large companies. Once you have dozens or hundreds of teams shipping features at the same time - teams with varying levels of experience - you can no longer trust those teams not to make mistakes. I expect that the same strategies used by security teams at Facebook/Google-scale organizations could now be relevant to smaller organizations where coding agents are responsible for increasing amounts of code.
Generally though I think this is very much an unsolved problem. I hope to document the effective patterns for this as they emerge.
If that's true, then I would think the emphasis in code review should be more on test quality and verifying that the spec is captured accurately, and as you suggest, the actual implementation is less important.
Agree with everything else you said except this. In my opinion, this assumes code becomes more like a consumable as code-production costs reduce. But I don't think that's the case. Incorrect, but not visibly incorrect, code will sit in place for years.
I don't care how cheap it is to replace the incorrect code when it's modifying my bank account or keeping my lights on.
I.e. have a `planning/designs/unbuilt/...` folder that contains markdown descriptions of features/changes that would have gotten a PR. Now do the review at the design level.
Weirdly this article doesn't really talk about the main agentic pattern
- Plan (really important to start with a plan before code changes). Iteratively build a plan to implement something. You can also have a collective review of the plan: make sure it's what you want, and that there is guidance about how it should be implemented in terms of architecture (it should also be pulling in pre-existing context about your architecture and coding standards) and what testing should be built. Make sure the agent reviews the plan; ask the agent to make suggestions and ask questions
- Execute. Make the agent (or multiple agents) execute on the plan
- Test / Fix cycle
- Code Review / Refactor
- Generate Test Guidance for QA
Then your deliverables are Code / Feature context documentation / Test Guidance + evolving your global/project context
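The staged workflow above can be sketched as a gated loop. This is a hypothetical sketch, not any particular agent's API: `run_agent` stands in for whatever coding-agent call you use, and the human plan-review gate is the `approve_plan` callback.

```python
# Hypothetical sketch of the plan -> execute -> test/fix -> review -> QA
# workflow. `run_agent` and `approve_plan` are stand-ins for your real
# agent API and your human plan-review step.

STAGES = ["plan", "execute", "test_fix", "review_refactor", "qa_guidance"]

def run_pipeline(task, run_agent, approve_plan):
    """Drive one task through the staged workflow, gating on plan approval."""
    plan = run_agent(f"Write an implementation plan for: {task}. "
                     "Follow existing architecture/coding standards. "
                     "Ask questions about anything ambiguous.")
    if not approve_plan(plan):          # human gate: collective plan review
        raise RuntimeError("Plan rejected; revise before any code changes")
    artifacts = {"plan": plan}
    for stage in STAGES[1:]:            # execute, test/fix, review, QA docs
        artifacts[stage] = run_agent(f"Stage '{stage}' for plan:\n{plan}")
    return artifacts                    # code + feature docs + test guidance
```

The point of structuring it this way is that the plan is the only artifact a human must fully read before any code exists.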
That shifts where rigor needs to live. The article focuses on planning patterns before code generation, which matters. But I'd argue the merge gate is equally important and massively underinvested. Right now the merge decision for most teams is: one person clicks Approve after a quick scan. That's the same process whether the PR is a trivial config change or a critical auth refactor, whether it came from a trusted agent or an unknown one.
The teams I've seen handle this well invest in proportional review. Not every change gets the same scrutiny. They define risk dimensions (what files changed, what agent generated it, how complex is the diff) and route PRs to different review intensities based on that score. The planning patterns in the article are upstream. But the merge governance pattern is downstream, and it's where most of the production risk actually lives.
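A minimal sketch of that risk-scoring idea, with entirely made-up weights, thresholds, and sensitive-path names; the real dimensions would come from your own incident history:

```python
# Sketch of "proportional review": score a PR on a few risk dimensions
# and route it to a review intensity. All weights/paths are assumptions.

SENSITIVE_PATHS = ("auth/", "billing/", "migrations/")

def risk_score(files_changed, diff_lines, agent_trusted):
    score = 0
    score += 3 * sum(f.startswith(SENSITIVE_PATHS) for f in files_changed)
    score += diff_lines // 200          # big diffs get more scrutiny
    score += 0 if agent_trusted else 2  # unknown agents start riskier
    return score

def review_tier(score):
    if score >= 5:
        return "two-human review + security sign-off"
    if score >= 2:
        return "one human reviewer, full read"
    return "automated checks + spot check"
```

So a 50-line agent-generated change touching `auth/` routes to the heaviest tier, while a trusted agent's docs tweak goes straight to automated checks.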
To the specification debate: I've found that detailed specs help less when you have good merge governance, because bad output gets caught and rejected at the merge gate rather than requiring perfect input at the spec stage.
The problem is Claude Code has a planning mode baked in, which works really well but is quite custom to how Claude Code likes to do things.
When I describe it as a pattern I want to stretch a little beyond the current default implementation in one of the most popular coding agents.
Yea, a big part of my planning has included what verification steps will be necessary along the way or at the end. No plan gets executed without that and I often ask for specific focus on this aspect in plan mode.
Always, even before all this madness. It sounds more like a function of these teams' CR process than of agents writing the code. Sometimes super large PRs are necessary, and I've always requested a 30-minute meeting to discuss.
I don't see this as an issue, just noise. Reduce the PR footprint. If not possible meet with the engineer(s)
When shipping pressure comes, I've seen this be the first thing to go. Despite formalizing ownership standards, etc., people on both the submitting and reviewing end just give up on understanding AI slop when management says they need to hit a deadline.
Probably no company would actually do this, but I wonder if we should actively test the submitter's understanding of the submitted code somehow, as a prerequisite to moving a PR to ready-for-review. I'm not sure forcing people to understand the code will actually be helpful, but maybe at least we'll put the cultural expectation front and center?
Now, however, we know how that played out in the case of assembly language. The fact of the matter is that only a very tiny fraction of software engineers give the structure of the compiled assembly code even passing thought. Our ability to generate assembly code is so great that we don't care about the end result. We only care about its properties...i.e. that it runs efficiently enough and does what we want. I could easily see the AI software development revolution ending up the same way. Does it really matter if the code generated by AI agents is DRY and has good design if we can easily recreate it from scratch in a matter of minutes/hours? As much as I love the craft and process of creating a beautiful codebase, I think we have to seriously consider and plan for a future where that approach is dramatically less efficient than other AI-enabled approaches.
There are plenty of orgs using AI who still care about architecture and having easily human-readable, human-maintainable code. Maybe that's becoming an anachronism, and those firms will go the way of the Brontosaurus. Maybe it will be a competitive advantage. Who knows?
¹ "Make it work, make it right, make it fast."
- A lot more linting rules than ever before, also custom rule sets that do more org and project level validations.
- Stricter type enforcement in type-optional languages; stronger and deeper typing in all of them.
- Beyond unit tests: test-quality tooling like mutation testing (Stryker) and property-based testing (QuickCheck), if you can go that precise.
- Many more DX scripts and build harnesses specific to org and repo practices that junior/new devs usually learn over time.
- On the dynamic side, per-pull-request environments with e2e tests that agents can validate against and iterate on when things don't work.
- Documentation generation and skill curation. After doing a batch of pull request reviews, I will spend time seeing where the gaps are in repo skills and agents.
All this becomes pre-commit heavy, and laptops cannot keep up in monorepos, so we ended up doing more remote containers on beefy machines and investing in task caching (Nx/Turborepo have this).
Reviews (agentic or human) have their uses, but doing all this via reviews is high latency and inefficient, tends to miss things, and makes us the bottleneck.
The earlier the coder (human or agent) gets repeatable, consistent feedback, the better.
But more proactively, if people aren't going to write their own code, I think there needs to be a review process around their prompts, before they generate any code at all. Make this a formal process, generate the task list you're going to feed to your LLM, write a spec, and that should be reviewed. This is not a substitute for code reviews, but it tends to ensure that there are only nitpick issues left, not major violations of how the system is intended to be architected.
I know they won’t stop using AI so giving them a directives file that I’ve tried out might at least increase the quality of the output and lower my reviewing burden.
Open to other ideas too :)
Also, we only allow engineers to commit (agent generated) code. Designers just come up with suggestions, engineers take it and ensure it fits our architecture.
We do have a huge codebase. We are teaching Claude Code with CLAUDE.md's and now also <feature>.spec.md (often a summary of the implementation plan).
In the end, engineers are responsible.
This is an educational problem, and is unlikely to be easy to fix in your team (though I might be wrong). I would suggest to change to a team or company with a culture that values being able to reason about one’s software.
>but not being able to control the exact output worries me
Why?
They have to be responsible for what they push.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
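The validation gate described above is small enough to sketch. Here it is with zlib standing in for whatever compressors the agent is actually iterating on (the harness shape, not the compressor, is the point):

```python
# Minimal version of the validation loop: any candidate compressor must
# round-trip the exact bytes, otherwise the attempt is thrown out; among
# valid attempts, smaller total output wins. zlib is a stand-in here.
import zlib

def validate(compress, decompress, samples):
    """Return total compressed size, or None if any round trip fails."""
    total = 0
    for data in samples:
        packed = compress(data)
        if decompress(packed) != data:   # hard gate: must be lossless
            return None
        total += len(packed)
    return total

samples = [b"ELF binary bytes..." * 100, bytes(range(256)) * 50]
baseline = validate(lambda d: zlib.compress(d, 6), zlib.decompress, samples)
candidate = validate(lambda d: zlib.compress(d, 9), zlib.decompress, samples)
# keep the candidate only if it round-trips AND doesn't regress
keep = candidate is not None and candidate <= baseline
```

Because the gate is binary and cheap, the agent can be let loose on wild ideas; anything that breaks the round trip is simply discarded.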
Good news: agents are good at open-ended work like adding new tests and finding bugs. Do that. Also do unit tests and Playwright. Testing everything via web driving seemed insane pre-agents, but now it's more than doable.
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
The ability to test their work reliably is a tool, if you don't give them that, it's kinda silly to expect any kind of quality output.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
I think this is the main point where many people's work differs. For most of my work I know roughly what needs changing and how things are structured, but I jump between codebases often enough that I can't always remember the exact classes/functions where changes are needed. But I can vaguely gesture at the specific changes that need to be made, have the AI find the places that need changing, and then review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
I’m not sure it’s really true in practice yet, but that would certainly be the claim.
I think trying agents to do larger tasks was always very hit or miss, up to about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand them and their patterns really well. The use cases I've tried so far do sort of work, just not (yet, at least) faster than I'm able to actually write the code myself.
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
> I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
It squares up just fine.
You ever read a blog post or comment and think "Yeah, this is definitely AI generated"? If you can recognise it, would you accept a blog post, reviewed by you, for your own blog/site?
I won't; I'll think "eww" and rewrite.
The developers with good AI experiences don't get the same "eww" feeling when reading AI-generated code. The developers with poor AI experiences get that "eww" feeling all the time when reviewing AI code and decide not to accept the code.
Well, that's my theory anyway.
I do this with code too.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
I don’t think you have to square them, because those sentiments are coming from different people. They are also coming from people at different points along the adoption curve. If you are struggling, and you see other people struggling at the beginning of the adoption curve, it can be quite difficult to understand someone who is further along and does not appear to be struggling.
I think a lot of folks who have struggled with these tools do so because both critics and boosters create unrealistic expectations.
What I recommend is you keep trying. This is a new skill set. It is a different skill set. Which other skills that existed in the past remain necessary is not known.
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me. 2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
I’ve yet to see these things do well on anything but trivial boilerplate.
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
It's tricky though. Take "red/green TDD" for example - it's perfectly possible that models will start defaulting to doing that anyway pretty soon.
In that case it's only three words so it doesn't feel hugely wasteful if it turns out not to be necessary - and there's still value in understanding what it means even if you no longer have to explicitly tell the agents to do it.
> A comprehensive test suite is by far the most effective way to keep those features working.
there is no mention at all of LLMs' tendency to write tautological tests: tests that pass because they are defined to pass, or tests that are not at all relevant or useful and are ultimately noise in the codebase, wasting cycles on every CI run. Sometimes, to pass the tests, the model will even hardcode a value in the unit test itself! IMO this section is a great place to show how we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.
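For concreteness, here is the difference between a tautological test and a behavioral one, using a made-up `apply_discount` function (none of this is from the guide):

```python
# Two tests for the same function: the first is tautological (it mirrors
# the implementation, so it can never catch a bug), the second pins down
# externally observable behavior. `apply_discount` is a made-up example.

def apply_discount(price, pct):
    return round(price * (1 - pct / 100), 2)

def test_tautological():
    # Recomputes the answer with the same formula: passes by definition.
    assert apply_discount(80, 25) == round(80 * (1 - 25 / 100), 2)

def test_behavioral():
    # Independent expected values, boundary cases, and an invariant.
    assert apply_discount(80, 25) == 60.0
    assert apply_discount(80, 0) == 80.0
    assert apply_discount(80, 100) == 0.0
    assert 0 <= apply_discount(19.99, 15) <= 19.99
```

A quick sanity check for an LLM-written suite: if you can delete the implementation's logic and the tests still pass, they were tautological.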
- none of the "final" fields have changed after calling each method
- these two immutable objects we just confirmed differ on a property are not the same object
In addition to multiple tests with essentially identical code, multiple test classes with largely duplicated tests, etc.

[0]: https://www.codewithjason.com/examples-pointless-rspec-tests...
* no-op tests
* unit tests labeled as integration tests
* skipped tests set to skip because they were failing and the agent didn’t want to fix them
* tests that can never fail
Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working w spec + chat and don’t have extra testing harnesses.
Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to think if it's a bad test...
Many times I've observed that the tests added by the model simply pass as part of the changes, but still pass even when those changes are no longer applied.
It can still cheat, but it's less likely to cheat.
I have a hard enough time getting humans to write tests like this…
As my projects were growing in complexity and scope, I found myself worrying that we were building things that would subtly break other parts of the application. Because of the limited context windows, it was clear that after a certain size, Claude kind of stops understanding how the work you're doing interacts with the rest of the system. Tests help protect against that.
Red/green TDD specifically ensures that the current work is quite focused on the thing that you're actually trying to accomplish, in that you can observe a concrete change in behaviour as a result of the change, with the added benefit of growing the test suite over time.
It's also easier than ever to create comprehensive integration test suites - my most valuable tests are tests that test entire user facing workflows with only UI elements, using a real backend.
I’ve always been partial to integration tests too. Hand coding made integration tests feel bad; you’re almost doubling the code output in some cases - especially if you end up needing to mock a bunch of servers. Nowadays that’s cheap, which is super helpful.
The only problem is... they still take much longer to _run_ than unit tests, and they do tend to be more flaky (although Claude is helpful in fixing flaky tests too). I'm grateful for the extra safety, but it makes deployments that much slower. I've not really found a solution to that part beyond parallelising.
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood, prepare for follow-up questions from me in a rapid-fire pattern; your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response, ask me if this particular piece of knowledge should be persisted into codebase.md"
Both the concise and structured nature (code snippets) helps me gain knowledge of the entire codebase as I progressively ask it more complex questions.
Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at a different order of magnitude of scale. You need to be Leo Fender, though (not a talented guitarist, but definitely a technical master).
To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender/etc equivalent of this yet. I do appreciate the attempt though.
When you see a sorting machine that jiggles lots of pieces so they align, that's because pieces don't align naturally. It's a fix for chaos, for things that naturally behave like "doing whatever".
Industrial machinery is full of this in all sorts of places. Even in precision engineering. Press-fits and interference-fits, etc. We deal with lack of precision all the time.
Engineers are _absolute chads_ on this kind of thing. We tame chaos like no other professional.
And these tools actually work, because 99% of people still don’t really know how to prompt agents well and end up doing things like “pls fix this, it’s not working”.
One thing that worked well for us was going back to how a human team would approach it: write a product spec first (expected behavior, constraints, acceptance criteria, etc), use AI to refine that spec, and only then hand it to an opinionated flow of agents that reflect a human team to implement.
For a high level description of what this new way of engineering is about: https://substack.com/@shreddd/p-189554031
The thing I keep wrestling with is where exactly to place those checkpoints. Too frequent and you've just built a slow pair programmer. Too infrequent and you're doing expensive archaeology to figure out where it went sideways. We've landed on "before any irreversible action" as a useful heuristic, but that requires the agent to have some model of what's irreversible, which is its own can of worms.
Has anyone found a principled way to communicate implicit codebase conventions to an agent beyond just dumping a CLAUDE.md or similar file? We've tried encoding constraints as linter rules but that only catches surface stuff, not architectural intent.
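One step beyond generic linter rules is a small AST check that encodes a specific architectural constraint. This is a sketch under invented names (the `myapp.db` layer and path layout are hypothetical); the same idea can be packaged as a pylint/Ruff plugin or a pre-commit hook:

```python
# Sketch of a custom architectural rule beyond surface linting: flag any
# module outside the data layer that imports the database layer directly.
# Layer and path names are assumptions for illustration.
import ast

FORBIDDEN_PREFIX = "myapp.db"          # only myapp/db/* may import this

def forbidden_imports(source, module_path):
    if module_path.startswith("myapp/db/"):
        return []                      # the data layer itself is allowed
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        hits += [(node.lineno, n) for n in names
                 if n.startswith(FORBIDDEN_PREFIX)]
    return hits
```

It only catches one kind of architectural intent (layering), but unlike a CLAUDE.md paragraph it fails deterministically in CI, which agents respond to far more reliably than prose.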
I wanted something I could use to objectively decide if one test (or gate, as I call them) is better than another, and how do they work as a holistic system.
My personal tool encodes a workflow that has stages and gates. The gates enforce handoff. Once I did this I went from ~73% first-pass approval to over 90% just by adding structured checks at stage boundaries.
My hope is that we can have a common vocabulary to talk about this, so I wrote up the data and the framework that fell out of it: https://michael.roth.rocks/research/trust-topology/
Agent roles (Orchestrator, QA, etc.), agent communication, thinking patterns, iteration patterns, feature folders, time-aware changelog tracking, prompt enforcement, real-time steering.
We might really need a public Wiki for that (C2 [1] style)
[1]: https://wiki.c2.com/
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
- Using different models for code review
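The "deny with guidance" hook can be a short script. This sketch assumes Claude Code's PreToolUse hook protocol (the tool call arrives as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the agent); the specific rule about package installs is just an example:

```python
#!/usr/bin/env python3
# Sketch of a "deny with guidance" PreToolUse hook. Assumes the Claude
# Code hook protocol: tool-call details as JSON on stdin, exit code 2
# blocks the call, and stderr is shown to the agent as guidance
# (a plain denial would just leave the agent stuck and retrying).
import json
import sys

def decide(event):
    """Return a guidance string to block the call, or None to allow it."""
    cmd = event.get("tool_input", {}).get("command", "")
    if event.get("tool_name") == "Bash" and "pip install" in cmd:
        return ("Don't install packages ad hoc. Add the dependency to "
                "pyproject.toml and run the project's sync script instead.")
    return None

def main(stream=sys.stdin):
    reason = decide(json.load(stream))
    if reason:
        print(reason, file=sys.stderr)
        sys.exit(2)                     # block, but tell the agent why
```

The script would invoke `main()` when run as a hook; the telling-why part is what turns a dead end into a course correction.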
Running multiple agents concurrently (QA, content, conversions, distribution), we hit this exact wall - agents didn't know what other agents had done, creating duplicate work and missed context.
Solved it with a stupidly simple approach:
1. A single TODO.md with "DO NOW" (unblocked), "BLOCKED", and "DONE" sections
2. Named output files per agent type (qa-status.md, scout-finds.md, etc.)
3. active-tasks.md for crash recovery: breadcrumbs from interrupted runs
4. Daily memory logs with session IDs for searchability
The key: File-based state is deterministic. After a crash, the next agent reads identical input, same decision rules, same output structure. Zero state collision, zero "what was I thinking?"
Deployment: ~8 agents on cron. They wake, read files, work, write results, die. No persistent terminal. No coordination overhead.
This turned "5 terminal tabs with unmanageable logs" into "grep yesterday's log, see exactly what happened."
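The deterministic part is easy to see in code. A sketch of the TODO.md parsing, assuming the sections use `##` headings and `-` items (the exact markdown shape is my assumption, the section names come from the comment above):

```python
# Sketch of deterministic file-based state: one TODO.md with
# DO NOW / BLOCKED / DONE sections that every agent parses the same way
# on wake-up. Markdown shape (## headings, - items) is an assumption.

def parse_todo(text):
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current:
            sections[current].append(line[2:].strip())
    return sections

def next_task(text):
    """Deterministic pick: first unblocked item, same answer every run."""
    todo = parse_todo(text).get("DO NOW", [])
    return todo[0] if todo else None
```

Given identical file contents, every agent (including one recovering from a crash) picks the same next task, which is the whole zero-state-collision argument.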
Patterns + implementation details: https://osolobo.com/first-ai-agent-guide/
A broken test doesn’t make the agentic coding tool go “ooooh, I made a bad assumption” any more than a type error or linter does.
All a broken test does is prompt me to prompt back “fix tests”.
I have no clue which one broke, or why, or what was missed, and it doesn’t matter. Actual regressions are different and not dependent on these tests; I follow along via type errors and LLM observability.
I distilled multiple software books into these flows and skills. With more books to come.
Here is an example https://github.com/ryanthedev/code-foundations
https://simonwillison.net/guides/agentic-engineering-pattern...
*Hoard things you know how to do*
It will make everything faster for you - even if you can ask AI it will be more costly to do it from scratch.
Also, it is nothing new under the sun. In the old days a developer would have his own stack of libraries and books, and would not need to `npm i` someone else's code because he would have a bunch of his own libraries ready to go. Of course one can say there will always be a library that is better than yours... but is it? :)
Has anyone setup a smooth agent setup for game art assets generation? (AI models already do great for shaders and VFX, but I would really love to automate model + texture + animation pipeline)
- tell the agent to write a plan, review the plan, tell the agent to implement the plan
- allow the agent to “self discover” the test harness (eg. “Validate this c compiler against gcc”)
- queue a bunch of tasks with // todo … and yolo “fix all the todo tasks”
- validate against a known output (“translate this to Rust and ensure it emits byte-for-byte identical output as you go”)
- pick a suitable language for the task (“go is best for this task because I tried several languages and it did the best for this domain in go”)
You can "pip install ziglang" and get the right version for different platforms too.
Please refer to https://ziglang.org/download/0.15.1/release-notes.html#Incre...
This has nothing to do with agentic engineering. This is just normal software development. Everybody wants faster compilation speed
So far I only have one: Inflicting unreviewed code on collaborators, aka dumping a thousand line PR without even making sure it works first https://simonwillison.net/guides/agentic-engineering-pattern...
It's true that in my company we're not building rockets or defense systems, maybe you guys are and in those scenarios it's less useful. But for typical LoB and/or consumer-facing software, AI is crushing it. Where I used to need 3 devs, now I just need one (and the support team around it: PM, BA, QA, Designer). For my business, AI has been a game changer.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
Quite a heavy-lifting word here. You understand why people flagged that post, right? It's painfully non-human. I'm all for utilizing LLMs, but I highly suggest you read Simon's posts. He's obviously a heavy AI user, but even his blog posts aren't that inorganic, and that's why he became the new HN blog babe.
[0]: I personally believe Simon writes with his own voice, but who knows?
There's no actual way to determine if any words are from a silicon token generator or meat-based generator. It's not AI, it's human! Emdash. You're absolutely right!
system failure.
I would not equate software engineering to "proper" engineering insofar as being uttered in the same sentence as mechanical, chemical, or electrical engineering.
The cost of code is collapsing because web development is not broadly rigorous, robust software was never a priority, and everyone knows it. The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
I think the externalities are being ignored. Training engineers takes time and money. Having all of your users' data stolen is punished with a slap on the wrist.
So replacing those bad workers with AI is fine. Unless you remove the incentives to be fast instead of good, then yeah, AI can be good enough for some cases.
Engineering is the practical application of science and mathematics to solve problems. It sounds like you're maybe describing construction management instead. I'm not denying that there's value here, but what you're espousing seems divorced from reality. Good luck vibecoding a nontrivial actuarial model, then having it pass the laundry list of reviews and getting large firms to actually pick it up.
That's a little harsh. I think most everyone would agree we're in a transformative time for engineering. Sure there's hype, but the adoption in our profession (assuming you're an engineer) isn't waning.
The claim here is profound: comprehension of the codebase at the function level is no longer necessary
It's not profound. I read the exact same awed blog post about how "agentic" is the future and you don't even need to know code anymore. It wasn't profound the first time, and it's even dumber that people keep repeating it - maybe they take all the time they saved not writing and use it to not read.
https://www.slater.dev/2025/09/its-time-to-license-software-...
Shameless plug: I wrote one. https://marmelab.com/blog/2026/01/21/agent-experience.html
https://agentexperience.ax/ describes it as "the holistic experience AI agents have when interacting with a product, platform, or system", which feels to me like a different concept from figuring out patterns for effectively using coding agents as a software engineer.
Test fail -> implement -> linter -> test pass
Another idea I've thought about is docs-driven development. The instructions might look like:
Write doc for feat/bug > test fail > implement > lint > test pass
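The loop above can be sketched as a gate the agent has to pass through. This is a minimal illustration, not any particular tool's API: the four callables are hypothetical stand-ins for the agent's steps, injected so the gate itself is testable.

```python
# Minimal sketch of the red/green gate described above. The four callables
# are hypothetical hooks for the agent's steps. The key check: the new test
# must fail BEFORE the implementation lands, or it proves nothing.

def red_green_cycle(write_failing_test, implement, lint, run_tests):
    write_failing_test()
    if run_tests():
        raise RuntimeError("test passed before implementation; it proves nothing")
    implement()
    lint()
    if not run_tests():
        raise RuntimeError("implementation did not make the test pass")
    return "green"
```

The point of encoding it this way is that a test which passes before the implementation exists gets rejected outright, rather than silently counted as coverage.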
The "give it bash" pattern sounds scary until you realize the alternative is 47 intermediate tool calls that fail silently.
Letting the agent write and run scripts means the agent debugs when something breaks. The feedback loop tightens dramatically.
The trick is sandboxing + cost limits. Not preventing shell access.
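A partial sketch of what "sandboxing, not preventing shell access" might look like at its simplest: a scrubbed environment plus a wall-clock cap. This is only the shape of the wrapper; a real setup would add a container or jail, filesystem isolation, and spend limits on top.

```python
import subprocess

# Run an agent-generated command with a scrubbed environment and a timeout.
# This is NOT a full sandbox - it only illustrates two of the fences.

def run_sandboxed(cmd: str, timeout: int = 30):
    result = subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout,                 # kill runaway scripts after `timeout` seconds
        env={"PATH": "/usr/bin:/bin"},   # no inherited secrets or API keys
        cwd="/tmp",                      # keep it out of the real working tree
    )
    return result.returncode, result.stdout, result.stderr
```

The scrubbed `env` is the important line: the agent's scripts never see the parent process's credentials, so a buggy or prompt-injected script can fail loudly instead of leaking quietly.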
Feels like a lot of words to say what amounts to: make the agent do the steps we know work well for building software.
They are, and that's deliberate.
Something I'm finding neat about working with coding agents is that most of the techniques that get better results out of agents are techniques that work for larger teams of humans too.
If you've already got great habits around automated testing, documentation, linting, red/green TDD, code review, clean atomic commits etc - you're going to get much better results out of coding agents as well.
My devious plan here is to teach people good software engineering while tricking them into thinking the book is about AI.
I am still not sold on agentic coding. We’ll probably get there within the next couple of years.
"Explain the codebase to a newcomer. What is the general structure, what are the important things to know, and what are some pointers for things to learn next?"
Once I saw the output I giddyup'd and haven't looked back.
Thank you Simon and I'm sure you would quickly fall off from #1 blogger on HN if you did. I insist on this for myself as well.
Somehow we are all getting really good at detecting "written by AI" with primal intuition.
The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
COBOL's promise was that it was human-like text, so we wouldn't need programmers anymore.
The problem is that the average person doesn't know what their actual problems are in sufficient detail to get a working solution. When you get down to breaking down that problem... you become a programmer.
The main lesson of COBOL is that it isn't the computer interface/language that necessitates a programmer.
Agreed. I've spent the last few years building an EMR at an actual agency and the idea that users know what they want and can articulate it to a degree that won't require ANY technical decisions is pure fantasy in my experience.
At my job, we use a lot of AI to literally move fast and break things when working on internal tools. The idea is that the surface area is low, rollbacks are fast, and the upside is a lot better than the downside (our end users get a better experience to help them do their job better).
But our bottleneck is still requirements for the project. We routinely run out of stuff to do and have to ask for new stuff or work on a different project.
But you're absolutely right. Most people (programmers, managers, etc) don't know exactly what problems need to be solved, or at least, struggle to communicate it adequately for it to be implemented well enough. They say they want X. But they haven't thought about the repercussions of it, or that it requires Y first. AI might be able to help there, but it will give a totally bogus answer if it does not have any context of the domain, which is almost never documented in code.
These are still very much technical roles, but maybe we are becoming more "technical domain experts."
For example, "Generate me some repeatable code to ask system X for data about Y, pull out value Z, and submit it to system W."
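The shape of that glue task can be sketched in a few lines. Everything here is illustrative: `fetch_from_x` and `submit_to_w` are hypothetical stand-ins for calls to systems X and W (e.g. HTTP clients), injected so the pipeline can be exercised without any real systems.

```python
# Hypothetical sketch of the repeatable glue code described above.
# fetch_from_x and submit_to_w are injected stand-ins, not real APIs.

def relay_value(fetch_from_x, submit_to_w, record_id, field="z"):
    record = fetch_from_x(record_id)       # ask system X for data about Y
    value = record[field]                  # pull out value Z
    return submit_to_w(record_id, value)   # submit it to system W
```

Keeping the transport injected like this is also what makes the agent's output reviewable: the domain logic is three lines you can read, and the system-specific clients live elsewhere.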
I think attempts to document the most effective things to ask it to do in order to reach your overall goal, as well as what it is and is not good for, are probably worth making. It would be bad if it turned into a whole consultant-marketing OOP-coaching clusterfuck. But building some kind of community knowledge that these things aren't demigods, that they have limitations, and that doing things one way or another with them can work better is probably a good thing. At the very least, in theory it would cut down some of the hype.
There's a lifecycle to these hype runs, even when the thing behind the hype is plenty real. We're still in the phase where if you criticize AI you get told you don't "get it", so people are holding back some of their criticisms because they won't be received well. In this case, I'm not talking about the criticisms of the people standing back and taking shots at the tech, I'm talking about the criticisms of those heavily using it.
At some point, the dam will break, and it will become acceptable, if not fashionable, to talk about the real problems the tech is creating. Right now there is only the tiniest trickle from the folk who just don't care how they are perceived, but once it becomes acceptable it'll be a flood.
And there are going to be problems that come from using vast quantities of AI on a code base, especially of the form "created so much code my AI couldn't handle it anymore and neither could any of the humans involved". There's going to need to be a discussion on techniques on how to handle this. There's going to be characteristic problems and solutions.
The thing that really makes this hard to track though is the tech itself is moving faster than this cycle does. But if the exponential curve turns into a sigmoid curve, we're going to start hearing about these problems. If we just get a few more incremental improvements on what we have now, there absolutely are going to be patterns as to how to use AI and some very strong anti-patterns that we'll discover, and there will be consultants, and little companies that will specialize in fixing the problems, and people who propose buzzword solutions and give lots of talks about it and attract an annoying following online, and all that jazz. Unless AI proceeds to the point that it can completely replace a senior engineer from top to bottom, this is inevitable.
That's essentially the thing we are calling "cognitive debt".
I have a chapter with one small thing to help address that here - https://simonwillison.net/guides/agentic-engineering-pattern... - but it's a much bigger topic and will require extensive exploration by the whole industry to figure out.
I feel like there's a similar vibe coming with vibe coding. Just let the AI generate as much code as it wants; don't check it, because it doesn't matter: only the LLM will be reading it anyway.
My gut tells me that
1. there will still be reasons for humans to understand the code for a long time,
2. even the LLM will struggle with modifying code past a certain size and complexity without good encapsulation and well-thought-out system architecture and design.
yes. It sucks but I think it's good for the next generation of tech industry employees to watch this. It's happening quickly so you get a 10 year timeline compressed into a few years which makes it easier to follow and expose. The bloggers will come, then speakers, then there will be books. Consultants will latch on and start initiatives at their clients. Once enough large enterprises are sold on it, there will come associations and certification bodies so a company can say "we have X certified abc on staff". Manifestos will be released, version numbers will be incremented so there's a steady flow of work writing books, doing trainings, and getting the next level certified.
This is standard issue tech industry stuff (and it probably happens everywhere else too) but compressed into a tighter timeline so you don't have to wait a decade to see it unfold.
It's not as simple an observation as you're making it out to be.
> You can just ask it to do what you want.
Yes, but very clearly, as any HN thread on AI shows, different people are having VERY different outcomes with it. And I suspect it is largely the misconception that it will magically "just do what you want" that leads to poor outcomes.
The techniques mentioned -- coding, docs, modularity etc. -- may seem obvious now, but only recently did we realize that the primary principle emerging is "what's good for humans is good for agents." That was not at all obvious when we started off. It is doubly counter-intuitive given the foremost caveat has been "Don't anthropomorphize AI." I'm finding that is actually a decent way to understand these models. They are unnervingly like us, yet not like us.
All that to say, AI is essentially black magic and it is not yet obvious how to use it well for all people and all use-cases, so yes, more exposition is warranted.
The context suggests the former, but your criticisms bear no relation to the linked content. If anything, your edict to "write tests first" is even more succinctly expressed as "Red/green TDD".
Doesn't it sound like the "right incantation"? That's the point of LLMs, they can understand (*) intent. You'd get the same result saying "do tdd" or "do the stuff everyone says they do but they don't, with the failing test first, don't remember the name, but you know what I'm saying innit?"
I'm perhaps uncharitable, and this article just happens to take the collateral damage, but I'm starting to see the same corruption that turned "At regular intervals, the team reflects on how to become more effective" into "Mandatory retro exactly once every fortnight, on a board with precisely three columns".
While I agree with the sentiment that we shouldn't make things more complicated by inventing fancy names, we also shouldn't pretend that software engineering has become super simple now. Building a great piece of software remains super hard to do and finding better techniques for it affords real study.
Your post is annoying me quite a bit because it's super unfair to the linked post. Simon Willison isn't trying to coin a new term, he's just trying to start a collection of useful patterns. "Agentic engineering" is just the obvious term for software engineering using agents. What would you call it, "just asking things"?
I was speaking from a software engineer's point of view, in the context of the article, where one of the "agentic" patterns is ... test-driven development? Which you summon out of the agent by saying ... "Do test-driven development", more or less?
> What would you call it, "just asking things"?
I'd call it software engineering. What makes good software didn't suddenly change, and there's fundamentally no secret sauce to getting an agent to do it. If I think a class does too many things, then I tell the agent "This class does too many things, move X and Y methods into a new class".
We have simple and sensible stuff.
But then a bunch of assholes who don't know better and just want to milk $$$ will come over and ruin it for everyone.
I suspect that this time around, management will expect the AI chatbot to explain these things to you, because who pays for anything anymore if the AI can do it all.
If the answer is just "Install Oh My Opencode and stick any decent model in it" then it doesn't work.
And honestly, the answer is just to install Oh My Opencode and Kimi K2.5 and get 90% of the performance of Opus for a fraction of the price.
Basically, it's Waterfall for Agents. Lots of Capitalized Words to signify something.
Also they constantly call it the BMAD Method, even though the M already stands for method.
But can it pass the butter?
I mean - yeah. So do humans. But it turns out that a lot of humans require considerable process to organize productively too. A pet thesis of mine is that we are just (re-)discovering the usefulness of process and protocol.
There was already another attempt at agentic patterns earlier:
Absolute hot air garbage.
Secondly: this is a temporary vacuum state. We're only needed to bridge the gap.
I wouldn't be trying to be a consultant, I would be scurrying to ensure we have access to these tools once they're industrial. A "$5M button" to create any business function won't be within the reach of labor capital, but it will be for financial capital. That's the world we're headed to.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-01/Grounding_Yo...
Which is oddly close to how investment advice is given. If these techniques work so well, why give them up for free?
The thing I keep coming back to is that it's all code. Almost all white collar professions have at least some key outputs in code. Whether you are a store manager filling out reports or a marketing firm or a teacher, there is so much code.
This means you can give Claude Code a branded document template, have it fill it out, include images, etc., and upload the result to our cloud hosting.
With this same guidance and taste, I'm doing close to the work of 5 people.
Setup: Claude code with full API access to all my digital spaces + tmux running 3-5 tasks in parallel
A nice way to realize why this AI wave hasn't produced massive economic growth: it is mostly touching parts of the economy that are parasitic and can't really create growth.
1. Proxy-based governance. Route all LLM traffic through a governance layer. The agent never holds API keys directly — the proxy holds them and issues scoped, short-lived capability tokens (ES256, 60s TTL). Single enforcement point for scanning, classification, and audit.
2. Scan all message roles. Most people scan user input. In practice, PII and secrets show up in system messages (from frameworks like LangChain), tool responses, and assistant messages from previous turns. OpenAI's "developer" role is another unscanned vector.
3. Deterministic detection over LLM judges. Using a second model to evaluate the first sounds elegant but creates a recursive trust problem. Regex + text normalization (reversing ~24 obfuscation techniques) is boring but reliable and adds ~250ms, not seconds.
4. Fail-closed by default. If your policy engine goes down, block everything. Don't fail open.
5. Presets, not configuration. Nobody writes custom Rego policies from scratch. Ship starter/standard/regulated presets and let teams tune. These came from 12 rounds of red-teaming our own pipeline — about 300 test cases across encoding bypasses, multilingual injection, Unicode evasion, and tool-result poisoning.
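Point 3 (deterministic detection) is the most concrete of these, so here is a toy sketch of the idea: normalize away a few common obfuscations, then run plain regexes. The real pipeline above reverses ~24 techniques; this shows just three, with an email pattern as the example detector, and none of it reflects that system's actual code.

```python
import re
import unicodedata

# Toy sketch of "regex + text normalization" detection: undo a few
# obfuscations first, then match with a boring, deterministic pattern.

ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # fold fullwidth/compatibility forms
    text = text.translate(ZERO_WIDTH)            # strip zero-width characters
    # rewrite "alice [at] example.com" style spellings back to "@"
    text = re.sub(r"\s*[\[\(]\s*at\s*[\)\]]\s*", "@", text, flags=re.I)
    return text

def find_emails(text: str):
    return EMAIL.findall(normalize(text))
```

Because every step is a pure string transform, the same input always produces the same verdict — which is exactly the property you lose when a second LLM is the judge.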
This brings the Linux kernel-style patch => discuss => merge-by-maintainer workflow to agents. You get bisect-safe patches that you review, provide feedback on, and approve.
While a SKILL could mimic this, being built in allows me to place access control and 'gate' destructive actions so the LLM is forced to follow this workflow. Overall, this works really well for me. I am able to get bisect-safe patches, and then review / re-roll them until I get exactly what I want, then I merge them.
Sure this may be the path to software factories, but it scales 'enough' for medium size projects and I've been able to build in a way that I maintain strong understanding of the code that goes in.
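The mechanics of that loop can be demonstrated with plain git in a throwaway repo. Repo layout, branch names, and author identity here are all illustrative; the point is only that the agent's work arrives as a reviewable patch series, applied to a clean branch after sign-off.

```shell
# Throwaway demo of the patch => review => merge loop (illustrative names).
set -e
repo=$(mktemp -d)
series=$(mktemp)
cd "$repo"
git init -q -b main
git config user.email agent@example.com
git config user.name agent
echo base > file.txt
git add file.txt
git commit -qm "base"
# the agent works on its own branch...
git checkout -qb agent-work
echo change >> file.txt
git commit -qam "agent: proposed change"
# ...and emits its work as a patch series for human review
git format-patch main --stdout > "$series"
# the maintainer reads the patch text, then applies it to a clean branch
git checkout -q main
git checkout -qb review
git am -q < "$series"
git log --oneline -1
```

Because `git am` recreates the commits one by one, every applied patch is a bisect-safe point in history, and a rejected patch simply never lands.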
Colleagues don’t usually like to review AI-generated code. If they use AI to review code, that misses the point of doing the review. If they do the review manually (the old way), it becomes a bottleneck (we are faster at producing code now than we are at reviewing it).
I'm hoping to add more on that topic as I discover other patterns that are useful there.
I was expecting tips on code review instead based on your comment and GP.
like don't ask it to "write tests for this function", instead give it a function that's deliberately broken in a specific way, make it write a test that catches that bug, verify the test actually fails, THEN fix the function
this forces the test to be meaningful because it has to detect a real failure mode. if the agent can't make the test fail by breaking the code, the test is useless
the other thing that helps is being really specific about edge cases upfront. instead of "write tests for this API endpoint", say "write tests that verify it returns 400 when the email field is missing, returns 409 when the email already exists, returns 422 when the email is malformed" etc
agents are weirdly good at implementing specific test scenarios but terrible at figuring out what scenarios actually matter. which honestly is the same problem junior devs have lol
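The "break it first" discipline above can be sketched concretely. `dedupe` here is a hypothetical target function, deliberately broken in one specific way (it destroys ordering), and the test only earns trust because it catches exactly that breakage.

```python
# Sketch of the break-it-first technique: a test is only meaningful once
# it has failed against a known-broken version of the function.

def broken_dedupe(items):
    # deliberately broken: sorting destroys the original order
    return sorted(set(items))

def fixed_dedupe(items):
    # correct version: first occurrence wins, order preserved
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def order_preserving_dedupe_test(dedupe):
    # meaningful precisely because it detects the specific breakage above
    return dedupe(["b", "a", "b", "c"]) == ["b", "a", "c"]
```

If `order_preserving_dedupe_test` passed against `broken_dedupe`, it would be useless by the comment's own standard: a test that cannot distinguish the broken function from the fixed one proves nothing.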
Dismissing everything AI as slop strikes me as an attitude that is not going to age well. You’ll miss the boat when it does come (and I believe it already has).
Is the boat:
1) unmissable since the tools get better all the time and are intelligent
or
2) nearly-impossible to board since the tools will replace most of the developers
or
3) a boat of small productivity improvements?