One of the biggest struggles I have on my team is coworkers straight up vibing parts of the code and not understanding or guiding the architecture of subsystems. Or at least, not writing code in a way that is meant to be understood by others.
Then when I go through the code and provide extensive feedback (mostly architectural, highlighting odd inconsistencies in the additions) I'm met with a lot of pushback because "it works, why change it?" Not to mention the sheer size of PRs ballooning in recent months.
The end result is me being the bottleneck because I can't keep up with the "pace" of code being generated, and feeling a lot of discomfort and pressure to lower my standards.
I've thought about using a code review agent to review and act as my proxy, but not being able to control the exact output worries me. And I don't like the lack of human touch it provides. Maybe someone has advice on a humane way to handle this problem.
If you accelerate the pace of code creation it inevitably creates bottlenecks elsewhere. Code review is by far the biggest of those right now.
There may be an argument for leaning less on code review. When code is expensive to produce and is likely to stay in production for many years it's obviously important to review it very carefully. If code is cheap and can be inexpensively replaced maybe we can lower our review standards?
But I don't want to lower my standards! I want the code I'm producing with coding agents to be better than the code I would produce without them.
There are some aspects of code review that you cannot skimp on. Things like coding standards may not matter as much, but security review will never be optional.
I've recently been wondering what we can learn from security teams at large companies. Once you have dozens or hundreds of teams shipping features at the same time - teams with varying levels of experience - you can no longer trust those teams not to make mistakes. I expect that the same strategies used by security teams at Facebook/Google-scale organizations could now be relevant to smaller organizations where coding agents are responsible for increasing amounts of code.
Generally though I think this is very much an unsolved problem. I hope to document the effective patterns for this as they emerge.
If that's true, then I would think the emphasis in code review should be more on test quality and verifying that the spec is captured accurately, and as you suggest, the actual implementation is less important.
Agree with everything else you said except this. In my opinion, this assumes code becomes more like a consumable as code-production costs reduce. But I don't think that's the case. Incorrect, but not visibly incorrect, code will sit in place for years.
I don't care how cheap it is to replace the incorrect code when it's modifying my bank account or keeping my lights on.
I.e. have a `planning/designs/unbuilt/...` folder that contains markdown descriptions of features/changes that would have gotten a PR. Now do the review at the design level.
Weirdly this article doesn't really talk about the main agentic pattern
- Plan (really important to start with a plan before code changes). Iteratively build a plan to implement something. You can also have a collective review of the plan: make sure it's what you want, and that there is guidance about how it should be implemented in terms of architecture (it should also be pulling in pre-existing context about your architecture and coding standards) and what testing should be built. Make sure the agent reviews the plan; ask the agent to make suggestions and ask questions
- Execute. Make the agent (or multiple agents) execute on the plan
- Test / Fix cycle
- Code Review / Refactor
- Generate Test Guidance for QA
Then your deliverables are Code / Feature context documentation / Test Guidance + evolving your global/project context
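The staged workflow above can be sketched as a gated loop. This is a hypothetical sketch, not any particular agent's API: `run_agent` stands in for whatever coding-agent call you use, and the human plan-review gate is the `approve_plan` callback.

```python
# Hypothetical sketch of the plan -> execute -> test/fix -> review -> QA
# workflow. `run_agent` and `approve_plan` are stand-ins for your real
# agent API and your human plan-review step.

STAGES = ["plan", "execute", "test_fix", "review_refactor", "qa_guidance"]

def run_pipeline(task, run_agent, approve_plan):
    """Drive one task through the staged workflow, gating on plan approval."""
    plan = run_agent(f"Write an implementation plan for: {task}. "
                     "Follow existing architecture/coding standards. "
                     "Ask questions about anything ambiguous.")
    if not approve_plan(plan):          # human gate: collective plan review
        raise RuntimeError("Plan rejected; revise before any code changes")
    artifacts = {"plan": plan}
    for stage in STAGES[1:]:            # execute, test/fix, review, QA docs
        artifacts[stage] = run_agent(f"Stage '{stage}' for plan:\n{plan}")
    return artifacts                    # code + feature docs + test guidance
```

The point of structuring it this way is that the plan is the only artifact a human must fully read before any code exists.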
That shifts where rigor needs to live. The article focuses on planning patterns before code generation, which matters. But I'd argue the merge gate is equally important and massively underinvested. Right now the merge decision for most teams is: one person clicks Approve after a quick scan. That's the same process whether the PR is a trivial config change or a critical auth refactor, whether it came from a trusted agent or an unknown one.
The teams I've seen handle this well invest in proportional review. Not every change gets the same scrutiny. They define risk dimensions (what files changed, what agent generated it, how complex is the diff) and route PRs to different review intensities based on that score. The planning patterns in the article are upstream. But the merge governance pattern is downstream, and it's where most of the production risk actually lives.
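A minimal sketch of that risk-scoring idea, with entirely made-up weights, thresholds, and sensitive-path names; the real dimensions would come from your own incident history:

```python
# Sketch of "proportional review": score a PR on a few risk dimensions
# and route it to a review intensity. All weights/paths are assumptions.

SENSITIVE_PATHS = ("auth/", "billing/", "migrations/")

def risk_score(files_changed, diff_lines, agent_trusted):
    score = 0
    score += 3 * sum(f.startswith(SENSITIVE_PATHS) for f in files_changed)
    score += diff_lines // 200          # big diffs get more scrutiny
    score += 0 if agent_trusted else 2  # unknown agents start riskier
    return score

def review_tier(score):
    if score >= 5:
        return "two-human review + security sign-off"
    if score >= 2:
        return "one human reviewer, full read"
    return "automated checks + spot check"
```

So a 50-line agent-generated change touching `auth/` routes to the heaviest tier, while a trusted agent's docs tweak goes straight to automated checks.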
To the specification debate: I've found that detailed specs help less when you have good merge governance, because bad output gets caught and rejected at the merge gate rather than requiring perfect input at the spec stage.
The problem is Claude Code has a planning mode baked in, which works really well but is quite custom to how Claude Code likes to do things.
When I describe it as a pattern I want to stretch a little beyond the current default implementation in one of the most popular coding agents.
Yea, a big part of my planning has included what verification steps will be necessary along the way or at the end. No plan gets executed without that and I often ask for specific focus on this aspect in plan mode.
Always, even before all this madness. It sounds more like a function of these teams' CR process than of agents writing the code. Sometimes super large PRs are necessary, and I've always requested a 30-minute meeting to discuss.
I don't see this as an issue, just noise. Reduce the PR footprint. If not possible meet with the engineer(s)
When shipping pressure comes, I've seen this be the first thing to go. Despite formalizing ownership standards, etc., people on both the submitting and reviewing end just give up on understanding AI slop when management says they need to hit a deadline.
Probably no company would actually do this, but I wonder if we should actively test the submitter's understanding of the submitted code somehow, as a prerequisite to moving a PR to ready-for-review. I'm not sure forcing people to understand the code will actually be helpful, but maybe at least we'll put the cultural expectation front and center?
Now, however, we know how that played out in the case of assembly language. The fact of the matter is that only a very tiny fraction of software engineers give the structure of the compiled assembly code even passing thought. Our ability to generate assembly code is so great that we don't care about the end result. We only care about its properties...i.e. that it runs efficiently enough and does what we want. I could easily see the AI software development revolution ending up the same way. Does it really matter if the code generated by AI agents is DRY and has good design if we can easily recreate it from scratch in a matter of minutes/hours? As much as I love the craft and process of creating a beautiful codebase, I think we have to seriously consider and plan for a future where that approach is dramatically less efficient than other AI-enabled approaches.
There are plenty of orgs using AI who still care about architecture and having easily human-readable, human-maintainable code. Maybe that's becoming an anachronism, and those firms will go the way of the Brontosaurus. Maybe it will be a competitive advantage. Who knows?
¹ "Make it work, make it right, make it fast."
- A lot more linting rules than ever before, also custom rule sets that do more org and project level validations.
- Stricter type enforcement in type-optional languages; stronger and deeper typing in all of them.
- Beyond unit tests: test-quality tooling like mutation testing (Stryker) and property-based testing (QuickCheck), if you can go that precise.
- Many more DX scripts and build harnesses specific to org and repo practices that junior/new devs usually learn over time.
- On the dynamic side, per-pull-request environments with e2e tests that agents can validate against and iterate on when things don't work.
- Documentation generation and skill curation. After doing a batch of pull request reviews, I will spend time seeing where the gaps are in repo skills and agents.
All this becomes pre-commit heavy, and laptops cannot keep up in monorepos, so we ended up doing more remote containers on beefy machines and investing in task caching (Nx/Turborepo have this).
Reviews (agentic or human) have their uses, but doing all this via reviews is high latency and inefficient, tends to miss things, and makes us the bottleneck.
The earlier the coder (human or agent) gets repeatable, consistent feedback, the better.
But more proactively, if people aren't going to write their own code, I think there needs to be a review process around their prompts, before they generate any code at all. Make this a formal process, generate the task list you're going to feed to your LLM, write a spec, and that should be reviewed. This is not a substitute for code reviews, but it tends to ensure that there are only nitpick issues left, not major violations of how the system is intended to be architected.
I know they won’t stop using AI so giving them a directives file that I’ve tried out might at least increase the quality of the output and lower my reviewing burden.
Open to other ideas too :)
Also, we only allow engineers to commit (agent generated) code. Designers just come up with suggestions, engineers take it and ensure it fits our architecture.
We do have a huge codebase. We are teaching Claude Code with CLAUDE.md's and now also <feature>.spec.md (often a summary of the implementation plan).
In the end, engineers are responsible.
This is an educational problem, and is unlikely to be easy to fix in your team (though I might be wrong). I would suggest to change to a team or company with a culture that values being able to reason about one’s software.
>but not being able to control the exact output worries me
Why?
They have to be responsible for what they push.
In one of my experiments I had the simple goal of "making Linux binaries smaller to download using better compression" [1]. Compression is perfect for this. Easily validated (binary -> compress -> decompress -> binary) so each iteration should make a dent otherwise the attempt is thrown out.
Lessons I learned from my attempts:
- Do not micro-manage. AI is probably good at coming up with ideas and does not need your input too much
- Test harness is everything, if you don't have a way of validating the work, the loop will go stray
- Let the iterations experiment. Let AI explore ideas and break things in its experiment. The iteration might take longer but those experiments are valuable for the next iteration
- Keep some .md files as scratch pad in between sessions so each iteration in the loop can learn from previous experiments and attempts
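The validation gate described above is small enough to sketch. Here it is with zlib standing in for whatever compressors the agent is actually iterating on (the harness shape, not the compressor, is the point):

```python
# Minimal version of the validation loop: any candidate compressor must
# round-trip the exact bytes, otherwise the attempt is thrown out; among
# valid attempts, smaller total output wins. zlib is a stand-in here.
import zlib

def validate(compress, decompress, samples):
    """Return total compressed size, or None if any round trip fails."""
    total = 0
    for data in samples:
        packed = compress(data)
        if decompress(packed) != data:   # hard gate: must be lossless
            return None
        total += len(packed)
    return total

samples = [b"ELF binary bytes..." * 100, bytes(range(256)) * 50]
baseline = validate(lambda d: zlib.compress(d, 6), zlib.decompress, samples)
candidate = validate(lambda d: zlib.compress(d, 9), zlib.decompress, samples)
# keep the candidate only if it round-trips AND doesn't regress
keep = candidate is not None and candidate <= baseline
```

Because the gate is binary and cheap, the agent can be let loose on wild ideas; anything that breaks the round trip is simply discarded.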
Good news: agents are good at open-ended work like adding new tests and finding bugs. Do that. Also do unit tests and Playwright. Testing everything via web driving seemed insane pre-agents, but now it's more than doable.
This is the most important piece to using AI coding agents. They are truly magical machines that can make easy work of a large number of development, general purpose computing, and data collection tasks, but without deterministic and executable checks and tests, you can't guarantee anything from one iteration of the loop to the next.
The ability to test their work reliably is a tool, if you don't give them that, it's kinda silly to expect any kind of quality output.
I've dipped into agentic work now and again, but never been very impressed with the output (well, that there is any functioning output is insanely impressive, but it isn't code I want to be on the hook for).
I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
Anyway, really looking forward to trying some of these patterns as the book develops to see if that makes a difference. Understanding how other people really use these tools is a big gap for me.
I think this is the main point where many people's work differs. For most of my work I know roughly what needs changing and how things are structured, but I jump between codebases often enough that I can't always remember the exact classes/functions where changes are needed. But I can vaguely gesture at the specific changes that need to be made, have the AI find the places that need changing, and then review the result.
I rarely get the luxury of working in a single codebase for a long enough period of time to get so familiar with it that I can jump to particular functions without much thought. That means AI is usually a better starting point than me fumbling around trying to find what I think exists but I don’t know where it is.
I'm thinking about how to solve the problem and how to express it in the programming language such that it is easy to maintain. Getting someone/something else to do that doesn't help me.
But different strokes for different folks, I suppose.
And I have the AI deal with "knowing how to do it" as well. Often it's slower to have it do enough research to know how to do it, but my time is more expensive than Claude's time, and so as long as I'm not sitting around waiting it's a net win.
I’m not sure it’s really true in practice yet, but that would certainly be the claim.
I think trying agents to do larger tasks was always very hit or miss, up to about the end of last year.
In the past couple of months I have found them to have gotten a lot better (and I'm not the only one).
My experience with what coding assistants are good for shifted from:
smart autocomplete -> targeted changes/additions -> full engineering
To answer your question, I’ve tried both Claude code and Antigravity in the last 2 weeks and I’m still finding them struggling. AG with Gemini regularly gets stuck on simple issues and loops until I run out of requests, and Claude still just regularly goes on wild tangents not actually solving the problem.
Pretty recently (a couple weeks ago). I give agentic workflows a go every couple of weeks or so.
I should say, I don't find them abysmal, but I tend to work in codebases where I understand them and their patterns really well. The use cases I've tried so far do sort of work, just not (yet, at least) faster than I'm able to actually write the code myself.
> smart autocomplete -> targeted changes/additions -> full engineering
Define "full engineering". Because if you say "full engineering" I would expect the agent to get some expected product output details as input and produce all by itself the right implementation for the context (i.e. company) it lives in.
> I hear a lot of people saying the same, but similarly a bunch of people I respect saying they barely write code anymore. It feels a little tricky to square these up sometimes.
It squares up just fine.
You ever read a blog post or comment and think "Yeah, this is definitely AI generated"? If you can recognise it, would you accept a blog post, reviewed by you, for your own blog/site?
I won't; I'll think "eww" and rewrite.
The developers with good AI experiences don't get the same "eww" feeling when reading AI-generated code. The developers with poor AI experiences get that "eww" feeling all the time when reviewing AI code and decide not to accept the code.
Well, that's my theory anyway.
I do this with code too.
In my experience, this heavily depends on the task, and there's a massive chasm between tasks where it's a good and bad fit. I can definitely imagine people working only on one side of this chasm and being perplexed by the other side.
I don’t think you have to square them, because those sentiments are coming from different people. They are also coming from people at different points along the adoption curve. If you are struggling, and you see other people struggling at the beginning of the adoption curve, it can be quite difficult to understand someone who is further along and does not appear to be struggling.
I think a lot of folks who have struggled with these tools do so because both critics and boosters create unrealistic expectations.
What I recommend is you keep trying. This is a new skill set. It is a different skill set. Which other skills that existed in the past remain necessary is not known.
1) Having review loops between agents (spawn separate "reviewer" agents) and clear tests / eval criteria improved results quite a bit for me. 2) Reviewing manually and giving instructions for improvements is necessary to have code I can own
I’ve yet to see these things do well on anything but trivial boilerplate.
- Through the last two decades of the 20th century, Moore’s Law held and ensured that more transistors could be packed into next year’s chips that could run at faster and faster clock speeds. Software floated on a rising tide of hardware performance so writing fast code wasn’t always worth the effort.
- Power consumption doesn’t vary with transistor density but varies with the cube of clock frequency, so by the early 2000s Intel hit a wall and couldn’t push the clock above ~4GHz with normal heat dissipation methods. Multi-core processors were the only way to keep the performance increasing year after year.
- Up to this point the CPU could squeeze out performance increases by parallelizing sequential code through clever scheduling tricks (and compilers could provide an assist by unrolling loops) but with multiple cores software developers could no longer pretend that concurrent programming was only something that academics and HPC clusters cared about.
CS curricula are mostly still stuck in the early 2000s, or at least it feels that way. We teach big-O and use it to show that mergesort or quicksort will beat the pants off of bubble sort, but topics like Amdahl’s Law are buried in an upper-level elective when in fact it is much more directly relevant to the performance of real code, on real present-day workloads, than a typical big-O analysis.
In any case, I used all this as justification for teaching bitonic sort to 2nd and 3rd year undergrads.
My point here is that Simon’s assertion that “code is cheap” feels a lot like the kind of paradigm shift that comes from realizing that in a world with easily accessible massively parallel compute hardware, the things that matter for writing performant software have completely shifted: minimizing branching and data dependencies produces code that looks profoundly different than what most developers are used to. e.g. running 5 linear passes over a column might actually be faster than a single merged pass if those 5 passes touch different memory and the merged pass has to wait to shuffle all that data in and out of the cache because it doesn’t fit.
What all this means for the software development process I can’t say, but the payoff will be tremendous (10-100x, just like with properly parallelized code) for those who can see the new paradigm first and exploit it.
It's tricky though. Take "red/green TDD" for example - it's perfectly possible that models will start defaulting to doing that anyway pretty soon.
In that case it's only three words so it doesn't feel hugely wasteful if it turns out not to be necessary - and there's still value in understanding what it means even if you no longer have to explicitly tell the agents to do it.
> A comprehensive test suite is by far the most effective way to keep those features working.
there is no mention at all of LLMs' tendency to write tautological tests: tests that pass because they are defined to pass, or tests that are not at all relevant or useful and are ultimately noise in the codebase, wasting cycles on every CI run. Sometimes, to pass the tests, the model will even hardcode a value in the unit test itself! IMO this section is a great place to show how we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.
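For concreteness, here is the difference between a tautological test and a behavioral one, using a made-up `apply_discount` function (none of this is from the guide):

```python
# Two tests for the same function: the first is tautological (it mirrors
# the implementation, so it can never catch a bug), the second pins down
# externally observable behavior. `apply_discount` is a made-up example.

def apply_discount(price, pct):
    return round(price * (1 - pct / 100), 2)

def test_tautological():
    # Recomputes the answer with the same formula: passes by definition.
    assert apply_discount(80, 25) == round(80 * (1 - 25 / 100), 2)

def test_behavioral():
    # Independent expected values, boundary cases, and an invariant.
    assert apply_discount(80, 25) == 60.0
    assert apply_discount(80, 0) == 80.0
    assert apply_discount(80, 100) == 0.0
    assert 0 <= apply_discount(19.99, 15) <= 19.99
```

A quick sanity check for an LLM-written suite: if you can delete the implementation's logic and the tests still pass, they were tautological.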
- none of the "final" fields have changed after calling each method
- these two immutable objects we just confirmed differ on a property are not the same object
In addition to multiple tests with essentially identical code, multiple test classes with largely duplicated tests, etc.

[0]: https://www.codewithjason.com/examples-pointless-rspec-tests...
* no-op tests
* unit tests labeled as integration tests
* skipped tests set to skip because they were failing and the agent didn’t want to fix them
* tests that can never fail
Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working w spec + chat and don’t have extra testing harnesses.
Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to think if it's a bad test...
Many times I've observed that the tests added by the model simply pass as part of the changes, but still pass even when those changes are no longer applied.
It can still cheat, but it's less likely to cheat.
I have a hard enough time getting humans to write tests like this…
As my projects were growing in complexity and scope, I found myself worrying that we were building things that would subtly break other parts of the application. Because of the limited context windows, it was clear that after a certain size, Claude kind of stops understanding how the work you're doing interacts with the rest of the system. Tests help protect against that.
Red/green TDD specifically ensures that the current work is quite focused on the thing that you're actually trying to accomplish, in that you can observe a concrete change in behaviour as a result of the change, with the added benefit of growing the test suite over time.
It's also easier than ever to create comprehensive integration test suites - my most valuable tests are tests that test entire user facing workflows with only UI elements, using a real backend.
I’ve always been partial to integration tests too. Hand coding made integration tests feel bad; you’re almost doubling the code output in some cases - especially if you end up needing to mock a bunch of servers. Nowadays that’s cheap, which is super helpful.
The only problem is... they still take much longer to _run_ than unit tests, and they do tend to be more flaky (although Claude is helpful in fixing flaky tests too). I'm grateful for the extra safety, but it makes deployments that much slower. I've not really found a solution to that part beyond parallelising.
"deeply understand this codebase, clearly noting async/sync nature, entry points and external integration. Once understood, prepare for follow-up questions from me in a rapid-fire pattern; your goal is to keep responses concise and always cite code snippets to ensure responses are factual and not hallucinated. With every response, ask me if this particular piece of knowledge should be persisted into codebase.md"
Both the concise and structured nature (code snippets) helps me gain knowledge of the entire codebase as I progressively ask it more complex questions.
Take a guitar, for example. You don't industrialize the manufacture of guitars by speeding up the same practices that artisans used to build them. You don't create machines that resemble individual artisans in their previous roles (like everyone seems to be trying to do with AI and software). You become Leo Fender, and you design a new kind of guitar that is made to be manufactured at a different order of magnitude of scale. You need to be Leo Fender, though (not a talented guitarist, but definitely a technical master).
To me, it sounds too early to describe patterns, since we haven't met the Ford/Fender/etc equivalent of this yet. I do appreciate the attempt though.
When you see a sorting machine that jiggles lots of pieces so they align, that's because pieces don't align naturally. It's a fix for chaos, for things that naturally behave like "doing whatever".
Industrial machinery is full of this in all sorts of places. Even in precision engineering. Press-fits and interference-fits, etc. We deal with lack of precision all the time.
Engineers are _absolute chads_ on this kind of thing. We tame chaos like no other professional.
And these tools actually work, because 99% of people still don’t really know how to prompt agents well and end up doing things like “pls fix this, it’s not working”.
One thing that worked well for us was going back to how a human team would approach it: write a product spec first (expected behavior, constraints, acceptance criteria, etc), use AI to refine that spec, and only then hand it to an opinionated flow of agents that reflect a human team to implement.
For a high level description of what this new way of engineering is about: https://substack.com/@shreddd/p-189554031
The thing I keep wrestling with is where exactly to place those checkpoints. Too frequent and you've just built a slow pair programmer. Too infrequent and you're doing expensive archaeology to figure out where it went sideways. We've landed on "before any irreversible action" as a useful heuristic, but that requires the agent to have some model of what's irreversible, which is its own can of worms.
Has anyone found a principled way to communicate implicit codebase conventions to an agent beyond just dumping a CLAUDE.md or similar file? We've tried encoding constraints as linter rules but that only catches surface stuff, not architectural intent.
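One step beyond generic linter rules is a small AST check that encodes a specific architectural constraint. This is a sketch under invented names (the `myapp.db` layer and path layout are hypothetical); the same idea can be packaged as a pylint/Ruff plugin or a pre-commit hook:

```python
# Sketch of a custom architectural rule beyond surface linting: flag any
# module outside the data layer that imports the database layer directly.
# Layer and path names are assumptions for illustration.
import ast

FORBIDDEN_PREFIX = "myapp.db"          # only myapp/db/* may import this

def forbidden_imports(source, module_path):
    if module_path.startswith("myapp/db/"):
        return []                      # the data layer itself is allowed
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        hits += [(node.lineno, n) for n in names
                 if n.startswith(FORBIDDEN_PREFIX)]
    return hits
```

It only catches one kind of architectural intent (layering), but unlike a CLAUDE.md paragraph it fails deterministically in CI, which agents respond to far more reliably than prose.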
I wanted something I could use to objectively decide if one test (or gate, as I call them) is better than another, and how do they work as a holistic system.
My personal tool encodes a workflow that has stages and gates. The gates enforce handoff. Once I did this I went from ~73% first-pass approval to over 90% just by adding structured checks at stage boundaries.
My hope is that we can have a common vocabulary to talk about this, so I wrote up the data and the framework that fell out of it: https://michael.roth.rocks/research/trust-topology/
Agent roles (Orchestrator, QA, etc.), agent communication, thinking patterns, iteration patterns, feature folders, time-aware changelog tracking, prompt enforcement, real-time steering.
We might really need a public Wiki for that (C2 [1] style)
[1]: https://wiki.c2.com/
Other things that I feel are useful:
- Very strict typing/static analysis
- Denying tool usage with a hook telling the agent why+what they should do (instead of simple denial, or dangerously accepting everything)
- Using different models for code review
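The "deny with guidance" hook can be a short script. This sketch assumes Claude Code's PreToolUse hook protocol (the tool call arrives as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the agent); the specific rule about package installs is just an example:

```python
#!/usr/bin/env python3
# Sketch of a "deny with guidance" PreToolUse hook. Assumes the Claude
# Code hook protocol: tool-call details as JSON on stdin, exit code 2
# blocks the call, and stderr is shown to the agent as guidance
# (a plain denial would just leave the agent stuck and retrying).
import json
import sys

def decide(event):
    """Return a guidance string to block the call, or None to allow it."""
    cmd = event.get("tool_input", {}).get("command", "")
    if event.get("tool_name") == "Bash" and "pip install" in cmd:
        return ("Don't install packages ad hoc. Add the dependency to "
                "pyproject.toml and run the project's sync script instead.")
    return None

def main(stream=sys.stdin):
    reason = decide(json.load(stream))
    if reason:
        print(reason, file=sys.stderr)
        sys.exit(2)                     # block, but tell the agent why
```

The script would invoke `main()` when run as a hook; the telling-why part is what turns a dead end into a course correction.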
Running multiple agents concurrently (QA, content, conversions, distribution), we hit this exact wall - agents didn't know what other agents had done, creating duplicate work and missed context.
Solved it with a stupidly simple approach:
1. A single TODO.md with "DO NOW" (unblocked), "BLOCKED", and "DONE" sections
2. Named output files per agent type (qa-status.md, scout-finds.md, etc.)
3. active-tasks.md for crash recovery: breadcrumbs from interrupted runs
4. Daily memory logs with session IDs for searchability
The key: File-based state is deterministic. After a crash, the next agent reads identical input, same decision rules, same output structure. Zero state collision, zero "what was I thinking?"
Deployment: ~8 agents on cron. They wake, read files, work, write results, die. No persistent terminal. No coordination overhead.
This turned "5 terminal tabs with unmanageable logs" into "grep yesterday's log, see exactly what happened."
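The deterministic part is easy to see in code. A sketch of the TODO.md parsing, assuming the sections use `##` headings and `-` items (the exact markdown shape is my assumption, the section names come from the comment above):

```python
# Sketch of deterministic file-based state: one TODO.md with
# DO NOW / BLOCKED / DONE sections that every agent parses the same way
# on wake-up. Markdown shape (## headings, - items) is an assumption.

def parse_todo(text):
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current:
            sections[current].append(line[2:].strip())
    return sections

def next_task(text):
    """Deterministic pick: first unblocked item, same answer every run."""
    todo = parse_todo(text).get("DO NOW", [])
    return todo[0] if todo else None
```

Given identical file contents, every agent (including one recovering from a crash) picks the same next task, which is the whole zero-state-collision argument.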
Patterns + implementation details: https://osolobo.com/first-ai-agent-guide/
A broken test doesn’t make the agentic coding tool go “ooooh, I made a bad assumption” any more than a type error or linter does.
All a broken test does is prompt me to prompt back “fix tests”.
I have no clue which one broke, or why, or what was missed, and it doesn’t matter. Actual regressions are different and not dependent on these tests; I follow along via type errors and LLM observability.
I distilled multiple software books into these flows and skills. With more books to come.
Here is an example https://github.com/ryanthedev/code-foundations
https://simonwillison.net/guides/agentic-engineering-pattern...
*Hoard things you know how to do*
It will make everything faster for you - even if you can ask AI it will be more costly to do it from scratch.
Also, it is nothing new under the sun. In the old days a developer would have his own stack of libraries and books, and would not need to `npm i` someone else's code because he would have a bunch of his own libraries ready to go. Of course one can say there will always be a library that is better than yours... but is it? :)
Has anyone setup a smooth agent setup for game art assets generation? (AI models already do great for shaders and VFX, but I would really love to automate model + texture + animation pipeline)
- tell the agent to write a plan, review the plan, tell the agent to implement the plan
- allow the agent to “self discover” the test harness (eg. “Validate this c compiler against gcc”)
- queue a bunch of tasks with // todo … and yolo “fix all the todo tasks”
- validate against a known output (“translate this to Rust and ensure it emits byte-for-byte identical output as you go”)
- pick a suitable language for the task (“go is best for this task because I tried several languages and it did the best for this domain in go”)
You can "pip install ziglang" and get the right version for different platforms too.
Please refer to https://ziglang.org/download/0.15.1/release-notes.html#Incre...
This has nothing to do with agentic engineering. This is just normal software development. Everybody wants faster compilation speed
So far I only have one: Inflicting unreviewed code on collaborators, aka dumping a thousand line PR without even making sure it works first https://simonwillison.net/guides/agentic-engineering-pattern...
It's true that in my company we're not building rockets or defense systems, maybe you guys are and in those scenarios it's less useful. But for typical LoB and/or consumer-facing software, AI is crushing it. Where I used to need 3 devs, now I just need one (and the support team around it: PM, BA, QA, Designer). For my business, AI has been a game changer.
Like an engineer overseeing the construction of a bridge, the job is not to lay bricks. It is to ensure the structure does not collapse.
The marginal cost of code is collapsing. That single fact changes everything.
Quite a heavy-lifting word here. You understand why people flagged that post, right? It's painfully non-human. I'm all for utilizing LLMs, but I highly suggest you read Simon's posts. He's obviously a heavy AI user, but even his blog posts aren't that inorganic, and that's why he became the new HN blog babe.
[0]: I personally believe Simon writes with his own voice, but who knows?
There's no actual way to determine if any words are from a silicon token generator or meat-based generator. It's not AI, it's human! Emdash. You're absolutely right!
system failure.
I would not equate software engineering to "proper" engineering insofar as being uttered in the same sentence as mechanical, chemical, or electrical engineering.
The cost of code is collapsing because web development is not broadly rigorous, robust software was never a priority, and everyone knows it. The people complaining that AI isn't good enough yet don't grasp that neither are many who are in the profession currently.
I think the externalities are being ignored. Training engineers takes time and money. Having all of your users' data stolen is punished with a slap on the wrist.
So replacing those bad workers with AI is fine. Unless you remove the incentives to be fast instead of good, then yeah, AI can be good enough for some cases.
Engineering is the practical application of science and mathematics to solve problems. It sounds like you're maybe describing construction management instead. I'm not denying that there's value here, but what you're espousing seems divorced from reality. Good luck vibecoding a nontrivial actuarial model, then having it pass the laundry list of reviews and getting large firms to actually pick it up.
That's a little harsh. I think most everyone would agree we're in a transformative time for engineering. Sure there's hype, but the adoption in our profession (assuming you're an engineer) isn't waning.
The claim here is profound: comprehension of the codebase at the function level is no longer necessary
It's not profound. I read the exact same awed blog post about how "agentic" is the future and you don't even need to know code anymore. It wasn't profound the first time, and it's even dumber that people keep repeating it - maybe they take all the time they saved not writing and use it to not read.
https://www.slater.dev/2025/09/its-time-to-license-software-...
Shameless plug: I wrote one. https://marmelab.com/blog/2026/01/21/agent-experience.html
https://agentexperience.ax/ describes it as "the holistic experience AI agents have when interacting with a product, platform, or system", which feels to me like a different concept from figuring out patterns for effectively using coding agents as a software engineer.
Test fail -> implement -> linter -> test pass
Another idea I've thought about is docs-driven development. The instructions might look like:
Write doc for feat/bug > test fail > implement > lint > test pass
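The loop above can be sketched as a gate the agent has to pass through. This is a minimal illustration, not any particular tool's API: the four callables are hypothetical stand-ins for the agent's steps, injected so the gate itself is testable.

```python
# Minimal sketch of the red/green gate described above. The four callables
# are hypothetical hooks for the agent's steps. The key check: the new test
# must fail BEFORE the implementation lands, or it proves nothing.

def red_green_cycle(write_failing_test, implement, lint, run_tests):
    write_failing_test()
    if run_tests():
        raise RuntimeError("test passed before implementation; it proves nothing")
    implement()
    lint()
    if not run_tests():
        raise RuntimeError("implementation did not make the test pass")
    return "green"
```

The point of encoding it this way is that a test which passes before the implementation exists gets rejected outright, rather than silently counted as coverage.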
The "give it bash" pattern sounds scary until you realize the alternative is 47 intermediate tool calls that fail silently.
Letting the agent write and run scripts means the agent debugs when something breaks. The feedback loop tightens dramatically.
The trick is sandboxing + cost limits. Not preventing shell access.
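A partial sketch of what "sandboxing, not preventing shell access" might look like at its simplest: a scrubbed environment plus a wall-clock cap. This is only the shape of the wrapper; a real setup would add a container or jail, filesystem isolation, and spend limits on top.

```python
import subprocess

# Run an agent-generated command with a scrubbed environment and a timeout.
# This is NOT a full sandbox - it only illustrates two of the fences.

def run_sandboxed(cmd: str, timeout: int = 30):
    result = subprocess.run(
        cmd,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout,                 # kill runaway scripts after `timeout` seconds
        env={"PATH": "/usr/bin:/bin"},   # no inherited secrets or API keys
        cwd="/tmp",                      # keep it out of the real working tree
    )
    return result.returncode, result.stdout, result.stderr
```

The scrubbed `env` is the important line: the agent's scripts never see the parent process's credentials, so a buggy or prompt-injected script can fail loudly instead of leaking quietly.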
Feels like a lot of words to say what amounts to: make the agent do the steps we know work well for building software.
They are, and that's deliberate.
Something I'm finding neat about working with coding agents is that most of the techniques that get better results out of agents are techniques that work for larger teams of humans too.
If you've already got great habits around automated testing, documentation, linting, red/green TDD, code review, clean atomic commits etc - you're going to get much better results out of coding agents as well.
My devious plan here is to teach people good software engineering while tricking them into thinking the book is about AI.
I am still not sold on agentic coding. We’ll probably get there within the next couple of years.
"Explain the codebase to a newcomer. What is the general structure, what are the important things to know, and what are some pointers for things to learn next?"
Once I saw the output I giddyup'd and haven't looked back.
Thank you Simon and I'm sure you would quickly fall off from #1 blogger on HN if you did. I insist on this for myself as well.
Somehow we are all getting really good at detecting "written by AI" with primal intuition.
The damn thing _talks_. You can just _speak_ to it. You can just ask it to do what you want.
COBOL's promise was that it was human-like text, so we wouldn't need programmers anymore.
The problem is that the average person doesn't know what their actual problems are in sufficient detail to get a working solution. When you get down to breaking down that problem... you become a programmer.
The main lesson of COBOL is that it isn't the computer interface/language that necessitates a programmer.
Agreed. I've spent the last few years building an EMR at an actual agency and the idea that users know what they want and can articulate it to a degree that won't require ANY technical decisions is pure fantasy in my experience.
At my job, we use a lot of AI to literally move fast and break things when working on internal tools. The idea is that the surface area is low, rollbacks are fast, and the upside is a lot better than the downside (our end users get a better experience to help them do their job better).
But our bottleneck is still requirements for the project. We routinely run out of stuff to do and have to ask for new stuff or work on a different project.
But you're absolutely right. Most people (programmers, managers, etc) don't know exactly what problems need to be solved, or at least, struggle to communicate it adequately for it to be implemented well enough. They say they want X. But they haven't thought about the repercussions of it, or that it requires Y first. AI might be able to help there, but it will give a totally bogus answer if it does not have any context of the domain, which is almost never documented in code.
These are still very much technical roles, but maybe we are becoming more "technical domain experts."
For example, "Generate me some repeatable code to ask system X for data about Y, pull out value Z, and submit it to system W."
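The shape of that glue task can be sketched in a few lines. Everything here is illustrative: `fetch_from_x` and `submit_to_w` are hypothetical stand-ins for calls to systems X and W (e.g. HTTP clients), injected so the pipeline can be exercised without any real systems.

```python
# Hypothetical sketch of the repeatable glue code described above.
# fetch_from_x and submit_to_w are injected stand-ins, not real APIs.

def relay_value(fetch_from_x, submit_to_w, record_id, field="z"):
    record = fetch_from_x(record_id)       # ask system X for data about Y
    value = record[field]                  # pull out value Z
    return submit_to_w(record_id, value)   # submit it to system W
```

Keeping the transport injected like this is also what makes the agent's output reviewable: the domain logic is three lines you can read, and the system-specific clients live elsewhere.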
I think attempts to document the most effective things to ask it to do in order to reach your overall goal, as well as what it is and is not good for, are probably worth making. It would be bad if it turned into a whole consultant-marketing OOP-coaching clusterfuck. But building some kind of community knowledge that these things aren't demigods, that they have limitations, and that doing things one way or another with them can work better is probably a good thing. At the very least, in theory it would cut down some of the hype.
There's a lifecycle to these hype runs, even when the thing behind the hype is plenty real. We're still in the phase where if you criticize AI you get told you don't "get it", so people are holding back some of their criticisms because they won't be received well. In this case, I'm not talking about the criticisms of the people standing back and taking shots at the tech, I'm talking about the criticisms of those heavily using it.
At some point, the dam will break, and it will become acceptable, if not fashionable, to talk about the real problems the tech is creating. Right now there is only the tiniest trickle from the folk who just don't care how they are perceived, but once it becomes acceptable it'll be a flood.
And there are going to be problems that come from using vast quantities of AI on a code base, especially of the form "created so much code my AI couldn't handle it anymore and neither could any of the humans involved". There's going to need to be a discussion on techniques on how to handle this. There's going to be characteristic problems and solutions.
The thing that really makes this hard to track though is the tech itself is moving faster than this cycle does. But if the exponential curve turns into a sigmoid curve, we're going to start hearing about these problems. If we just get a few more incremental improvements on what we have now, there absolutely are going to be patterns as to how to use AI and some very strong anti-patterns that we'll discover, and there will be consultants, and little companies that will specialize in fixing the problems, and people who propose buzzword solutions and give lots of talks about it and attract an annoying following online, and all that jazz. Unless AI proceeds to the point that it can completely replace a senior engineer from top to bottom, this is inevitable.
That's essentially the thing we are calling "cognitive debt".
I have a chapter with one small thing to help address that here - https://simonwillison.net/guides/agentic-engineering-pattern... - but it's a much bigger topic and will require extensive exploration by the whole industry to figure out.
I feel like there's a similar vibe coming with vibe coding. Just let the AI generate as much code as it wants; don't check it, because it doesn't matter: only the LLM will be reading it anyway.
My gut tells me that
1. there will still be reasons for humans to understand the code for a long time,
2. even the LLM will struggle with modifying code past a certain size and complexity without good encapsulation and well-thought-out system architecture and design.
yes. It sucks but I think it's good for the next generation of tech industry employees to watch this. It's happening quickly so you get a 10 year timeline compressed into a few years which makes it easier to follow and expose. The bloggers will come, then speakers, then there will be books. Consultants will latch on and start initiatives at their clients. Once enough large enterprises are sold on it, there will come associations and certification bodies so a company can say "we have X certified abc on staff". Manifestos will be released, version numbers will be incremented so there's a steady flow of work writing books, doing trainings, and getting the next level certified.
This is standard issue tech industry stuff (and it probably happens everywhere else too) but compressed into a tighter timeline so you don't have to wait a decade to see it unfold.
It's not as simple an observation as you're making it out to be.
> You can just ask it to do what you want.
Yes, but very clearly, as any HN thread on AI shows, different people are having VERY different outcomes with it. And I suspect it is largely the misconception that it will magically "just do what you want" that leads to poor outcomes.
The techniques mentioned -- coding, docs, modularity etc. -- may seem obvious now, but only recently did we realize that the primary principle emerging is "what's good for humans is good for agents." That was not at all obvious when we started off. It is doubly counter-intuitive given the foremost caveat has been "Don't anthropomorphize AI." I'm finding that is actually a decent way to understand these models. They are unnervingly like us, yet not like us.
All that to say, AI is essentially black magic and it is not yet obvious how to use it well for all people and all use-cases, so yes, more exposition is warranted.
The context suggests the former, but your criticisms bear no relation to the linked content. If anything, your edict to "write tests first" is even more succinctly expressed as "Red/green TDD".
Doesn't it sound like the "right incantation"? That's the point of LLMs, they can understand (*) intent. You'd get the same result saying "do tdd" or "do the stuff everyone says they do but they don't, with the failing test first, don't remember the name, but you know what I'm saying innit?"
I'm perhaps uncharitable, and this article just happens to take the collateral damage, but I'm starting to see the same corruption that turned "At regular intervals, the team reflects on how to become more effective" into "Mandatory retro exactly once every fortnight, on a board with precisely three columns".
While I agree with the sentiment that we shouldn't make things more complicated by inventing fancy names, we also shouldn't pretend that software engineering has become super simple now. Building a great piece of software remains super hard to do and finding better techniques for it affords real study.
Your post is annoying me quite a bit because it's super unfair to the linked post. Simon Willison isn't trying to coin a new term, he's just trying to start a collection of useful patterns. "Agentic engineering" is just the obvious term for software engineering using agents. What would you call it, "just asking things"?
I was speaking from a software engineer's point of view, in the context of the article, where one of the "agentic" patterns is ... test-driven development? Which you summon out of the agent by saying ... "Do test-driven development", more or less?
> What would you call it, "just asking things"?
I'd call it software engineering. What makes good software didn't suddenly change, and there's fundamentally no secret sauce to getting an agent to do it. If I think a class does too many things, then I tell the agent "This class does too many things, move X and Y methods into a new class".
We have simple and sensible stuff.
But then a bunch of assholes who don't know better and just want to milk $$$ will come over and ruin it for everyone.
I suspect that this time around, management will expect the AI chatbot to explain these things to you, because who pays for anything anymore if the AI can do it all.
If the answer is just "Install Oh My Opencode and stick any decent model in it" then it doesn't work.
And honestly, the answer is just to install Oh My Opencode and Kimi K2.5 and get 90% of the performance of Opus for a fraction of the price.
Basically, it's Waterfall for Agents. Lots of Capitalized Words to signify something.
Also they constantly call it the BMAD Method, even though the M already stands for method.
But can it pass the butter?
I mean - yeah. So do humans. But it turns out that a lot of humans require considerable process to organize productively too. A pet thesis of mine is that we are just (re-)discovering the usefulness of process and protocol.
There was already another attempt at agentic patterns earlier:
Absolute hot air garbage.
Secondly: this is a temporary vacuum state. We're only needed to bridge the gap.
I wouldn't be trying to be a consultant, I would be scurrying to ensure we have access to these tools once they're industrial. A "$5M button" to create any business function won't be within the reach of labor capital, but it will be for financial capital. That's the world we're headed to.
0: https://wiki.roshangeorge.dev/w/Blog/2025-12-01/Grounding_Yo...
Which is oddly close to how investment advice is given. If these techniques work so well, why give them up for free?
The thing I keep coming back to is that it's all code. Almost all white collar professions have at least some key outputs in code. Whether you are a store manager filling out reports or a marketing firm or a teacher, there is so much code.
This means you can give Claude Code a branded document template, have it fill it out, include images, etc., and upload the result to our cloud hosting.
With this same guidance and taste, I'm doing close to the work of 5 people.
Setup: Claude code with full API access to all my digital spaces + tmux running 3-5 tasks in parallel
A nice way to realize why this AI wave hasn't produced massive economic growth: it is mostly touching parts of the economy that are parasitic and can't really create growth.
1. Proxy-based governance. Route all LLM traffic through a governance layer. The agent never holds API keys directly — the proxy holds them and issues scoped, short-lived capability tokens (ES256, 60s TTL). Single enforcement point for scanning, classification, and audit.
2. Scan all message roles. Most people scan user input. In practice, PII and secrets show up in system messages (from frameworks like LangChain), tool responses, and assistant messages from previous turns. OpenAI's "developer" role is another unscanned vector.
3. Deterministic detection over LLM judges. Using a second model to evaluate the first sounds elegant but creates a recursive trust problem. Regex + text normalization (reversing ~24 obfuscation techniques) is boring but reliable and adds ~250ms, not seconds.
4. Fail-closed by default. If your policy engine goes down, block everything. Don't fail open.
5. Presets, not configuration. Nobody writes custom Rego policies from scratch. Ship starter/standard/regulated presets and let teams tune. These came from 12 rounds of red-teaming our own pipeline — about 300 test cases across encoding bypasses, multilingual injection, Unicode evasion, and tool-result poisoning.
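Point 3 (deterministic detection) is the most concrete of these, so here is a toy sketch of the idea: normalize away a few common obfuscations, then run plain regexes. The real pipeline above reverses ~24 techniques; this shows just three, with an email pattern as the example detector, and none of it reflects that system's actual code.

```python
import re
import unicodedata

# Toy sketch of "regex + text normalization" detection: undo a few
# obfuscations first, then match with a boring, deterministic pattern.

ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # fold fullwidth/compatibility forms
    text = text.translate(ZERO_WIDTH)            # strip zero-width characters
    # rewrite "alice [at] example.com" style spellings back to "@"
    text = re.sub(r"\s*[\[\(]\s*at\s*[\)\]]\s*", "@", text, flags=re.I)
    return text

def find_emails(text: str):
    return EMAIL.findall(normalize(text))
```

Because every step is a pure string transform, the same input always produces the same verdict — which is exactly the property you lose when a second LLM is the judge.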
This brings the Linux kernel-style patch => discuss => merge-by-maintainer workflow to agents. You get bisect-safe patches that you review, provide feedback on, and approve.
While a SKILL could mimic this, being built in allows me to place access control and 'gate' destructive actions so the LLM is forced to follow this workflow. Overall, this works really well for me. I am able to get bisect-safe patches, and then review / re-roll them until I get exactly what I want, then I merge them.
Sure this may be the path to software factories, but it scales 'enough' for medium size projects and I've been able to build in a way that I maintain strong understanding of the code that goes in.
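The mechanics of that loop can be demonstrated with plain git in a throwaway repo. Repo layout, branch names, and author identity here are all illustrative; the point is only that the agent's work arrives as a reviewable patch series, applied to a clean branch after sign-off.

```shell
# Throwaway demo of the patch => review => merge loop (illustrative names).
set -e
repo=$(mktemp -d)
series=$(mktemp)
cd "$repo"
git init -q -b main
git config user.email agent@example.com
git config user.name agent
echo base > file.txt
git add file.txt
git commit -qm "base"
# the agent works on its own branch...
git checkout -qb agent-work
echo change >> file.txt
git commit -qam "agent: proposed change"
# ...and emits its work as a patch series for human review
git format-patch main --stdout > "$series"
# the maintainer reads the patch text, then applies it to a clean branch
git checkout -q main
git checkout -qb review
git am -q < "$series"
git log --oneline -1
```

Because `git am` recreates the commits one by one, every applied patch is a bisect-safe point in history, and a rejected patch simply never lands.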
Colleagues don’t usually like to review AI-generated code. If they use AI to review code, that misses the point of doing the review. If they do the review manually (the old way), it becomes a bottleneck (we are faster at producing code now than we are at reviewing it).
I'm hoping to add more on that topic as I discover other patterns that are useful there.
I was expecting tips on code review instead based on your comment and GP.
like don't ask it to "write tests for this function", instead give it a function that's deliberately broken in a specific way, make it write a test that catches that bug, verify the test actually fails, THEN fix the function
this forces the test to be meaningful because it has to detect a real failure mode. if the agent can't make the test fail by breaking the code, the test is useless
the other thing that helps is being really specific about edge cases upfront. instead of "write tests for this API endpoint", say "write tests that verify it returns 400 when the email field is missing, returns 409 when the email already exists, returns 422 when the email is malformed" etc
agents are weirdly good at implementing specific test scenarios but terrible at figuring out what scenarios actually matter. which honestly is the same problem junior devs have lol
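The "break it first" discipline above can be sketched concretely. `dedupe` here is a hypothetical target function, deliberately broken in one specific way (it destroys ordering), and the test only earns trust because it catches exactly that breakage.

```python
# Sketch of the break-it-first technique: a test is only meaningful once
# it has failed against a known-broken version of the function.

def broken_dedupe(items):
    # deliberately broken: sorting destroys the original order
    return sorted(set(items))

def fixed_dedupe(items):
    # correct version: first occurrence wins, order preserved
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def order_preserving_dedupe_test(dedupe):
    # meaningful precisely because it detects the specific breakage above
    return dedupe(["b", "a", "b", "c"]) == ["b", "a", "c"]
```

If `order_preserving_dedupe_test` passed against `broken_dedupe`, it would be useless by the comment's own standard: a test that cannot distinguish the broken function from the fixed one proves nothing.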
Dismissing everything AI as slop strikes me as an attitude that is not going to age well. You’ll miss the boat when it does come (and I believe it already has).
Is the boat:
1) unmissable since the tools get better all the time and are intelligent
or
2) nearly-impossible to board since the tools will replace most of the developers
or
3) a boat of small productivity improvements?