* A task estimated at 4 hours → solved with one well specified prompt
* A 20 hour engineering effort → executed in about 3 hours
* A 3 month project → delivered in 1 month
These are clearly best case scenarios. They are not the norm, yet. But they demonstrate what is possible.
We have also seen what happens when things go wrong. Companies, including startups, come to us with broken systems and spaghetti code and architecture caused by weak prompts, unclear requirements, and no verification.
It is important to understand that the efficiency gains we are seeing do not come from the tools alone. They come from a specific combination:
1) Engineers who have spent 20 years building everything from robotics to enterprise-scale technology. You cannot give a perfect instruction to an AI if you do not know what perfect looks like in a production environment.
2) A technical prompt should not be treated as a quick input or question. It is a detailed specification that requires experience and deliberate thinking.
3) Knowing the right combination of tools, workflows, and validation processes.
That said, some (many?) members of our team are dinosaurs in the software engineering world. They bring a ton of experience but are used to tools from 15 years ago and don't like change. We really had to push AI adoption (mostly Cursor and Claude Code) on them. It’s still an ongoing process, and probably will be for a while.
Right now, Claude is getting trained by hundreds of thousands of programmers showing it how to ask the right architecture + PM questions.
They're just patterns, like anything else in our industry, and most of them are pretty standard patterns.
Like when I think back on 20 years of software architecture and BA work, I've done the same thing over and over. I must have implemented 4 PO systems, 3 different custom chat systems, SMS systems for reminders, monthly summary emails, etc.
We (12+ people on the team) see more LOC and faster first drafts, but also more review work. PRs look done early, but often hide shallow thinking or edge cases. Velocity goes up on paper. Review fatigue goes up too.
The best teams treat AI like a junior dev with infinite energy. Great for boilerplate and refactors, but you still need ownership. Otherwise you just ship bugs faster, which is not great honestly.
Wdyt?
At first, it proceeded very quickly. Using agents, the team were able to generate a lot of code very fast, and so they were checking off requirements at an amazing pace. PRs were rubber stamped, and I found myself arguing with copy/pasted answers from an agent most of the time I tried to offer feedback.
As the components started to get more integrated, things started breaking. At first these were obvious things with easy fixes, like some code calling other code with wrong arguments, and the coding agents could handle those. But a lot of the code was written in the overly-defensive style agents were fond of, so there were a lot more subtle errors. Things like the agent adding code to substitute an invalid default value in instead of erroring out, far away from where that value was causing other errors.
At this point, the agents started making things strictly worse because they couldn't fit that much code in their context. Instead of actually fixing bugs, they'd catch any exceptions and substitute in more defaults. There was some manual work by some engineers to remove a lot of the defensive code, but they could not keep up with the agents. This is also about when the team discovered that most of the tests were effectively "assert true" because they mocked out so much.
We did ship the project, but it shipped in an incredibly buggy state, and also the performance was terrible. And, as I said, it's now being wound down. That's probably the right thing to do because it would be easier to restart from scratch than try to make sense of the mess we ended up with. Agents were used to write the documentation, and very little of it is comprehensible.
We did screw some things up. People were so enthusiastic about agents, and they produced so much code so fast, that code reviews were essentially non-existent. Instead of taking action on feedback in the reviews, a lot of the time there was some LLM-generated "won't do" response that sounded plausible enough that it could convince managers that the reviewers were slowing things down. We also didn't explicitly figure out things like how error-handling or logging should work ahead of time, and so what the agents did was all over the place depending on what was in their context.
Maybe the whole mess was a necessary learning as we figure out these new ways of working. Personally I'm still using the coding agents, but very selectively to "fill-in-the-blanks" on code where I know what it should look like, but don't need to write it all by hand myself.
Various possibilities I suppose - I could just be overly skeptical, or failures might be kept on the down low, or perhaps most likely: many companies haven't actually reached the point where the hidden tech debt of using these things comes full circle.
Having been through it, what's your current impression of the success stories when you come across them?
Several different engineering teams from different parts of the company had to come together for this, and the overall architecture was modular, so there was a lot of complexity before we had to start integrating. We have some company-wide standards and conventions, but they don't cover everything. To work on the code, you might need to know module A does something one way and module B does it in a different way because different teams were involved. That was implicit in how human engineers worked on it, and so it wasn't explicitly explained to the coding agents.
The project was in the life sciences space, and the quality of code in the training data has to be worse than something like a B2B SaaS app. A lot of code in the domain is written by scientists, not software engineers, and only needs to work long enough to publish the paper. So any code an LLM writes is going to look like that by default unless an engineer is paying attention.
I don't know that either of those would be insurmountable if the company were willing to burn more tokens, but I'd guess it's an order of magnitude more than we spent already.
There are politics as well. There have been other changes in the company, and it seems like the current leadership wants to free up resources to work on completely different things, so there's no will to throw more tokens at untangling the mess.
I don't disbelieve the success stories, but I think most of them are either at the level of following already successful patterns instead of doing much novel, or from companies with much bigger budgets for inference. If Anthropic burns a bunch of money to make a C compiler, they can make it back from increased investor hype, but most companies are not in that position.
Also funny to hear reactions when I'm grepping code for a function in front of someone and I hear "wow you look for code I would just ask Gemini" topkek
The LLM-produced code I've seen thus far (from moderate feature PRs up to and including entire services), tends twords 'too large' or 'too complex' to properly vet by their small team of creators; PRs against existing codebases are often riddled with minor, seemingly unrelated changes or worse: large and/or subtle test suite alterations that "all pass" but contain hidden assumptions in conflict with reality and your business requirements. Numerous edge cases go entirely missed, overly trivial things are validated instead, etc. Feedback loops with the system certainly improves output but is frustrating and in many cases no faster than just writing the code yourself. At this stage you still need a human in the loop, unless you're still in the earliest stages of building a product. This being HN I'm sure someone around here is employing LLMs successfully in those cases, but the story for established orgs tends to be more complex, especially in traditionally risk averse fields like healthcare, billing, credit card processing, defense, etc. All fields I've worked in.
The sheer amount of code these LLM systems produce in aggregate means that we have a deficit of cognitive spoons across our orgs to properly review and test all these changes that are being pumped out. In other words: more bugs end up getting found in the field instead of earlier because we're allowing our generators to validate themselves due to resource constraints.
Can we produce code faster than ever before? Sure, but that was never the real bottleneck to begin with, at least in the orgs that I've operated within the past 2 decades.
To me, every new line of code needs an inherent justification for its existence. Code is often as much a liability as it is an asset. Think on several axis like risk of security vulnerabilities or cognitive load limits when no one left in your org understands your vibe-coded system anymore, as it grows more unwieldy by the day. The business value of new code being introduced should outway its (not so) hidden cost of maintenance and risk of business disruption. With today's emphasis on using LLM output for "moving faster", we seem to be ignoring that critical risk/reward analysis and operating under the incorrect assumption that more code is universally a good thing.
So for the moment I'm just using LLMs for "rubber ducking" or spitballing/testing out ideas before committing to a course of action; an action that will ultimately get executed by a human in the short term at least.
That said, on a good day my AI can serve as a proxy for a semi-competent software engineer (with amnesia), and no worse than an actual rubber duck on the others.
(note that the above statements are my own personal observations and are not intended to represent or express any statement or opinion held by my employer, etc, etc.)