That is not true, and the proof is that LLMs _can_ reliably generate (relatively small amounts of) working code from relatively terse descriptions. Code is the detail being filled in. Furthermore, LLMs are the ultimate detail fillers, because they are language interpolation/extrapolation machines. And their popularity is precisely because they are usually very good at filling in details: LLMs use their vast knowledge to guess what detail to generate, so the result usually makes sense.
This doesn't detract much from the main point of the article though. Sometimes the interpolated detail is wrong (and indeterministic), so, if reliable result is to be achieved, important details have to be constrained, and for that they have to be specified. And whereas we have decades of tools and culture for coding, we largely don't have that for extremely detailed specs (except maybe at NASA or similar places). We could figure it out in the future, but we haven't yet.
LLMs can generate (relatively small amounts of) working code from relatively terse descriptions, but I don’t think they can do so _reliably_.
They’re more reliable the shorter the code fragment and the more common the code, but they do break down for complex descriptions. For example, try tweaking the description of a widely-known algorithm just a little bit and see how good the generated code follows the spec.
> Sometimes the interpolated detail is wrong (and indeterministic), so, if reliable result is to be achieved
Seems you agree they _cannot_ reliably generate (relatively small amounts of) working code from relatively terse descriptions
this works well for me
Only with well-known patterns that represent shared knowledge specified elsewhere. If the details they “fill in” each time differ in ways that change behavior, then the spec is deficient.
If we “figure out” how to write such detailed specs in the future, as you suggest, then that becomes the “code”.
After awhile, I think we all get a sense of not only the amount of micro-decisions you have to make will building stuff (even when you're intimate with the domain), but also the amount of assumptions you'll need to make about things you either don't know yet or haven't fully fleshed out.
I'm painfully aware of the assumptions I'm making nowadays and that definitely changes the way I build things. And while I love these tools, their ability to not only make assumptions, but over-engineer those assumptions can have disastrous effects.
I had Claude build me a zip code heat map given a data source and it did it spectacularly. Same with a route planner. But asking it build out medical procedure documentation configurations based off of a general plan DID NOT work as well as I had expected it would.
Also, I asked Claude about what the cron expression I wrote would do, and it got it wrong (which is expected because Azure Web Jobs uses a non-standard form). But even after telling it that it was wrong, and giving it the documentation to rely on, it still doubled down on the wrong answer.
It might be illuminating to see what a mathematically precise specification can and cannot do when it comes to generating programs. A major challenge in formal methods is proving that the program implements the specification faithfully, known as the specification gap. If you have a very high level and flexible specification language, such as TLA+, there is a lot of work to do to verify that the program you write meets the specification you wrote. For something like Synquid that is closer to the code there are more constraints on expressivity.
The point is that spoken language is not sufficiently precise to define a program.
Just because an LLM can fill in plausible details where sufficient detail is lacking doesn’t indicate that it’s solving the specification gap. If the program happens to implement the specification faithfully you got lucky. You still don’t actually know that’s true until you verify it.
It’s different with a narrow interface though: you can be very precise and very abstract with a good mathematical system for expressing specifications. It’s a lot more work and requires more training to do than filling in a markdown file and trying to coax the algorithm into outputting what you want through prose and fiction.
you just (correctly) negated your own claim
They can generate boilerplate, sure. Or they can expand out a known/named algorithm implementation, like pulling in a library. But neither of those is generating detail that wasn't there in the original (at most it pulls in the detail from somewhere in the training set).
Some of the missing pieces come from memory, knowing which topics I like to explore, some from the model itself, either baked in knowledge or what it picks up searching, but they can definitely take a vague, handwavy half baked idea and whip up a full app or game or whitepaper. Sometimes it's "exactly what I wanted!", other times it's "exactly the kind of thing I was talking about!"
Semantics and context and nuance are part and parcel of LLM capabilities. Superhuman in some areas, definitely subhuman in others.
AI is getting pretty competent and clever.
I'm absolutely on board with that. We probably need less weird and outlier decisions in designs for something that is a boring ass business website.
> Sometimes the interpolated detail is wrong (and indeterministic)
... You consider incorrect, non deterministic results to be "reliable"?
Heck, if I reimplement something I worked on a month ago it’s probably not going to be the exact same. Being non deterministic needn’t to be a problem, as long as it falls within certain boundaries and produces working results.
An awful lot of time has been put into the compiler to know which registers to use and how to juggle them. Is an LLM any different in its behavior (albeit different in how it was trained)? If not, then specs are just an even higher level programming language.
I think the difference is that, with a C compiler, when it gets it wrong I’ll have some terrible performance impact, it when the LLM gets it wrong, it will do something nobody wanted, like delete someone’s account or debit one account without crediting another.
Not a mere 5 years ago even tech people were chortling down their upturned noses that people were complaining that their jobs were being “taken”, and now that the turns have tabled, there is a bunch of denial, anger, and grief going on, maybe even some depression as many of the recently unemployed realize the current state of things.
It’s all easy to deride the inferiority of AI when you’re employed in a job doing things as you had been all your career, thinking you cannot be replaced… until you find yourself on the other side of the turn that has tabled.
This kind of thing happens at least once per day to me, maybe more.
I am not denying that it is useful, let me be clear. It is extremely convenient, especially for mechanical tasks. It has other advantages like quick exploration of other people's code, for example. If my employer didn't provide a corporate account for me, I would pay one from my own pocket.
That said, I agree with OP and the author that it is not reliable when producing code from specs. It does things right, I would say often. That might be good enough for some fields/people. It's good enough for me, too. I however review every line it produces, because I've seen it miss, often, as well.
It's some nice rhetoric, but you're not actually saying much.
LLMs make developers more efficient. That much is obvious to anyone who isn't blinded by fear.
But people will respond "but you still need developers!" True. You don't need nearly as many, though. In fact, with an LLM in their hands, the poor performers are more of a liability than ever. They'll be let go first.
But even the "smart" developers will be subsumed, as vastly more efficient companies outcompete the ones where they work.
Companies with slop-tolerant architectures will take over every industry. They'll have humans working there. But not many.
This is exactly the argument in Brooks' No Silver Bullet. I still believe that it holds. However, my observation is that many people don't really need that level of details. When one prompts an AI to "write me a to-do list app", what they really mean is that "write me a to-do list app that is better that I have imagined so far", which does not really require detailed spec.
If someone was making a serious request for a to-do list app, they presumably want it to do something different from or better than the dozens of to-do list apps that are already out there. Which would require them to somehow explain what that something was, assuming it's even possible.
For some problems, it is. Web front-end development, for example. If you specify what everything has to look like and what it does, that's close to code.
But there are classes of problems where the thing is easy to specify, but hard to do correctly, or fast, or reliably. Much low-level software is like that. Databases, file systems, even operating system kernels. Networking up to the transport layer. Garbage collection. Eventually-consistent systems. Parallel computation getting the same answer as serial computation. Those problems yield, with difficulty, to machine checked formalism.
In those areas, systems where AI components struggle to get code that will pass machine-checked proofs have potential.
I guess it depends on whether or not we want to make money, or otherwise, compete against others.
But that’s most of the time is not that they want it from objective technical reasons.
They want it because they want to see if they can push you. They do it „because they can”. They do it because later they can renegotiate or just nag and maybe pay less. Multiple reasons that are not technical.
What does "better" mean here? i suspect that "better" would be defined...in a spec
The compression ratio is the vibe coding gain.
I think that way of phrasing it makes it easier to think about boundaries of vibe coding.
"A class that represents (A) concept, using the (B) data structure and (C) algorithms for methods (D), in programming language (E)."
That's decodeable, at least to a narrow enough distribution.
"A commercially successful team communication app built around the concept of channels, like in IRC."
Without already knowing Slack, that's not decodable.
Thinking about what is missing is very helpful. Obviously, the business strategic positioning, non technical stakeholder inputs, UX design.
But I think it goes beyond that: In sufficiently complex apps, even purely technical "software engineering" decisions are to some degree learnt from experiment.
This also makes it more clear how to use AI coding effectively:
* Prompt in increments of components that can be encoded in a short prompt.
* If possible, add pre-existing information to the prompt (documentation, prior attempts at implementation).
"Informally, from the point of view of algorithmic information theory, the information content of a string is equivalent to the length of the most-compressed possible self-contained representation of that string. A self-contained representation is essentially a program—in some fixed but otherwise irrelevant universal programming language—that, when run, outputs the original string."
Where it gets tricky is the "self-contained" bit. It's only true with the model weights as a code book, e.g. to allow the LLM to "know about" Slack.
I love this. I've been circling this idea for a while and you put into words what I've struggled to describe.
> "A commercially successful team communication app built around the concept of channels, like in IRC." > Without already knowing Slack, that's not decodable.
I would like to suggest that implicit shared context matters here. Or rather, humans tend to assume more shared context than LLM's actually have, and that misleads us when it comes assessing the aforementioned decoder.
But I think it also suggests that there is a system that could be built with strong constraints and saliency that could really explode the compression ratio of vibe coding.
But there is an entire cohort of people who can think about specifying systems but lack the training to sdo so so using the current methods and see a lower barrier to entry in the natural language.
That doesn't mean the LLM is going to think on your behalf (although there is also a little bit of that involved and that's where stuff gets confusing) but it surely provides a completely different interface for turning your ideas into working machinery
"Specifying" is the load-bearing term there. They are describing what they want to some degree, how how specifically?
Nah, it will be extremely surprising if even 1 such a person exists.
On the other hand, there are lots of people that can write code, but still can't specify a system. In fact, if you keep increasing the size of the system, you will eventually fit every single programmer in that category.
The uninitiated can continue trying to clumsily refer to the same concepts, but with 100x the tokens, as they lack the same level of precision in their prompting. Anyone wanting to maximize their LLM productivity will start speaking in this unambiguous, highly information-dense dialect that optimizes their token usage and LLM spend...
Setting aside the problem of training, why bother prompting if you’re going to specify things so tightly that it resembles code?
[1] https://en.wikipedia.org/wiki/Lojban
[2] Someone speaking it: https://www.youtube.com/watch?v=lxQjwbUiM9w
Context pollution is a bigger problem.
E.g., those SKILL.md files that are tens of kilobytes long, as if being exhaustively verbose and rambling will somehow make the LLM smarter. (It won't, it will just dilute the context with irrelevant stuff.)
Some languages don't have this kind of vocabulary, because there aren't enough speakers that deal with technical things in a given area (and those that do, use another language to communicate)
In my mind this is solving different problems. We want it to parse out our intent from ambiguous semantics because that's how humans actually think and speak. The ones who think they don't are simply unaware of themselves.
If we create this terse and unambiguous language for LLMs, it seems likely to me that they would benefit most from using it with each other, not with humans. Further, they already kind of do this with programming languages which are, more or less, terse and unambiguous expression engines for working with computers. How would we meaningfully improve on this, with enough training data to do so?
I'm asking sincerely and not rhetorically because I'm under no illusion that I understand this or know any better.
Ah, the Lisp curse. Here we go again.
coincidently, the 80s AI bubble crashed partly because Lisp dialetcts aren't inter-changable.
We literally have proof that an iron age ontology of meaning as represented in Chinese characters is 40% more efficient than naive statistical analysis over a semi phonetic language and we still are acting like more compute will solve all our problems.
Can you elaborate? I think you're talking about https://github.com/PastaPastaPasta/llm-chinese-english , but I read those findings as far more nuanced and ambiguous than what you seem to be claiming here.
Post a link because until you do, I’m almost certain this is pseudoscientific crankery.
Chinese characters are not an “iron age ontology of meaning” nor anything close to that.
Also please cite the specific results in centuries-old “language theory” that you’re referring to. Did Saussure have something to say about LLMs? Or someone even older?
Since every invocation of an LLM may create a different program, just like people, we will see that the spec will leave much room for good and bad implementations, and highlight the imprecision in the spec.
Once we start using a particular implementation it often becomes the spec for subsequent versions, because it's interfaces expose surface texture that other programs and people will begin to rely on.
I'm not sure how well LLMs will fare are brownfield software development. There is no longer a clean specification. Regenerating the code from scratch isn't acceptable. You need TPS reports.
This perfectly explains the feeling I had when, 20 years into my career, I had to start writing specs. I could never quite put my finger on why it was harder than coding. My greater familiarity with coding didn't seem a sufficient explanation.
When writing a line of spec, I had to consider how it might be interpreted in the context of different implementations of other lines of spec - combinatorial nightmare.
But code is just a spec as far as, say, a C compiler is concerned. The compiler is free to implement the assembly however it likes. Writing that spec is definitely easier than writing the assembly (Fred Brookes said this, so it must be true).
So why the difference?
But much of the code we run today is JIT executed, and that leaves ample room for exploiting with weird machines. Eg the TOCTOU in the Corina exploit.
Even at this very low level, full coverage specs require years of careful formal methods work. We have no hope of doing it at for vibe coding, everything will be iterative, and if TDD helps then good, but specs are by no means easier than code.
Not at all. Code is formal, and going from C to assembly os deterministic. While the rules may be complex, they are still rules, and the compiler can’t stray away from them.
Writing C code is easier than writing assembly because it’s an easier notation for thinking. It’s like when hammering a nail. Doing it with a rock is hard. But using a hammer is better. You’re still hitting the nail with a hard surface, but the handle, which is more suitable for a human hand, makes the process easier.
So programming languages are not about the machine performance, they are about easier reasoning. And good programmers can get you to a level above that with proper abstraction and architecture, and give you concepts that directly map to the problem space (Gui frameworks instead of directly rendering to a framebuffer, OS syscall instead of messing with hardware,…).
The spec should express all relevant constraints. If your spec admits two things and only one is admissible in your mind, your spec is incomplete.
> has a massive envelope
The size of the envelope is less relevant than the expressivity of the language used to express subsets of that envelope. But almost always there is some logic which is more succinct for expressing the spec than the programming language used to express the implementation.
> your security posture is now a function of how exhaustive your spec is
The alternative is that your security posture is a function of unstated intentions living in somebody’s brain. This alternative seems strictly worse.
> You'd need to enumerate what NOT to do
This is equivalent to declaring what you must do and usually there is a succinct way to do this that does not involve listing every negative example
I’m not sure if I agree. A good spec will specify what the program is supposed to do rather than how the program should do that thing. It should be easier to specify what rather than how. Moreover, it should be standard practice to express what before you start writing how.
I see some people pushing back on this by saying two programs that satisfy the same spec might have different performance or security properties. That is correct and if you care about those things, you should specify them. Writing down these properties, e.g., “the program is O(n) where n is blah” should be much easier than implementing a non-trivial linear time algorithm founded in deep ideas.
Natural language is imperfect, code is exact.
The goal of specs is largely to maintain desired functionality over many iterations, something that pure code handles poorly.
I’ve tried inline comments, tests, etc. but what works best is waterfall-style design docs that act as a second source of truth to the running code.
Using this approach, I’ve been able to seamlessly iterate on “fully vibecoded” projects, refactor existing codebases, transform repositories from one language to another, etc.
Obviously ymmv, but it feels like we’re back in the 70s-80s in terms of dev flow.
IMHO this could be achieved with large set of tests, but the problem is if you prompt an agent to fix tests, you can't be sure it won't "fix the test". Or implement something just to make the test pass without looking at a larger picture.
The latter notion probably is true, but the prior isn’t necessarily true because you can map natural language to strict schemas. ”Implement an interface for TCP in <language>” is probably shorter than the actual implementation in code.
And I understand my example is pedantic, but it extends to any unambiguous definitions. Of course one can argue that TCP spec is not determimistic by nature because natural language isn’t. But that is not very practical. We have to agree to trust some axioms for compilers to work in the first place.
To your point, there are some cases where a short description is sufficient and may have equal or less lines than code (frequently with helper functions utilizing well known packages).
In either case, we’re entering a new era of “compilers” (transpilers?), where they aren’t always correct/performant yet, but the change in tides is clear.
LLM -> Spec is easier, especially with good tools that can communicate why the spec fails to validate/compile back to the LLM. Better languages that can codify things like what can actually be called at a certain part of the codebase, or describe highly detailed constraints on the data model, are just going to win out long term because models don't get tired trying to figure this stuff out and put the lego bricks in the right place to make the code work, and developers don't have to worry about UB or nasty bugs sneaking in at the edges.
With a good 'compilable spec' and documentation in/around it, the next LLM run can have an easier time figuring out what is going on.
Trying to create 'validated english' is just injecting a ton of complexity away from the area you are trying to get actual work done: the code that actually runs and does stuff.
For a manager, the spec exists in order to create a delgation ticket, something you assign to someone and done. But for a builder, it exists as a thinking tool that evolves with the code to sharpen the understanding/thinking.
I also think, that some builders are being fooled into thinking like managers because ease, but they figure it out pretty quickly.
I really wish people would start defining what they mean when they say code is deterministic. For instance, code is not deterministic in the sense that it admits a single interpretation when compiled. C code admits many assembly interpretations and the assembly you get will vary depending on your compiler and parameters.
When people say code is deterministic I think they mean it has a formal semantics. But these formal semantics may admit any one of many interpretations.
There's your own example defining how code is deterministic. Despite being compiled to different assembly interpretations, the code does the same thing.
To invert your polynomial analogy, the specification might call for a sine wave, your code will generate a Taylor series approximation that is computable.
When you write software to solve a problem you start with as little as possible details such that when you read it, it would only talk about the business logic. What I mean by that? That you abstract away any other stuff that does not concern the domain you are in, for example your first piece of code should never mention any kind of database technology, it should not contain any reference to a specific communication layer (HTTP for example), and so on.
When you get to this, you have summarized the "spec", and usually it can be read very easily by a non-techincal person (but obviously is a "domain expert") and can also be very testable.
I hope this helps on why the author is right 100%
I guess many of us quality for british parliament.
“With this machine, all the results will be correct, no more errors in log tables”
“And what if people put the wrong figures in?” (Hint - we’d still have the wrong results)
Babbage walks away thinking them an idiot, they walk away thinking Babbage hasn’t considered anything outside of the machine itself.
But if I interpret the question with line of thinking "should I anticipate right/ full answers despite incorrect/ incomplete inputs?" I think Baggage was pointing out the problem in the logic why such questions should arise.
I would expect the question to be phrased "under what circumstances the machine with provide wrong outputs?", and would have hoped for Babbage (or may be anyone) explaining many ways how things could go wrong.
Plan mode is a trap. It makes you feel like you're actually engineering a solution. Like you're making measured choices about implementation details. You're not, your just vibe coding with extra steps. I come from an electrical engineering background originally, and I've worked in aerospace most of my career. Most software devs don't know what planning is. The mechanical, electrical, and aerospace engineering teams plan for literal years. Countless reviews and re-reviews, trade studies, down selects, requirement derivations, MBSE diagrams, and God knows what else before anything that will end up in the final product is built. It's meticulous, detailed, time consuming work, and bloody expensive.
That's the world software engineering has been trying to leave behind for at least two decades, and now with LLMs people think they can move back to it with a weekend of "planning", answering a handful of questions, and a task list.
Even if LLMs could actually execute on a spec to the degree people claim (they can't), it would take as long to properly define as it would to just write it with AI assistance in the first place.
If you use it that way fine, but in general I'm talking about the idea that can you plan throughly enough in advance to get it to produce tens of thousands of lines of quality working code.
Like the author of the article points out, that takes so much effort by the time your done you may as well have just written the code.
Specification means requirements. I like EARS [0] syntax for requirements.
e.g. "while an error is present, the software shall ignore keypresses"
Requirements are not code at all, they are expectations about what the code does; it is the contract about what developers will be held accountable to. Putting implementation details in requirements is a rookie mistake because it takes agency away from the engineers in finding the best solution.
The spec discussed in this article is more akin to the level of detail appropriate in an interface control document (ICD). Very common for a requirement to declare the software shall be compliant to a revision of an ICD.
My own thoughts are: like a good systems engineer that recognizes the software engineers know their domain better than they, we should write specification for AI that leaves room for it to be more clever than we ourselves are. What's the point of exhaustive pseudocoding, it's worse coding. Align on general project preferences, set expectations, and concentrate effort on verifying.
How probable is that given that LLMs output the average of their training data? Your own cleverness would have to be below average for this to hold.
I think it follows that if you ask it to implement something, it will implement the average thing.
But then it also follows that if you give it the average thing and ask it to improve it by doing something clever, it may identify a clever improvement if such an improvement has ever been done in an analogous setting that exists in its training set.
> Your own cleverness would have to be below average
In this case the difference between myself and the LLM is that I’ve never seen the clever thing before (or I cannot recall it as effectively), but the LLM has and the LLM is specialized to recall it
- Define the data structures in the code yourself. Add comments on what each struct/enum/field does.
- Write the definitions of any classes/traits/functions/interfaces that you will add or change. Either leave the implementations empty or write them yourself if they end up being small or important enough to write by hand (or with AI/IDE autocompletion).
- Write the signatures of the tests with a comment on what it's verifying. Ideally you would write the tests yourself, specially if they are short, but you can leave them empty.
- Then at this point you involve the agent and tell it to plan how to complete the changes without barely having to specify anything in the prompt. Then execute the plan and ask the agent to iterate until all tests and lints are green.
- Go through the agent's changes and perform clean up. Usually it's just nitpicks and changes to conform to my specific style.
If the change is small enough, I find that I can complete this with just copilot in about the same amount of time it would take to write an ambiguous prompt. If the change is bigger, I can either have the agent do it all or do the fun stuff myself and task the agent with finishing the boring stuff.
So I would agree with the title and the gist of the post but for different reasons.
Example of a large change using that strategy: https://github.com/trane-project/trane/commit/d5d95cfd331c30...
I found that to be really vital for good code. https://fsharpforfunandprofit.com/rop/
I’ve found that two to three iterations with various prompts or different models will often yield a surprising solution or some aspect I hadn’t thought of or didn’t know about.
Then I throw away most or all of the code and follow your process, but with care to keep the good ideas from the LLMs, if any.
The only vibecoded thing was an iOS app and I didn't follow this process because I don't know iOS programming nor do I want to learn it. This only works if you know at least how to define functions and data structures in the language, but I think most PMs could learn that if they set their minds to it.
- Do a brainstorming session with the LLM about your idea. Flesh out the major points of the product, who the stakeholders are, what their motivations and goals are, what their pain points are. Research potential competitors. Find out what people are saying about them, especially the complaints.
- Build a high level design document with the LLM. Go through user workflows and scenarios to discern what kinds of data are needed, and at what scale.
- Do more market research to see how novel this approach is. Figure out what other approaches are used, and how successful they are. Get user pain points with each approach if you can. Then revisit your high level design.
- Start a technical design document with the LLM. Figure out who the actors of the system are, what their roles are in the system, and what kinds of data they'll need in order to do their job.
- Research the technologies that could help you build the system. Find out how popular they are, how reliable they are, how friendly they are (documentation, error messaging, support, etc), their long-term track record, etc. These go into a research document.
- Decide based on the research which technologies match your system best. Start a technical document with the LLM. Go through the user scenarios and see how the technologies fit.
- Decide on the data structures and flows through the system. Caching, load balancing, reliability, throughput requirements at the scale you plan to reach for your MVP and slightly beyond. Some UX requirements at this point are good as well.
- Start to flesh out your interfaces, both user and machine. Prototype some ideas and see how well they work.
- Circle back to research and design based on your findings. Iterate a few times and update the documents as you go using your LLM. Try to find ways to break it.
- Once you're happy with your design, build an architecture document that shows how the whole system will concretely fit together.
- Build an implementation plan. Run it through multiple critique rounds. Try to find ways to break it.
- Now you're at the threshold where changes get harder. Build the implementation piece by piece, checking to make sure they work as expected. This can be done quickly with multiple LLMs in parallel. Expect that the pieces won't fit and you'll need to rethink a lot of your assumptions. Code will change a LOT, so don't waste much time making it nice. You should have unit and integration tests and possibly e2e tests, which are cheap for the LLM to maintain, even if a lot of them suck.
- Depending on how the initial implementation went, decide whether to keep the codebase and refine it, or start the implementation over using the old codebase for lessons learned.
Basically, more of the same of what we've been doing for decades, just with faster tools.
I have always done that steps with my team and the results are great.
If you are a solo developer I understand that the LLM can help somewhat but not replace a real team of developers.
fn sin(x: f16) -> f16
There are only 64k different f16s. Easy enough to test them all. A given sin() is either correct or it's not.Yet sin() here can have a large number of different implementations. The spec alone under-determines the actual code.
The code will always be an imperfect projection of the specification, and that is a feature. It must be decoupled to some extent or everything would become incredibly brittle. You do not need your business analysts worrying about which SQLite provider is to be used in the final shipped product. Forcing code to be isomorphic with spec means everyone needs to know everything all the time. It can work in small tech startups, but it doesn't work anywhere else.
Most people don't specify or know that they want a U bend in their pipes or what kind or joints should be used for their pipes.
The absence of U bends or use or poor joints will be felt immediately.
Thankfully home building is a relatively solved problem whereas software is everything bespoke and every problem slightly different... Not to mention forever changing
- Delete code and start all over with the spec. I don't think anyone's ready to do that.
- Buy a software product / business and be content with just getting markdown files in a folder.
I haven't heard of being content paying for a product consisting of markdown files. Though I could imagine people paying for agent skill files. But yet, the skills are not the same product as say, linear.
I used to be on that side of the argument - clearly code is more precise so it MUST be simpler than wrangling with the uncertainty of prose. But precision isn't the only factor in play.
The argument here is that essential complexity lives on and you can only convert between expressions of it - that is certainly true but it's is overlooking both accidental complexity and germane complexity.
Specs in prose give you an opportunity to simplify by right-sizing germane complexity in a way that code can't.
You might say "well i could create a library or a framework and teach everyone how to use it" and so when we're implementing the code to address the essential complexity, we benefit from the germane complexity of the library. True, but now consider the infinite abstraction possible in prose. Which has more power to simplify by replacing essential complexity with germane complexity?
Build me a minecraft clone - there's almost zero precision here, if it weren't for the fact that word minecraft is incredibly load bearing in this sentence, then you'd have no chance of building the right thing. One sentence. Contrast with the code you'd have to write and read to express the same.
This article is really attacking vague prose that pushes ambiguity onto the agent - okay, fair enough. But that's a tooling problem. What if you could express structure and relationships at a higher level than text, or map domain concepts directly to library components? People are already working on new workflows and tools to do just that!
Also, dismissing the idea that "some day we'll be able to just write the specs and the program will write itself" is especially perplexing. We're already doing it, aren't we? Yes, it has major issues but you can't deny that AI agents are enabling literally that. Those issues will get fixed.
The historical parallel matters here as well. Grady Booch (co-creator of UML) argues we're in the third golden age of software engineering:
- 1940s: abstracted away the machine -> structured programming
- 1970s: abstracted away the algorithm -> OOP, standard libraries, UML
- Now: abstracting away the code itself
Recent interview here: https://www.youtube.com/watch?v=OfMAtaocvJw
Each previous transition had engineers raising the same objections: "this isn't safe", "you're abstracting away my craft". They were right that something was lost, but wrong that it was fatal. Eventually the new tools worked well enough to be used in production.
Which was mostly a failure, to the point that there is a major movement towards languages that "abstract away" (in this sense) less, e.g. Rust.
Certainly if the creators of UML are saying that AI is great, that gives me more confidence than ever that it's bunk.
Rust uses crates to import those dependencies, which was one of its biggest innovations.
A sufficiently detailed spec need only concern itself with essential complexity.
Applications are chock-full of accidental complexity.
Technical design docs are higher level than code, they are impricise but highlight an architectural direction. Blanks need to be filled in. AI Shines here.
Formal specs == code Some language shine in being very close to a formal spec. Yes functional languages.
But lets first discuss which kind of spec we talk about.
That is simply not true. There is a ton of litterature around inherent vs accidental complexity, which in an ideal world should map directly to spec vs code. There are a lot of technicalities in writing code that a spec writer shouldn't know about.
Code has to deal with the fact that data is laid out a certain way in ram and on disk, and accessing it efficiently requires careful implementation.
Code has to deal with exceptions that arise when the messiness of the real world collides with the ideality of code.
It half surprises me that this article comes from a haskell developer. Haskell developers (and more generally people coming from maths) have this ideal view of code that you just need to describe relationships properly, and things will flow from there.
This works fine up to a certain scale, where efficiency becomes a problem.
And yes, it's highly probable that AI is going to be able to deal with all the accidental complexity. That's how I use it anyways.
That is the great insight I can offer
Looks like I would have been better off!
We have spent decades working on reproducible builds or deterministic compilation. To achieve this, all steps must be deterministic. LLMs are not deterministic. You need to commit source code.
I did a side project with a non-technical co-founder a year ago and every time he told me what he wanted, I made a list of like 9 or 10 logical contradictions in his requirements and I had to walk him through what he said with drawings of the UI so that he would understand. Some stuff he wanted me to do sounded good in his head but once you walk through the implementation details, the solution is extremely confusing for the user or it's downright physically impossible to do based on cost or computational resource constraints.
Sure, most people who launched a successful product basically stumbled onto the perfect idea by chance on the first attempt... But what about the 99% others who fell flat on their face! You are the 99% and so if you want to succeed by actual merit, instead of becoming a statistic, you have to think about all this stuff ahead of time. You have to simulate the product and business in detail in your mind and ask yourself honestly; is this realistic? Before you even draw your first wireframe. If you find anything wrong with it, anything wrong at all; it means the idea sucks.
It's like; this feature is too computationally and/or financially expensive to offer for free and not useful enough to warrant demanding payment from users... You shouldn't even waste your time with implementation; it's not going to work! The fundamental economics of the software which exists in your imagination aren't going to magically resolve themselves after implementing in reality.
Translating an idea to reality never resolves any known problems; it only adds more problems!
The fact is that most non-technical people only have a very vague idea of what they want. They operate in a kind of wishy washy, hand-wavy emotion-centric environment and they think they know what they're doing but they often don't.
Like they say “everything comes round again”
> For every specification satisfied by the input program, the output program satisfies the same specification.
This is not a program and it does not become a program once you fill in the holes. Making the statement precise clearly requires a formal language, but that language can work at a higher level of abstraction than a programming language. So yes, a specification can absolutely be simpler than a program that implements it.
If you are only noting down requirements of what the program is doing rather than how it should do that thing, I would expect that writing the requirements would necessarily be more succinct than writing the program.
It terrifies me every time I realize people are just jumping into writing code without specifying what that code is required to do. A day of hacking saves an hour of critical thought and planning.
Anyone who studied software engineering, should know that specification doesn’t bother with implementation details of the underlying technology.
Things such as quite specific engine are used, are the contents of an encapsulated subsystem.
Proper software engineering specification is incompatible with a hacker culture and picking technology beforehand is a bad practice. It’s much closer to waterfall than to C4.
However, the last 20 years we got software building blocks which impose system architectural restrictions: frameworks. And also pieces of software which are half cooked systems.
Far are the days of requirements, preconditions, postconditions and invariants, network diagrams and entity relationship models.
Then came all sorts of shenanigans, from memory management to syntax hell, which took forever to learn effectively.
This stage was a major barrier to entry, and it's now gone — so yeah, things have indeed changed completely.
To me, spec answers the what, the plan answers the how, and in what order, build packets answer the how but with more granularity.
In most cases, you should only care about the what. How it gets done (plan) is simply an implementation detail that you should not care about the same way automated tests should not care about them.
What you prescribe in spec is that data must pass from A to B through C, preserved in D and presented in E in shape of F. It's much easier to write (and change) this in spec than in say, Rust.
Simple thought experiment: Imagine you have a spec and some code that implements it. You check it, and find that a requirement in the spec was missed. Obviously that code is not the spec; the spec is the spec. But also you couldn't even use the code as a spec because it was wrong. Now remove the spec.
Is the code a spec for what was supposed to be built? No. A requirement was missed. Can you tell from just the code? Also no. You need a two separate sources that tell you what was meant to be written in case the either of them is wrong. That is usually a spec and the code.
They could both be wrong, and often are, but that's a people problem.
I feel like if you’re designing a language, the activity of producing the spec, which involves the grammar etc., would allow you to design unencumbered by whether your design is easy to implement. Or whether it’s a good fit for the language you are implementing the compiler with.
The OP also correctly identifies that thoughtful design takes a back seat in favor of action when we start writing the code.
So unless you want bugs to be your specification, you actually need to specify what you want.
If you want a specification from source code, you need to reverse engineer it. Although that’s a bit easier now, with LLMs.
Pure specification itself is useless without actual implementation (which does the job), so, trying to write such specification (in a natural language) has no practical sense.
If you look at what implementions modern compilers actually come up with, they often look quite different from what you would expect from the source code
Could be that the truth is somewhere in between?
I also enjoyed the writing style so much that I felt bad for myself for not getting to read this kind of writing enough. We are drowning in slop. We all deserve better!
Once the project can't fit in model's usable context window (~150k tokens even for 1M models), you need code fighting back and leaving breadcrumbs for model to follow.
LLMs are very good at bash, which I’d argue doesn’t read like language or math.
Those very detailed specs then let agents run for a long time without supervision so nice for multi tasking :)
Confusing the *how* and the *what* is very common when discussing specifications, in my experience. Programmers gravitate toward pseudocode when they have trouble articulating a functional requirement.
> Specifications were never meant to be time-saving devices.
Correct. Anyone selling specifications as a way to save time does not understand the purpose of a specification. Unfortunately, neither does the article's author. The article is based on a false premise.
LLMs experience the same problems as humans when provided with underspecified requirements. That's a specification problem.
[1]: https://en.wikipedia.org/wiki/Software_requirements_specific...
Or everything will converge toward a rust like inference optimized gibberish...
“This time is different” are the famous last words.
I sort of see the success of coding agents as not just a sign that people get tricked by slop, but mainly a sign that today's languages and frameworks just aren't that good. And now it's easier to say to the LLM, "I literally don't care how the UI is organized as long as everything is visible" than to try to express that in code.
[1]: I mean this in the vaguest sense, so including DSLs/frameworks/interfaces offered by libraries too
Simply put: Formal language = No ambiguities.
Once you remove all ambiguous information from an informal spec, that, whatever remains, automatically becomes a formal description.
When I code by hand under time pressure, I’m actually more likely to dig a hole. Not because I can’t write code, but because humans get impatient, bored, optimistic and sloppy in predictable ways. The machine doesn’t mind doing the boring glue work properly.
But that is not the real problem.
The real problem is what happens when an organisation starts shipping code it does not understand. That problem predates LLMs and it will outlive them. We already live in a world full of organisations that ship bad systems nobody fully understands, and the result is the usual quagmire: haunted codebases, slow change, fear-driven development, accidental complexity, and no one knowing where the actual load-bearing assumptions are.
LLMs can absolutely make that worse, because they increase the throughput of plausible code. If your bottleneck used to be code production, and now it’s understanding, then an organisation that fails to adapt will just arrive at the same swamp faster.
So to me the important distinction is not “spec vs code”. It’s more like:
• good local code is not the same thing as system understanding
• passing tests are not the same thing as meaningful verification
• shipping faster is not the same thing as preserving legibility
The actual job of a programmer was never just to turn intent into syntax anyway. Every few decades the field reinvents some story about how we no longer need programmers now: Flow-Matic, CASE tools, OO, RUP, low-code, whatever. It’s always the same category error. The hard part moves up a layer and people briefly mistake that for disappearance.
In practice, the job is much closer to iteratively solving a problem that is hard to articulate. You build something, reality pushes back, you discover the problem statement was incomplete, the constraints were wrong, the edge case was actually central, the abstraction leaks, the user meant something else, the environment has opinions, and now you are solving a different problem than the one you started with.
That is why I think the truly important question is not whether AI can write code.
It’s whether the organisation using it can preserve understanding while code generation stops being the bottleneck.
If not, you just get the same bad future as before, only faster, cleaner-looking, and with more false confidence.
The blue print analogy comes up frequently. IMHO this is unfortunate. Because a blueprint is an executable specification for building something (manually typically). It's code in other words. But for laborers, construction workers, engineers, etc. For software we give our executable specifications to an interpreter or compiler. The building process is fully automated.
The value of having specifications for your specifications is very limited in both worlds. A bridge architect might do some sketches, 3D visualizations, clay models, or whatever. And a software developer might doodle a bit on a whiteboard, sketch some stuff out on paper or create a "whooooo we have boxes and arrows" type stuff in a power point deck or whatever. If it fits on a slide, it has no meaningful level of detail.
As for AI. I don't tend to specify a lot when I'm using AI for coding. A lot of specification is implicit with agentic coding. It comes from guard rails, implicit general knowledge that models are trained one, vague references like "I want red/green TDD", etc. You can drag in a lot of this implicit stuff with some very rudimentary prompting. It doesn't need to be spelled out.
I put an analytics server live a few days ago. I specified I wanted one. And how I wanted it to work. I suggested Go might be a nice language to build it in (I'm not a Go programmer). And I went in to some level of detail on where and how/where I wanted the events to be stored. And I wanted a light js library "just like google analytics" to go with it. My prompt wasn't much larger than this paragraph. I got what I asked for and with some gentle nudging got it in a deployable state after a few iterations.
A few years ago you'd have been right to scald me for wasting time on this (use something off the shelf). But it took about 20 minutes to one shot this and another couple of hours to get it just right. It's running live now. As far as I can see with my few decades of experience, it's a pretty decent version of what I asked for. I did not audit the code. I did ask for it to be audited (big difference) and addressed some of the suggested fixes via more prompting ("sounds good, do it").
If you are wondering why, I'm planning to build a AI dashboard on top of this and I need the raw event store for that. The analytics server is just a dirt cheap means to an end to get the data where I need it. AI made the server and the client, embedded the client in my AI generated website that I deployed using AI. None of this involved a lot of coding or specifying. End to end, all of this work was completed in under a week. Most of the prompting work went into making the website really nice.
Of course I also use the AI to refine the spec, but again not to 100% detail, only to ensure I truly cover all that matters to me.
"Is this it?" "NOPE"
Say you and I both wrote the same spec that under-specifies the same parts. But we both expect different behavior, and trust that LLM will make the _reasonable_ choices. Hint: “The choice that I would have made.”
Btw, by definition, when we under-specify we leave some decisions to the LLM unknowingly.
And absent our looks or age as input, the LLM will make some _reasonable_ choices based on our spec. But will those choices be closer to yours or mine? Assuming it won’t be neither.
Writing code has always been a means to an end of getting the job done, and I consider being really interested in the practice of coding to be a bit odd. It's as if a carpenter got really into manual saws and got upset about an electric power saw because he "loves manual saws". Whether or not an AI spec is "code" really shouldn't matter if you're a software professional. You are designer/engineer of software systems, not a "coder".
Let’s first agree that code is the final output of a software engineer, and that a wooden object is the final output of a carpenter.
A correct analogy would be, if a software engineer’s AI assisted output is reduced to assembling auto generated code snippets into a workable product, a carpenters is assembling ikea furniture.
The only missing piece is liability. The engineer and carpenter are both hired to assemble something, so they better damn well fully understand what they are delivering because they are being paid to own the liability if anything fails.
Posting something like this in 2026: they must not have heard of LLMs.
And also: this is such a typical thing a functional programmer would say: for them code is indeed a specification, with strictly no clue (or a vague high level idea at most) as to how the all effing machine it runs on will actually conduct the work.
This is not what code is for real folks with real problems to solve outside of academic circles: code is explicit instructions on how to perform a task, not a bunch of constraints thrown together and damned be how the system will sort it out.
And to this day, they still wonder why functional programming almost never gets picked up in real world application.