My experience to date across the major LLMs is that they are quick to leap to complex solutions, and I find that the code often is much harder to maintain than if I were to do it myself.
But complex code is only part of the problem. Another huge problem I see is the rapid accumulation of technical debt. LLMs will confidently generate massive amounts of code with abstractions and design patterns that may be a good fit in isolation, but are absolutely the wrong pattern for the problem you're trying to solve or the system you're trying to build. You run into the "existing code pattern" problem that Sandi Metz talked about in her fantastic 2014 RailsConf talk, "All the little things" [0]:
> "We have a bargain to follow the pattern, and if the pattern is a good one then the code gets better. If the pattern is a bad one, then we exacerbate the problem."
Rapidly generating massive amounts of code with the wrong abstractions and design patterns is insidious because it feels like incredible productivity. You see it all the time in posts on e.g. Twitter or LinkedIn. People gushing about how quickly they are shipping products with minimal to zero other humans involved. But there is no shortcut to understanding or maintainability if you care about building sustainable software for the medium to long-term.
EDIT: Forgot to add link
I've built some rather complex systems:
Guish, a bi-directional CLI/GUI for constructing and executing Unix pipelines: https://github.com/williamcotton/guish
WebDSL, fast C-based pipeline-driven DSL for building web apps with SQL, Lua and jq: https://github.com/williamcotton/webdsl
Search Input Query, a search input query parser and React component: https://github.com/williamcotton/search-input-query
The greenfield projects turn into a mess very quickly, because if you let it code without any guidance (wrt documentation, interactivity, testability, modularity) it generates crap until you can't modify it. The greenfield project turns into legacy as fast as the agent can spit out new code.
This is an important point. Unconstrained code generation lets you witness accelerated codebase aging in real-time.
Easier to test, lower cognitive overload, and it's faster to onboard someone when they only need to understand a small part at a time.
I almost wonder if these LLMs can be used to assess the barrier to onboarding. If it gets confused and generates shitty suggestions, I wonder if that could be a good informal smoke alarm for trouble areas the next junior will run into.
You should not be structuring the code for an LLM alone, but I have found that trying to be very modular has helped both my code and my ability to utilize LLMs on top of it.
Eventually I ended up looking at the notebook and the extracted code side-by-side and carefully checking every line. Despite being split across dozens of cells, it would have been faster if I had started out by just manually copying the code out of each meaningful cell and pasted it all together.
These are big legacy projects where I didn’t write the code to begin with, so having an AI partner would have been really nice.
The example prompts are useful. They not only reduced the activation energy required for me to start installing this habit in my personal workflows, but also inspired the notion that I can build a library of good prompts and easily implement them by turning them into TextExpander snippets.
P.S.: Extra credit for the Insane Clown Posse reference!
One idea I really like here is asking the model to generate a todo list.
It looks like the sort of nonproductive yak-shaving you do when you're stuck or avoiding an unpleasant task--coasting, fooling around incrementally with your LLM because your project's fucked and you psychologically need some sense of progress.
The opposite of this is burnout--one of the things they don't tell you about successful projects with good tools is they induce much more burnout than doomed projects. There's a sort of Amdahl's Law in effect, where all the tooling just gives you more time to focus on the actual fundamentals of the product/project/problem you’re trying to address, which is stressful and mentally taxing even when it works.
Fucking around with LLM coding tools, otoh, is very fun, and like constantly clean-rebuilding your whole (doomed) project, gives you both some downtime and a sense of forward momentum--look how much the computer is chugging!
The reality testing to see if the tool is really helping is to sit down with a concrete goal and a (near) hard deadline. Every time I've tried to use an LLM under these conditions it just fails catastrophically--I don't just get stuck, I realize how basically every implicit decision embedded in the LLM output has an unacceptably high likelihood of being wrong, and I have an amount of debug cycles ahead of me exceeding the time to throw it all away and do it without the LLM by, like, an order of magnitude.
I'm not an LLM-coding hater and I've been doing AI stuff that's worked for decades, but current offerings I've tried aren't even close to productive compared to searching for code that already exists on the web.
It’s one of those things where a little upskilling can make a big impact. So many things in life need a bit of practice before they’re useful to you.
For starters, you need to change the default prompt in your editor to make it do what you want. If it does something annoying or weird, tell it in the prompt not to take that approach. For me, that was absurdly long, useless explanations. Now it's short and sweet.
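To make this concrete, a hypothetical set of default-prompt rules along these lines might look like the following (the exact wording is illustrative, not from any particular editor's defaults):

```text
- Be concise. Do not explain the code unless I ask.
- Prefer small, focused diffs over rewrites.
- Do not add new dependencies without asking first.
- Match the existing code style and naming conventions.
```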
ouch. You've thought about this, haven't you? Your ideas are intriguing to me, and I wish to subscribe to your newsletter.
I believe most people who struggle to be productive with language models simply haven’t put in the necessary practice to communicate effectively with AI. The issue isn’t with the intelligence of the models—it’s that humans are still learning how to use this tool properly. It’s clear that the author has spent time mastering the art of communicating with LLMs. Many of the conclusions in this post feel obvious once you’ve developed an understanding of how these models "think" and how to work within their constraints.
I’m a huge fan of the workflow described here, and I’ll definitely be looking into aider and repomix. I’ve had a lot of success using a similar approach with Cursor in Composer Agent mode, where Claude 3.5 Sonnet acts as my "code implementer." I strategize with larger reasoning models (like o1-pro, o3-mini-high, etc.) and delegate execution to Claude, which excels at making inline code edits. While it’s not perfect, the time savings far outweigh the effort required to review an "AI Pull Request."
Maximizing efficiency in this kind of workflow requires a few key things:
- High typing speed – Minimizing time spent writing prompts means maximizing time generating useful code.
- A strong intuition for "what’s right" vs. "what’s wrong" – This will probably become less relevant as models improve, but for now, good judgment is crucial.
- Familiarity with each model’s strengths and weaknesses – This only comes with hands-on experience.
Right now, LLMs don’t work flawlessly out of the box for everyone, and I think that’s where a lot of the complaints come from—the "AI haterade" crowd expects perfection without adaptation.
For what it’s worth, I’ve built large-scale production applications using these techniques while writing minimal code by hand myself.
Most of my experience using these workflows has been in the web dev domain, where there's an abundance of training data. That said, I’ve also worked in lower-level programming and language design, so I can understand why some people might not find models up to par in every scenario, particularly in niche domains.
Let’s be honest. The author was probably playing cookie clicker while this article was being written.
Does the time invested in the planning benefit you? Have you noticed fewer hallucinations? Have you saved time overall?
I’d be curious to hear because my current workflow is basically
1. Have idea
2. create-next-app + ShadCN + TailwindUI boilerplate
3. Cursor Composer on agent mode with Superwispr voice transcription
I’m gonna try the author’s workflow regardless, but would love to hear others’ opinions.
With all the layoffs in our sector, I wouldn't blame you if you didn't share it, so thank you for sharing.
Do your rules count as frequent steering and lead to increased 'accuracy', or is that the 'accuracy' you're seeing with your current workflow, rules and all?
I also have a scratchpad file that I tell the model it can update to reflect anything new it learns, so that gives it a crude form of memory as it works on the codebase. This does help it use internal utility APIs.
- small .cursorrules file explaining what I am trying to build and why at a very high level and my tech stack
- a DEVELOPMENT.md file which is just a to-do/issue list for me that I tell cursor to update before every commit
- a temp/ directory where I dump contextual md and txt files (chat logs discussing feature, more detailed issue specs, etc.)
- a separate snippet management app that has my commonly used request snippets (write commit message, ask me clarifying questions, update README, summarize chat for new session, etc.)
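As an illustration, a minimal .cursorrules file in that spirit might look like this (the contents are hypothetical):

```text
I am building a small web app for tracking reading lists.
Why: a personal tool, optimized for maintainability over features.
Stack: TypeScript, React, SQLite.
Keep modules small. Update DEVELOPMENT.md before every commit.
```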
Otherwise it's pretty much what your workflow is.
Most of these workflows are just context management workflows and in Cursor it's so simple to manage the context.
For large files I just highlight the code and cmd+L. For short files, I just add them all by using /+downarrow
I constantly feed context like this and then usually come to a good solution for both legacy and greenfield features/products.
If I don't come to a good solution it's almost always because I didn't think through my prompt well enough and/or I didn't provide the correct context.
For example, rather than using Plasmo for its browser extension boilerplate and packaging utilities, I’ve chosen to ask the LLM to set up all of that for me, as it won’t have any blindspots when tasked with debugging.
At some point, specialized code-gen transformer models should get really good at just spitting out the lowest level code required to perform the job.
Having 7 different instances of an LLM analyzing the same code base and making suggestions would not just be economically wasteful; wouldn't it also be impractical, or even dangerous?
Outside of RAG, which is a different thing, are there products that somehow "centralize" the context for a team, where all questions refer to the same codebase?
It did seem to take a while to index, even though my colleagues had been using Cursor for a while, so I'm likely misunderstanding something.
I ended up finishing my side projects when I kept these in mind, rather than focusing on elegant code for elegant code's sake.
It seems the key to using LLMs successfully is to make them create a specification and execution plan, through making them ask /you/ questions.
If this skill--specification and execution planning--is passed on to LLMs, along with coding, then are we essentially souped-up tester-analysts?
> if it doesn’t work, Q&A with aider to fix
I fix errors myself, because LLMs are capable of producing large chunks of really stupid/wrong code which needs to be reverted, and that's why it makes sense to see the code at least once.
Also, I used to find myself in situations where I tried to use an LLM for the sake of using an LLM to write code (a waste of time).
Because a lot of the benefit of LLMs is bringing up ideas or questions I am not thinking of right now, and this really does that. Typically this would happen as I dug through a topic, not beforehand. So that's a net benefit.
I also tried it and it worked a charm, the LLM did respect context and the step by step approach, poking holes in my ideas. Amazing work.
I still like writing code and solving puzzles in my mind, so I won't be doing the "execution" part. From there on, I mostly use LLMs as autocomplete, for when I'm stuck, or for obscure bug solving. Otherwise, I don't get any satisfaction from programming, having learnt nothing.
Your workflow is much more polished, will definitely try it out for my next project
is more polished? What's your workflow, banging rocks together?
For example, there is this common challenge: "count how many r letters are in strawberry". You can see the issue is not counting, but that the model does not know whether "rr" should be treated as a single "r", because it is not sure if you are counting r "letters" or r "sounds"; when you sound out the word, there is a single "r" sound where it is spelled with a double "r". So if you tell the model that a double "r" stands for 2 letters, it will get it right.
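For reference, the character-level answer is trivial to compute, which is why the question is really an ambiguity problem rather than a counting problem. A quick sketch of the two readings:

```python
import re

word = "strawberry"

# Letter-level count: every "r" character counts, so "rr" counts as 2.
print(word.count("r"))  # 3

# "Sound"-level reading: treat each maximal run of r's ("rr") as one.
print(len(re.findall(r"r+", word)))  # 2
```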
I have no idea if it works or not!
What am I doing wrong or what am I missing? My experience has been so underwhelming I just don’t understand the hype for why people use Claude over something else.
Sorry I know there are many models out there, and Claude is probably better than 99% of them. Can someone help me understand the value of it over o1/o3? I honestly feel like I like 4o better.
/frustration-rant
The key is to give it context so it can help you. For example, if you want it to help you with Spark configuration, give it the Spark docs. If you want it to help you write code, give it your codebase.
Tools like cursor and the like make this process very easy. You can also set up a local MCP server so the LLM can get the context and tools it needs on its own.
Thanks again!
I act as the director, creativity and ideas person. I have one LLM that implements, and a second LLM that critiques and suggests improvements and alternatives.
https://news.ycombinator.com/item?id=43057907
Other than that, a great article! Very insightful.
Software is going to be prompt wrangling with some acceptance testing. Then just prompt wrangling.
I don't have a lot of hope for the software profession to survive.
This i think is the grand vision -- what could it look like?
In my mind, programming should look like a map -- you can go anywhere, and there'll be things happening. And multiple people.
If anyone wants to work on this (or has comments), hit me up!
The real thing that sold me is that the entire workflow takes 10 minutes to plan, and then 10-15 minutes to execute (let's say a Python script of medium complexity). After a solid ~20-30 min I am largely done, no debugging necessary.
It would have taken me an hour or two to do the same script.
This means I can spend a lot more time with the fam, hacking on more things, and messing about.
Also what do you mean by “I really want someone to solve this problem in a way that makes coding with an LLM a multiplayer game. Not a solo hacker experience.” ?
Total tokens in: 26,729,994
Total tokens out: 1,553,284
Last month's Anthropic bill was $89.30.
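As a back-of-the-envelope sanity check on those numbers, assuming Sonnet-style rates of $3 per million input tokens and $15 per million output tokens (the rates are assumptions; actual billing depends on which models were used):

```python
# Rough API cost estimate from token counts. The per-token rates below
# are assumed, not the actual billed rates, which vary by model.
tokens_in = 26_729_994
tokens_out = 1_553_284

RATE_IN = 3.00 / 1_000_000    # assumed $/input token
RATE_OUT = 15.00 / 1_000_000  # assumed $/output token

cost = tokens_in * RATE_IN + tokens_out * RATE_OUT
print(f"${cost:.2f}")  # ≈ $103.49 at these assumed rates
```

The quoted $89.30 coming in below this estimate would be consistent with part of the traffic going to cheaper models or hitting prompt caching.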
--
I want to program with a team, together. not a group of people individually coding with an agent, and then managing merges. I have been playing a lot with merging team context - but haven't gotten too far yet.
Nonprofit X publishes outputs from competing AIs, which are not copyrightable.
Corp Y ingests content published by Nonprofit X.
As opposed to Vintage Pioneer code?
Legacy modern code would be anything from the last 5-10 years. Vintage Pioneer code (which I have both initialized and maintained) is more than 20 years old.
I am trying not to be a vintage pioneer these days.
So either you put the whole codebase into the context (which will mostly lead to problems, as tokens are limited), or you have some kind of summary of your current features, etc.
Or you do some kind of "black box" iterations, which I feel won’t be that useful for new features, as the model should know about current features etc?
What’s the way here?
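To make the trade-off concrete, here's a crude sketch of checking whether a codebase even fits in a context window, using the common ~4-characters-per-token heuristic (both the heuristic and the 200k window size are assumptions; real tokenizers and model limits vary):

```python
from pathlib import Path

def estimate_tokens(root: str, exts=(".py", ".ts", ".md")) -> int:
    """Very rough token estimate: total characters / 4."""
    total_chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.suffix in exts and p.is_file()
    )
    return total_chars // 4

CONTEXT_WINDOW = 200_000  # assumed window size; varies by model

tokens = estimate_tokens(".")
if tokens > CONTEXT_WINDOW:
    print(f"~{tokens} tokens: won't fit; summarize or select files")
else:
    print(f"~{tokens} tokens: whole codebase can go in context")
```

If the estimate blows the window, that's when a summary file or a tool like repomix (with file selection) earns its keep.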
Isn't the input token number way more limited than that?
This part is unclear to me in the "non-Greenfield" part of the article.
Iterating with aider on very limited scopes is easy; I've used it often. But what about understanding a whole repository and acting on it? Following imports to understand a TypeScript codebase as a whole?
I was just wondering how to give my edits back to in-browser tools like Claude or ChatGPT, but the idea of repomix is great, will try!
Although I have been flying a bit with Copilot in VSCode, so right now I essentially have two AIs: one for larger changes (in the browser), and one for minor code fixes (in VSCode).
It is linked to in the article - a brilliant utility from Simon.
Also, don't forget that your favorite AI tools can be of great help with the factors that cause us to make software: research, subject expertise, marketing, business planning, etc.
I will dig in again. It is an exciting idea.
that is the first thing you send to aider.
Also, there was a joke below, but you can do --yes-always and it will not ask for confirmation. I find it does a pretty good job.
yes | aider

“Over my skis” ~ in over my head.
“Over my skies” ~ very far overhead. In orbit maybe?
Is that correct? Never heard the expression before, but as a skier if you're over your skis you're in control of them, while if you're backseated the skis will control you.