I have similar years experience and regularly try out AI for development but always find it’s slower for the things I want to build and/or that it produces less than satisfactory results.
Not sure if it’s how I use the models (I’ve experimented with all the frontier ones), or the types of things I’m building, or the languages I’m using, or if I’m not spending enough, or if my standards are just too high for the code that is produced, but I almost always end up going back to doing things by hand.
I try to keep the AI focused on small, well-defined tasks, use AGENT.MD and skills, build out a plan first followed by tests for spec-based development, keep context windows and chats a reasonable length, etc. But if I add up all that time, I could have done it myself and gained a better grasp of the program and the domain in the process.
I keep reading how AI is a force multiplier but I’m yet to see it play out for myself.
I see lots of posts talking about how much more productive AI has made people, but very few with actual specifics on setup, models, costs, workflows etc.
I’m not an AI doomer and would love to realize the benefits people are claiming they get... but how to get there is the question.
Initially I was astounded by the results.
Then I wrote a large feature (ad pacing) on a site using LLMs, and I learned the LLMs did not really understand what they were doing. The algorithm itself (a PID controller) was implemented properly, since there is plenty of data to train on, but it was trying to optimize the wrong thing. There were other similar findings where the LLM made very stupid mistakes. So I went through a disillusionment stage and kind of gave up for a while.
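For context, the textbook PID update the comment refers to looks roughly like this. A minimal sketch, not the actual ad-pacing code; the gains and setpoint here are made up:

```python
class PIDController:
    """Minimal textbook PID controller (illustrative only)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint          # target value we steer toward
        self.integral = 0.0               # accumulated error over time
        self.prev_error = None            # for the derivative term

    def update(self, measured, dt):
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# The hard part is not this loop; it is choosing what "measured" and
# "setpoint" actually mean. Optimize the wrong quantity and you get a
# perfectly implemented controller solving the wrong problem.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=100.0)
print(pid.update(measured=90.0, dt=1.0))  # 6.0
```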
Since then, I have learned how to use Claude Code effectively, mostly on existing Django code bases. I think everybody has a slightly different take on what works well. Probably the most reasonable advice is to just keep going and try different kinds of things. Existing code bases seem easier, as does writing a spec beforehand, requiring tests, and other basic SWE principles.
This is step 3 of “draw the rest of the owl” :-)
> the most reasonable advice is to just keep going and try different kind of things.
This is where I’ve been at for a while now. Every couple of months I try again with latest models and latest techniques I hear people talking about but there’s very little concrete info there that works for me.
Then I wonder if it’s just my spend? I don’t mind spending $30/month to experiment, but I’m not going to drop $300/month unless I can see evidence that it’ll be worth it, which I haven’t really seen. But maybe there’s a dependency, and you don’t get the results without the increased spend?
Some posts I’ve seen claim spending of $1,500/month, which would be worth it if it could increase productivity enough, but there’s very few specifics on workflows and results.
I use Claude every day for everything, it's amazing value for money.
Give it a specific task with the context it needs, that's what I find works well, then iterate from there. I just copy paste, nothing fancy.
Fair enough :-)
This reminds me about pigeon research by Skinner. Skinner placed hungry pigeons in a "Skinner box" and a mechanism delivered food pellets at fixed, non-contingent time intervals, regardless of the bird's behavior. The pigeons, seeking a pattern or control over the food delivery, began to associate whatever random action they were performing at the moment the food appeared with the reward.
I think we humans have similar psychology: we tend to form superstitions about whatever we were doing when the rewards arrived, if they happen at random intervals.
To me it seems we are at a phase where what works with LLMs (the reward) is still quite random, but it is psychologically difficult for us to admit it. Therefore we invent various theories of why something appears to work, which are closer to superstitions than to real, repeatable processes.
It seems difficult to generalize repeatable processes of what really works, because it depends on too many things. This may be why you are unsuccessful when following these descriptions.
But while it seems less useful to work from theories of what works, and despite my initially skeptical attitude, I have found that LLMs can be a huge productivity boost. It really depends on the context, though.
It seems you just need to keep trying various things, and eventually you may find out what works for you. There is no shortcut where you just read a blog post and then you can do it.
Things I have tried successfully:
- Modifying existing large-ish Django projects, adding new apps. It can sometimes use Django components & HTMX/AlpineJS properly, but sometimes starts doing something else. One app uses tenants, and the LLM appears to constantly struggle with this.
- Creating new Django projects. This was less successful than modifying existing projects, because the LLM could not imitate established practices.
- Apple Swift mobile and watch applications. This was surprisingly successful, but these were not huge apps.
- A Python GUI app, which was more or less successful.
- GitHub Pages static web sites based on certain content.
I have not copied any CLAUDE.md or other files. Every time Claude Code does something I don't appreciate, I add a new line. Currently it is at 26 lines.
I have made a few skills. They mostly exist so it can work independently in a loop, for example testing something that does not work.
Typically I try to limit the technologies to something I know really well. When something fails, I can often quickly figure out what is wrong.
I started with the basic plan (I guess it is that $30/month). I only upgraded to $100 Max and later to $180 2xMax because I was hitting limits.
But the reason I was hitting limits was that I was working on multiple projects in multiple environments at the same time. Hitting the limits is the only difference I have seen between plans; I have not seen any difference in quality.
You will certainly understand a program better where you write every line of code yourself, but that limits your output. It's a trade-off.
The part that makes it work quite well is that you can also use the LLM to better understand the code where required, simply by asking.
The difference between delegating to a human vs an LLM is that a human is liable for understanding it, regardless of how it got there. Delegating to an LLM means you're just more rapidly creating liabilities for yourself, which indeed is a worthwhile tradeoff depending on the complexity of what you're losing intimate knowledge of.
In the end the person in charge is liable either way, in different ways.
Maybe that’s just the level I gave up at and it’s a matter of reworking the Claude.md file and other documentation into smaller pieces and focusing the agent on just little things to get past it.
It doesn’t have to be exactly how I would do it but at a minimum it has to work correctly and have acceptable performance for the task at hand.
This doesn’t mean being super optimized just that it shouldn’t be doing stupid things like n+1 requests or database queries etc.
See a sibling comment for one example on correctness. Another, related to performance, was querying some information from a couple of database tables (the first with 50,000 rows, the next with 2.5 million).
After specifying things in enough detail to let the AI go, it got correct results, but performance was rather slow. A bit more back-and-forth and it got up to processing 4,000 rows a second.
It was so impressed with its new performance it started adding rocket ship emojis to the output summary.
There were still some obvious (to me) performance issues, so I pressed it to see if it could improve the performance. It started suggesting database config tweaks, which provided marginal improvements, but it was still missing some big wins elsewhere: namely, it was avoiding “expensive” joins and doing that work in the app instead, resulting in n+1 DB calls.
So I suggested getting the DB to do the join and just processing the fully joined data on the app side. This doubled throughput (8,000 rows/second) and led to claims from the AI that this was now enterprise-ready code.
There was still low hanging fruit though because it was calling the db and getting all results back before processing anything.
After suggesting switching to streaming results (good point!) we got up to 10,000 rows/second.
This was acceptable performance, but after a bit more wrangling we got things up to 11,000 rows/second and now it wasn’t worth spending much extra time squeezing out more performance.
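The progression described above (let the database do the join, then stream rows instead of materialising everything first) can be sketched with sqlite3. The table and column names here are invented for illustration, not taken from the actual project:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE events (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL);
    INSERT INTO accounts VALUES (1, 'a'), (2, 'b');
    INSERT INTO events VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Slow pattern: one query per account, i.e. n+1 round trips to the DB.
def totals_n_plus_one():
    totals = {}
    for acc_id, name in conn.execute("SELECT id, name FROM accounts"):
        row = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM events WHERE account_id = ?",
            (acc_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals

# Faster: a single join in the database, and iterating the cursor
# lazily (streaming) instead of fetchall()-ing millions of rows first.
def totals_joined():
    cur = conn.execute("""
        SELECT a.name, COALESCE(SUM(e.amount), 0)
        FROM accounts a LEFT JOIN events e ON e.account_id = a.id
        GROUP BY a.id
    """)
    return {name: total for name, total in cur}  # rows consumed as they arrive

print(totals_n_plus_one())  # {'a': 15.0, 'b': 7.5}
print(totals_joined())      # same result, one round trip
```

With two tiny tables both versions are instant; the difference only shows up at the 2.5-million-row scale the comment describes.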
In the end the AI came to a good result, but at each step of the way it was me nudging it in the correct direction, and then the AI congratulating me on the incredible “world class performance” (actual quote, but difficult to believe when you then double performance again).
If it had just been me, I would have finished it in half the time.
If I’d delegated to a less senior employee and we’d gone back and forth a bit, pairing to get it to this state, it might have taken the same amount of time and effort, but they would at least have learnt something.
Not so with the AI, however: it learns nothing, and next time I have to re-explain things and concepts all over again, in sufficient detail that it will do a reasonable job (not expecting perfection, just acceptable).
And so my experience so far (much more than just these 2 examples) is that I can’t trust the AI to the point where I can delegate enough that I don’t spend more time supervising/correcting it than I would spend writing things myself.
Edit: using AI to explain existing code is a useful thing it can do well. My experience is it is much better at explaining code than producing it.
`database-query-speed-optimization`: "Some rules of thumb for using database queries:
- Use joins
- Streaming results is faster
- etc."
That way, the next time you have to do something like this, you can remind it of / it will find the skill.
I laughed more at this than I probably should have, out of recognition.
When I’m writing my own code I can verify the logic as I go, and coupled with a strong type system and judicious use of _some_ tests, that’s generally enough for my code to be correct.
By comparison the AI needs more tests to keep it on the right path otherwise the final code is not fit for purpose.
For example, in a recent use case I needed to take a JSON blob containing an array of numeric strings and return an array of Decimals sorted in ascending order.
This seemed a perfect use case - a short well defined task with clear success criteria so I spent a bunch of time writing the requirements and building out a test suite and then let the AI do its thing.
The AI produced OK code, but it sorted everything lexicographically before converting to Decimals, rather than converting first and sorting numerically, so "1000" came out less than "900".
So I pointed it out, the AI said "good point, you’re absolutely correct", we added a test for it, and on the next pass it got the right result. But that’s not a mistake I would have made or needed a test for (though you could argue it’s a good test to have).
You could also argue that I should have specified the problem more clearly, but then we come back to the point that if I’m writing every specific detail in English first, it’s faster for me just to write it in code in the first place.
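That lexicographic-vs-numeric mistake is easy to reproduce. A minimal sketch (the blob contents are invented for illustration):

```python
import json
from decimal import Decimal

blob = '{"values": ["900", "1000", "25.5"]}'
values = json.loads(blob)["values"]

# Buggy: sorting the strings lexicographically first, so "1000" < "900"
# because '1' < '9' character by character.
buggy = [Decimal(s) for s in sorted(values)]

# Correct: convert to Decimal first, then sort numerically.
correct = sorted(Decimal(s) for s in values)

print(buggy)    # [Decimal('1000'), Decimal('25.5'), Decimal('900')]
print(correct)  # [Decimal('25.5'), Decimal('900'), Decimal('1000')]
```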
I feel this is a gross mischaracterization of any user flow involving using LLMs to generate code.
The hard part of generating code with LLMs is not how fast the code is generated. The hard part is verifying it actually does what it is expected to do. Unit tests too.
LLMs excel at spewing out test cases, but you need to review every single one to verify it is meaningful and valid, and you need to iterate over the tests with feedback on whether they are even green and what the code coverage is. That is the part that consumes time.
Claiming that LLMs are faster at generating code than you is like claiming that copy-and-pasting code out of Stack Overflow is faster than you writing it. Perhaps, but how can you tell if the code actually works?
"Write unit tests with full line and branch coverage for this function:

    def add_two_numbers(x, y):
        return x + y + 1
"
Sometimes the LLM will point out that this function does not, in fact, return the sum of x and y. But more often, it will happily write "assert add_two_numbers(1, 1) == 3", without comment.
The big problem is that LLMs will assume that the code they are writing tests for is correct. This defeats the main purpose of writing tests, which is to find bugs in the code.
Run Cursor in “agent” mode, or create a Codex or Claude Code “unit test” skill. I recommend Claude Code.
Explain to the LLM that after it creates or modifies a test, it must run the test to confirm it passes. If it fails, it’s not allowed to edit the source code, instead it must determine if there is a bug in the test or the source code. If the test is buggy it should try again, if there is a bug in the source code it should pause, propose a fix, and consult with you on next steps.
The key insight here is you need to tell it that it’s not supposed to randomly edit the source code to make the test pass. I also recommend reviewing the unit tests at a high level, to make sure it didn’t hallucinate.
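A sketch of what such a skill file might contain. The wording and frontmatter are invented for illustration; the exact layout depends on the tool:

```
---
name: unit-test
description: Write and verify unit tests without touching source code
---

When you create or modify a test:
1. Run the test and confirm whether it passes.
2. If it fails, do NOT edit the source code to make it pass.
3. Decide whether the bug is in the test or in the source code.
4. If the test is buggy, fix the test and re-run.
5. If the source code is buggy, stop, propose a fix, and ask
   the user how to proceed.
```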