- Large C codebase (new feature and bugfix)
- Small Rust codebase (new feature)
- Brand-new greenfield frontend for an in-spec, documented OpenAPI API
- Small fixes to an existing frontend
It failed _dramatically_ in all cases. Maybe I'm using this thing wrong, but it is a Devin-level fail. Gets diffs wrong. Passes phantom arguments to tools. Screws up basic features. Pulls in hundreds of lines of changes to unrelated files to refactor. Refactors again and again, over itself, partially, so that the unfinished boneyard of an old refactor sits in the codebase like a skeleton (and those tokens are also sent up to the model).
It genuinely makes an insane, horrible spaghetti MESS of the codebase. Any codebase. I expected it to be good at Svelte and SolidJS, since those are popular JavaScript frameworks with lots of training data. Nope, it's bad. This was a few days ago, with Claude 4. Seriously, seriously, people, what am I missing here with this agents thing? They are such gluttonous eaters of tokens that I'm beginning to think these agent posts are paid advertising.
An interesting thing about many of these types of posts is they never actually detail the tools they use and how they use them to achieve their results. It shouldn’t even be that hard for them to do, they could just have their agent do it for them.
You may be right. The author of this one even says that if you spend time prettying up your code, you should stop yak shaving. They apparently don't care about code quality.
brought to you by fly.io, where the corporate blog literally tells you to shove your concerns up your ass:
> Cut me a little slack as I ask you to shove this concern up your ass.
A prompt like "Write a $x program that does $y" is generally going to produce some pretty poor code. You generally want to include a lot of details and desires in your prompt, and include something like "Ask clarifying questions until you can provide a good solution."
A lot of the people who complain about poor code generation use poor prompting.
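As a sketch of the difference, the task and exact wording below are illustrative, not from the original comment:

```
Poor:   "Write a Python program that deduplicates a CSV file."

Better: "Write a Python 3 script that deduplicates rows in a CSV file.
         - Keep the first occurrence of each duplicate row.
         - Preserve the header row and original column order.
         - Read the input path from argv[1], write to argv[2].
         - Use only the standard library.
         Ask clarifying questions until you can provide a good solution."
```

The second prompt pins down the details the model would otherwise guess at, and the closing instruction surfaces the remaining ambiguity before any code is written.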
Simon Willison has some great examples in his blog and on his GitHub. Check out Karpathy’s YouTube videos as well.
I've been developing my prompting skills for nearly three years now and I still constantly find new and better ways to prompt.
I also consider knowing what "use a reasoning model" means to be part of that skill!
As with any other project, it's better to specify your wants and needs than to let someone, or an LLM, guess.
So I'd say Claude 4 agents today are at the autonomy level of a smart but fresh intern. You still have to do the high-level planning and task breakdown, but they can execute on tasks (say, requiring 10-200 lines of code, excluding tests). Asking them to write much more code (200+ lines) often requires a lot of follow-ups and ends in disappointment.
A significant portion of my prompts involve writing to and reading from .md files, which plan and document the progress.
When I start a new feature, it begins with: We need to add a new feature X that does ABC, create a .md in /docs to plan this feature. Ask me questions to help scope the feature.
I then manually edit the feature-x.md file, and only then tell the tool to implement it.
Also, after any major change, I say: Add this to docs/current_app_understanding.md.
Every single chat starts with: Read docs/current_app_understanding.md to get up to speed.
The really cool side benefit here is that I end up with solid docs, which I admittedly would have never created in the past.
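Laid out end to end, the workflow above looks something like this (the file names are the commenter's own; the exact prompt wording is illustrative):

```
1. "We need to add a new feature X that does ABC. Create a .md in /docs
    to plan this feature. Ask me questions to help scope the feature."
2. Manually edit docs/feature-x.md until the plan is right.
3. "Implement docs/feature-x.md."
4. After any major change: "Add this to docs/current_app_understanding.md."
5. Start every new chat with: "Read docs/current_app_understanding.md
    to get up to speed."
```

The key design choice is that the human edits the plan file, not the code, so the agent always works from a reviewed spec and every session starts from the same shared context.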
You don't exactly need to know prompting, you just need to know how to ask the AI to help you prompt it.
Writing code is one thing that models can do when wired up properly, and you can get a powerful productivity boost, but wielding the tools well is a skill of its own, and results will vary by task, with each model having unique strengths. The most important skill is understanding the limitations.
Based on your task descriptions and the implied expectations, I'm unsurprised that you are frustrated with the results. For good results with anything requiring architecture decisions, have a discussion with the model about the architecture design before diving in. Come up with a step-by-step plan and work through it together. Models are not like people; they know everything and nothing.
We’ve built tools to help us with the first part, frameworks with the second, architecture principles with the third, and software engineering techniques with the fourth. Where do LLMs help?
With my async agent I don't care how easy it is for me; it's easier to tell the agent to do the workflow and come back to it later when I'm ready to review. If it's a good change I approve the PR; if not, I close it.
I'm 100% certain most, if not all, of them are. There is simply too much money flying around, and I've seen what marketing has done in the past for way less hyped products. Though in this specific case I think the writer may simply be shilling AI to create demand for their service: pay us monthly to one-click deploy your broken, incomplete AI slop. The app doesn't work? No problem, just keep prompting harder and paying us more to host/build/test/deploy it...
I've also tried the agent thing, and still am, with only moderate success. Cursor, Claude Squad, Goose, Dagger AI agents. In other words, all the new hotness, all with various features claiming to solve the fact that agents don't work. Guess what? They still don't.
But hey, this is HN; most of the posters are tech-fearing Luddites, right? All the contention on here must mean our grindset is wrong and we are not prompting hard enough.
There is even one shill, Ghuntly, who claims you need to be "redlining" AI at a cost of $500-$1000 per day to get the full benefits. LOL, if that is not a veiled advertisement I don't know what is.
However, a counterargument to all this:
Does it matter if the code is messy?
None of this matters to the users and people who only know how to vibe code.
It matters proportionally to the amount of time I intend to maintain it for, and the amount of maintenance expected.