I'll use AI to design the implementation of a medium-sized, cross-cutting feature. Review all the details, maybe iterate on just that. Then implement with Claude 4.7 Max - which runs slower, but does a better job. Then review the implementation, then have Codex GPT 5.5 xhigh fast review it - which almost always finds corner cases. Have Claude fix those - Claude is better at writing intuitive, maintainable code versus Codex's overengineered, shortcut-filled code. (Codex is better at finding/fixing bugs and doing reviews - it's annoyingly pedantic.)
Then repeat with fresh Claude/Codex instances, having them both review the current staged changes, getting feedback, and handling the feedback. Then cover it in tests. Overall I still implement the feature faster than coding it manually, but I spend the majority of the time going back and forth with reviews and handling corner cases, and at the end I have what I feel is a really solid implementation of whatever feature I'm working on. The v1 feature feels more like a v3 given the amount of iteration it already went through.
When using Claude Code or Codex, that is all gone. Claude Code is extremely eager to reach the end goal to the point that it feels like a fever dream to write code with it. In the end, I have low confidence about edge cases and fit into the project's architectural and design goals.
On top of that, I enjoy programming, reverse engineering, etc. and I feel that the LLMs, while able to solve some problems or deliver some features, take that fun away. I'm trying really hard to find a workflow with them that I'm confident in, but I fear that workflow is just chat, search, and being a rubber duck for my thoughts.
A lot of people think a lot of things, but I don’t think the majority of people think the point of using LLMs is so they can produce low-quality code. Do they produce low-quality code sometimes or often? Of course. But they also produce high-quality code very often. And sometimes they just do a “fine” job.
One of the promises - and there are plenty of cases where it’s met and where it falls drastically short - is that agentic coding tools can help us write code faster that is just as good as or better than what a human can write. One of the other big ideal payoffs is that agentic coding can allow non-programmers to create things that previously required programmers to create.
We can debate as to how successful we’ve been toward the two goals above, but I think it’s misguided to say that the majority of people think LLMs should produce lower quality code.
I'm fairly AI-skeptical not on grounds of "do they work" but "are they good for the world". I feel that getting AIs to do this kind of review work is a rare case that doesn't outsource thinking and deskill workers. It doesn't trigger the same alarm bells as having the AI write the code (including having the AI fix the issues it discovers). That's setting aside environmental and other ethical concerns, which are still significant to me.
I have been impressed by the recent quality of AI code reviews*, but the experience of interacting with 3 separate AI reviewers via GitHub PRs is pretty terrible. Having more local-oriented and jj/rebase-aware review rounds would be great.
*context: fairly large PHP/Laravel backend and Vue frontend
[1]: https://milvus.io/blog/ai-code-review-gets-better-when-model...
Many AI models seem biased toward cutting corners by default when generating code, even when you ask them not to. But a few simple follow-up prompts can address that. Simply ask it to cover corner cases with tests, test all the known non-happy paths, look for weaknesses, verify adherence to SOLID principles, do security audits, etc. It will find issues. With bigger projects, you can actually make it file those issues in gh with labels and priorities. And then you can make it iterate on fixing issues with separate PRs.
On a recent project, I made it implement a simple benchmark test for measuring throughput. I had a hunch it was doing very suboptimal things. I then asked it to look for potential performance bottlenecks and use the benchmark to verify improvements. At that point I already had a lot of end-to-end tests to verify correctness. So, these performance tweaks were relatively low risk. I got about two orders of magnitude improvement and a lot more graceful behavior when pushed to the limit.
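A minimal sketch of what such a throughput benchmark could look like, not the one from that project. The two `count_duplicates_*` functions are hypothetical stand-ins for a "before" and "after" version of some hot path, and `measure_throughput` is an illustrative helper name. The correctness assertion plays the role of the end-to-end tests: it runs before any timing is trusted.

```python
import time
from typing import Callable, List


def measure_throughput(fn: Callable[[List[int]], int], batch: List[int], repeats: int = 5) -> float:
    """Return items processed per second, based on the fastest of `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(batch)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on a very coarse clock.
    return len(batch) / max(best, 1e-9)


def count_duplicates_slow(items: List[int]) -> int:
    """Hypothetical 'before' version: membership checks against a list are O(n) each."""
    seen: List[int] = []
    duplicates = 0
    for item in items:
        if item in seen:
            duplicates += 1
        else:
            seen.append(item)
    return duplicates


def count_duplicates_fast(items: List[int]) -> int:
    """Hypothetical 'after' version: same result, but membership checks use a set."""
    seen = set()
    duplicates = 0
    for item in items:
        if item in seen:
            duplicates += 1
        else:
            seen.add(item)
    return duplicates


if __name__ == "__main__":
    batch = [i % 500 for i in range(2000)]
    # Correctness first: an optimization that changes the answer is not an improvement.
    assert count_duplicates_slow(batch) == count_duplicates_fast(batch)
    print(f"slow: {measure_throughput(count_duplicates_slow, batch):,.0f} items/s")
    print(f"fast: {measure_throughput(count_duplicates_fast, batch):,.0f} items/s")
```

Taking the best of several runs reduces noise from other processes. The absolute numbers depend on the machine, so only the before/after ratio is meaningful.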
If you have a bit of experience engineering systems, just treat these tools like they are junior developers. Competent but likely to skip some essential steps. So, just double check with a lot of pointed questions: "did you do X? If not, do it now". Anything that needs repeated asking, turn it into a guard rail / skill.
There's a bit of effort and skill involved with this. I imagine a lot of less experienced developers might struggle to get good results because they aren't asking for the right things.
I'm following an Ideas -> PRD -> Issues -> Tasks methodology, where each task has a bunch of sub-tasks. I have it do just one sub-task at a time (or a few). I'm having it do Red/Green/Refactor as separate sub-steps, so I review the Red case and then, once that's good, have it do the Green and Refactor steps and review those.
I use these tools both at work and for personal side projects, and I was expecting to watch and learn. But there are way too many of these opinion pieces without examples now.
My goal is to draft the solution with AI, write it myself but faster with autocomplete, then throw an AI review at it.
When LLMs started being somewhat useful for coding a few years ago, I found they were in fact great at boilerplate (in fact pretty much only good at boilerplate, ca. 2023 or so). That got me thinking about all the accommodations we make in design and systems architecture that sort of tacitly reflect who we're working with and their strengths and weaknesses.
The modern models have their own very different strengths and weaknesses compared to humans, and deploying them is a really interesting exercise of different architectural and engineering skills. I've enjoyed it, and hope I continue to.
But! Because of AI I was able to rapidly hack out like 4 variants of this feature that I didn't like. And felt comfortable throwing them away just as quick.
People believe that you can only use LLMs for sloppy programming. But you can also use them to write ten times more code in the form of Swiss cheese model tests and domain-specific languages.
You write ten times more code than necessary and all that extra code is testing. Projects like SQLite do that because they need to be perfect.
Before LLMs we had to use engineers for that, and it was painful and repetitive work. They were always late and made many more mistakes than LLMs, especially because it was dull and tedious work for great engineers to spend their time on.
Now we write tests, and when all tests pass we write new tests to check the tests.
We divide each complex problem into small subproblems and we guarantee each of them by formal means. We have multiple ways of solving the same problem, usually with one brute force solution that is simple and guaranteed to work but inefficient, and we can use it to compare with more efficient methods.
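As an illustration of the brute-force-as-oracle idea (a sketch on a textbook problem, not the commenter's actual setup): the maximum-subarray sum is computed once by an obviously correct quadratic search and once by Kadane's linear algorithm, and the two are compared on seeded random inputs. The function names are illustrative. Note that this is a randomized comparison, not a formal proof.

```python
import random
from typing import List


def max_subarray_brute(nums: List[int]) -> int:
    """Reference oracle: try every non-empty contiguous slice. O(n^2), easy to trust."""
    best = nums[0]
    for i in range(len(nums)):
        running = 0
        for j in range(i, len(nums)):
            running += nums[j]
            best = max(best, running)
    return best


def max_subarray_fast(nums: List[int]) -> int:
    """Kadane's algorithm: O(n), less obviously correct than the oracle."""
    best = current = nums[0]
    for x in nums[1:]:
        # Either extend the running slice or start a new one at x.
        current = max(x, current + x)
        best = max(best, current)
    return best


def differential_test(trials: int = 500, seed: int = 0) -> int:
    """Compare both implementations on random inputs; return the number of trials run."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    for _ in range(trials):
        nums = [rng.randint(-50, 50) for _ in range(rng.randint(1, 30))]
        expected = max_subarray_brute(nums)
        actual = max_subarray_fast(nums)
        assert actual == expected, f"mismatch on {nums}: {actual} != {expected}"
    return trials


if __name__ == "__main__":
    print(f"{differential_test()} random cases agree")
```

The same pattern carries over to any pair of implementations with the same signature: keep the slow one around as test-only code and let it arbitrate whenever the fast one changes.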
Before machines could do that, the people doing it were burned out and exhausted, and always left work pending.
I open sourced it on GitHub, you may search alexwwang/tdd-pipeline to find it if you are interested in it.
I wonder how we can evaluate these two options: using AI to 100X the output versus using AI to advance one's craft.
In the meantime, the productivity gain of AI is real. Case in point: an engineering org at Snowflake met all its OKRs ahead of time in the first quarter, for the first time in the company's history. It had never happened before, and usually meeting 70% of the planned OKRs would be considered an achievement. I can imagine the stress of the engineers when they see such an outcome.
Having taste and the ability to author high quality prompts is still the most important thing. It was always the most important thing if you think abstractly about how all of this works.
You can very effectively iterate alone using the LLM as a mirror, rephrasing what you put in and adding a bit.
You can use LLMs to quickly create prototypes to give to other human beings to help you with the next iteration.
If you get something from someone else to iterate on, you can use the LLM to help you understand it by rephrasing things in a way that suits you better.
But instead, everything anybody seems to be talking about is one-shotting things and AI iterating with other AI.
The big problem here is that the one thing AI does not have is agency. The name "AI agent" is wishful thinking and marketing.
Man so much work to retrofit something that obviously, simply, plainly - just does not work. How about just writing the code yourself? You can even consult AI on the libraries or whatever, but how about just building that model in your head YOURSELF and not loading up on AI slop and trying to memorise that crap. The names of the functions will ring different in your memory once you spend some time thinking over whether you picked the right and clear name vs. just going with whatever statistical median the slop machine picked for you.
- Opus 4.7 writes the code
- I have GPT-5.5 in Codex review it (given context)
- I provide the review back to Opus and ask it to verify the review findings
- I have Opus plan the fixes, then execute them
- I ask GPT-5.5 to review the fixes and check whether they solve the problems
Also feels much better than pure vibe-coding (which I still do for personal projects that aren't mission critical for anyone).
The downside is you use fewer tokens.
It's still very slow. It took me two hours to write code that generates JSON data and then to write a web page that displays a knowledge graph.
One thing you have to be aware of is that the LLM will happily generate code for you, and you have to discipline it from time to time. I notice that my reading comprehension begins to suffer if I don't write the code myself and instead have to understand what the LLM wrote for me, as opposed to the LLM correcting where I went wrong.
One thing I would like to try with an LLM is understanding a large and complex existing codebase like OpenSCAD that doesn't leverage my existing skillset (high-level programming languages, with OpenSCAD as my primary language in the past year). That has always been a barrier to contribution for me.
I’m not exactly sure what <foo> is but I feel it. I think it’s quality and authenticity and craftsmanship. That difference between an expensive tool and a cheap one that you can’t easily describe but you just know it.
Is there a word for this? I bet the Japanese or Germans have a word for this.
I use AI a lot now. But I also do it in small steps. It isn’t a craftsman, but it can help me be one.
what's wrong with (depending on the language) checkstyle, sonarlint, ruff, mypy, xmllint, and/or eslint?
This reminds me of the article above. People now have diverse ideas on agentic coding. Some suggest human-in-the-loop, while others suggest giving a detailed specification and letting the agent run freely; some suggest leveraging LLMs' high productivity, and here we get an opinion that LLMs can actually write good code slowly.
It's nice to see more practical and varied opinions emerging, treating LLMs as literally a tool instead of something to be hated or hyped.
In my own practice, I find LLMs (SOTA ones) good at medium-level tasks, those that require reasoning and planning for a while. However, their design taste in architecture is unexpectedly disgusting. Sometimes writing interfaces myself and asking LLMs to fill in implementations, alongside context-completing tools like context7, deepwiki, docs.rs MCPs, etc., and giving it an escape hatch (e.g. encouraging it to use the AskUser tool in Claude Code), may be considered my best practice.
By default it uses pi agent core + pi ai (from the excellent pi coding agent) as a multi model runtime but also supports a Claude Agent SDK runtime.
I can have an implementation and review process of an OpenSpec change run anywhere from 2 hours to 24+ hours going through review/fix/verification rounds automatically until the implementation matches the spec and any additional reviewers are done finding issues after the fix rounds.
it's going to be fully open sourced in the next two weeks and fully free to use
Great how the promoters are mirroring the current anti-AI sentiment. The next step is canceling all subscriptions and not using AI at all. Maybe your mind will work again.
I can relate to this. When I spend time writing a unit test, even one that adds only 1% of code coverage, it's honestly a wholesome moment for me to ship it confidently.
- Using AI to write the best code ever faster than any human ever could
- Using AI to write better code more slowly
- Using AI to write code that sucks even more slowly
- Using AI to stockpile horrendous ball of spaghetti code no one fucking understands which grows faster and faster despite going even more slowly
- Using Natural Intelligence to try and fail to untangle the mountain of spaghetti code
- Look guys, down with that AI, we've got a brand new shiny thing to throw trillions of VC dollars at!
(that people upvote to post their own thinkpieces in the comments)
I'm not 100x'ing my output like some people claim, but using it as an augmentation rather than delegating my work to it results in better code, and I don't lose context / control over my codebases. I really have read 100% of the code, because the LLM is generating smaller pieces around and inside my own written code. Works well enough for me, and open models are already both cheap enough and good enough for this workflow. This is why the big companies are so desperate to push full-on agentic hands-off workflows and developer replacement - that's the only way they won't go bankrupt.
There is a reason it is called slop. On first sight it is often not noticeable but when you dig deeper, you realise that it is often spam-slop. Of course this can be improved upon, but often there is no real improvement and you waste your own time in hope that things get better. Which high quality projects exist that are AI slop generated? Can people name something that is used by many people? The linux kernel? Something in that range? Including documentation? To me it seems people are chasing a dream here: skynet should write the code and they can sit on the beach, enjoying sunshine and fruits.