And I get it. Coding with Claude Code really was prompting something, getting errors, and asking it to fix it. Which was still useful but I could see why a skilled coder adding a feature to a complex codebase would just give up
Opus 4.5 really is at a new tier however. It just...works. The errors are far fewer and often very minor - "careless" errors, not fundamental issues (like forgetting to add "use client" to a nextjs client component.
I even gave -True Vibe Coding- a whirl. Yesterday, from a blank directory and text file list of requirements, I had Opus 4.5 build an Android TV video player that could read a directory over NFS, show a grid view of movie poster thumbnails, and play the selected video file on the TV. The result wasn't exactly full-featured Kodi, but it works in the emulator and actual device, it has no memory leaks, crashes, ANRs, no performance problems, no network latency bugs or anything. It was pretty astounding.
Oh, and I did this all without ever opening a single source file or even looking at the proposed code changes while Opus was doing its thing. I don't even know Kotlin and still don't know it.
For instance, I always respected types, but I'm too lazy to go spend hours working on types when I can just do ruby-style duck typing and get a long ways before the inevitable problems rear their head. Now, I can use a strongly typed language and get the advantages for "free".
Oh absolutely. I've been using Python for past 15 or so years for everything.
I've never written a single line of Rust in my life, and all my new projects are Rust now, even the quick-script-throwaway things, because it's so much better at instantly screaming at claude when it goes off track. It may take it longer to finish what I asked it to do, but requires so much less involvement from me.
I will likely never start another new project in python ever.
EDIT: Forgot to add that paired with a good linter, this is even more impressive. I told Claude to come up with the most masochistic clippy configuration possible, where even a tiny mistake is instantly punished and exceptions have to be truly exceptional (I have another agent that verifies this each run).
I just wish there was cargo-clippy for enforcing architectural patterns.
I find I'm doing more Typescript projects than Python because of the superior typing, despite the fact I prefer Python.
Going to one-up you though -- here's a literal one-liner that gets me a polished media center with beautiful interface and powerful skinning engine. It supports Android, BSD, Linux, macOS, iOS, tvOS and Windows.
`git clone https://github.com/xbmc/xbmc.git`
A big part of the problem with existing software is that humans seem to be pretty much incapable of deciding a project is done and stop adding to it. We treat creating code like a job or hobby instead of a tool. Nothing wrong with that, unless you're advertising it as a tool.
At work I only have access to calude using the GitHub copilot integration so this could be the cause of my problems. Claude was able to get slthe first iteration up pretty quick. At that stage the app could create a plot and you could interact with it and ask follow up questions.
Then I asked it to extend the app so that it could generate multiple plots and the user could interact with all of them one at a time. It made a bunch of changes but the feature was never implemented. I asked it to do again but got the same outcome. I completely accept the fact that it could just be all because I am using vscode copilot or my promoting skills are not good but the LLM got 70% of the way there and then completely failed
You really need to at least try Claude Code directly instead of using CoPilot. My work gives us access to CoPilot, Claude Code, and Codex. CoPilot isn’t close to the other more agentic products.
but I do want a better way to glance and keep up with what its doing in longer conversations, for my own mental context window
I'm finding that branding and graphic design is the most arduous part, that I'm hoping to accelerate soon. I'm heavily AI assisted there too and I'm evaluating MCP servers to help, but so far I do actually have to focus on just that part as opposed to babysit
If you were someone saying at the start of 2025 "this is a flash in the pan and a bunch of hype, it's not going to fundamentally change how we write code", that was still a reasonable belief to hold back then. At the start of 2026 that position is basically untenable: it's just burying your head in the sand and wishing for AI to go away. If you're someone who still holds it you really really need to download Claude Code and set it to Opus and start trying it with an open mind: I don't know what else to tell you. So now the question has shifted from whether this is going to transform our profession (it is), to how exactly it's going to play out. I personally don't think we will be replacing human engineers anytime soon ("coders", maybe!), but I'm prepared to change my mind on that too if the facts change. We'll see.
I was a fellow mind-changer, although it was back around the first half of last year when Claude Code was good enough to do things for me in a mature codebase under supervision. It clearly still had a long way to go but it was at that tipping point from "not really useful" to "useful". But Opus 4.5 is something different - I don't feel I have to keep pulling it back on track in quite the way I used to with Sonnet 3.7, 4, even Sonnet 4.5.
For the record, I still think we're in a bubble. AI companies are overvalued. But that's a separate question from whether this is going to change the software development profession.
I'm hoping after AI comes back down to earth there will be a new glut of cheap second hand GPUs and RAM to get snapped up.
... says it all.
This is what people are still doing wrong. Tools in a loop people, tools in a loop.
The agent has to have the tools to detect whatever it just created is producing errors during linting/testing/running. When it can do that, I can loop again, fix the error and again - use the tools to see whether it worked.
I _still_ encounter people who think "AI programming" is pasting stuff into ChatGPT on the browser and they complain it hallucinates functions and produces invalid code.
Well, d'oh.
I was having Opus investigate it and I kept building and deploying the firmware for testing.. then I just figured I'd explain how it could do the same and pull the logs.
Off it went, for the next ~15 minutes it would flash the firmware multiple times until it figured out the issue and fixed it.
There was something so interesting about seeing a microcontroller on the desk being flashed by Claude Code, with LEDs blinking indicating failure states. There's something about it not being just code on your laptop that felt so interesting to me.
But I agree, absolutely, red/green test or have a way of validating (linting, testing, whatever it is) and explain the end-to-end loop, then the agent is able to work much faster without being blocked by you multiple times along the way.
While Claude is amazing at writing code, it still requires human operators. And even experienced human operators are bad at operating this machinery.
Tell your average joe - the one who thinks they can create software without engineers - what "tools-in-a-loop" means, and they'll make the same face they made when you tried explaining iterators to them, before LLMs.
Explain to them how typing system, E2E or integration test helps the agent, and suddenly, they now have to learn all the things they would be required to learn to be able to write on their own.
My team is using it with Claude Code and say it works brilliantly, so I'll be giving it another go.
How much of the value comes from Opus 4.5, how much comes from Claude Code, and how much comes from the combination?
Copilot has by far the best and most intuitive agent UI. Just make sure you're in agent mode and choose Sonnet or Opus models.
I've just cancelled my Claude sub and gone back and will upgrade to the GH Pro+ to get more sonnet/opus.
The gap didn't seem big, but in November (which admittedly was when Opus 4.5 was in preview on Copilot) Opus 4.5 with Copilot was awful.
It's an opportunity, not a problem. Because it means there's a gap in your specifications and then your tests.
I use Aider not Claude but I run it with Anthropic models. And what I found is that comprehensively writing up the documentation for a feature spec style before starting eliminates a huge amount of what you're referring to. It serves a triple purpose (a) you get the documentation, (b) you guide the AI and (c) it's surprising how often this helps to refine the feature itself. Sometimes I invoke the AI to help me write the spec as well, asking it to prompt for areas where clarification is needed etc.
With the latest models if you're clear enough with your requirements you'll usually find it does the right thing on the first try.
1) There exists a threshold, only identifiable in retrospect, past which it would have been faster to locate or write the code yourself than to navigate the LLM's correction loop or otherwise ensure one-shot success.
2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
This is why you use this AI bubble (it IS a bubble) to use the VC-funded AI models for dirt cheap prices and CREATE tools for yourself.
Need a very specific linter? AI can do it. Need a complex Roslyn analyser? AI. Any kind of scripting or automation that you run on your own machine. AI.
None of that will go away or suddenly stop working when the bubble bursts.
Within just the last 6 months I've built so many little utilities to speed up my work (and personal life) it's completely bonkers. Most went from "hmm, might be cool to..." to a good-enough script/program in an evening while doing chores.
Even better, start getting the feel for local models. Current gen home hardware is getting good enough and the local models smart enough so you can, with the correct tooling, use them for suprisingly many things.
I can run multiple agents at once, across multiple code bases (or the same codebase but multiple different branches), doing the same or different things. You absolutely can't keep up with that. Maybe the one singular task you were working on, sure, but the fact that I can work on multiple different things without the same cognitive load will blow you out of the water.
> 2) The intuition and motivations of LLMs derive from a latent space that the LLM cannot actually access. I cannot get a reliable answer on why the LLM chose the approaches it did; it can only retroactively confabulate. Unlike human developers who can recall off-hand, or at least review associated tickets and meeting notes to jog their memory. The LLM prompter always documenting sufficiently to bridge this LLM provenance gap hits rub #1.
Tell the LLM to document in comments why it did things. Human developers often leave and then people with no knowledge of their codebase or their "whys" are even around to give details. Devs are notoriously terrible about documentation.
> 3) Gradually building prompt dependency where one's ability to take over from the LLM declines and one can no longer answer questions or develop at the same velocity themselves.
You can't develop at the same velocity, so drop that assumption now. There's all kinds of lower abstractions that you build on top of that you probably can't explain currently.
> 4) My development costs increasingly being determined by the AI labs and hardware vendors they partner with. Particularly when the former will need to increase prices dramatically over the coming years to break even with even 2025 economics.
You aren't keeping up with the actual economics. This shit is technically profitable, the unprofitable part is the ongoing battle between LLM providers to have the best model. They know software in the past has often been winner takes all so they're all trying to win.
That's great that this is your experience, but it's not a lot of people's. There are projects where it's just not going to know what to do.
I'm working in a web framework that is a Frankenstein-ing of Laravel and October CMS. It's so easy for the agent to get confused because, even when I tell it this is a different framework, it sees things that look like Laravel or October CMS and suggests solutions that are only for those frameworks. So there's constant made up methods and getting stuck in loops.
The documentation is terrible, you just have to read the code. Which, despite what people say, Cursor is terrible at, because embeddings are not a real way to read a codebase.
One trick I use that might work for you as well:
Clone GitHub.com/simonw/datasette to /tmp
then look at /tmp/docs/datasette for
documentation and search the code
if you need to
Try that with your own custom framework and it might unblock things.If your framework is missing documentation tell Claude Code to write itself some documentation based on what it learns from reading the code!
It turns out that Waterfall was always the correct method, it's just really slow ;)
There's a lesser chance that you're working on a code base that Claude Code just isn't capable of helping with.
Putting “Make minimal changes” in my standard prompt helped a lot with the tendency of basically all agents to make too many changes at once. With that addition it became possible to direct the LLM to make something similar to the logical progression of commits I would have made anyway, but now don’t have to work as hard at crafting.
Most of the hype merchants avoid the topic of maintainability because they’re playing to non-technical management skeptical of the importance of engineering fundamentals. But everything I’ve experienced so far working with LLMs screams that the fundamentals are more important than ever.
It's not just the deficiencies of earlier versions, but the mismatch between the praise from AI enthusiasts and the reality.
I mean maybe it is really different now and I should definitely try uploading all of my employer's IP on Claude's cloud and see how well it works. But so many people were as hyped by GPT-4 as they are now, despite GPT-4 actually being underwhelming.
Too much hype for disappointing results leads to skepticism later on, even when the product has improved.
The funny part about rapidly changing industries is that, despite the fomo, there's honestly not any reward to keeping up unless you want to be a consultant. Otherwise, wait and see what sticks. If this summer people are still citing the Opus 4.5 was a game changing moment and have solid, repeatable workflows, then I'll happily change up my workflow.
Someone could walk into the LLM space today and wouldn't be significantly at a loss for not having paid attention to anything that had happened in the last 4 years other than learning what has stuck since then.
LMAO what???
Two big examples:
- Period from early mvc JavaScript frontends (backbone.js etc) and the time of the great React/Angular wars. I completely stepped out of the webdev space during that time period.
- The rapid expansion of Deep Learning frameworks where I did try to keep up (shipped some Lua torch packages and made minor contributions to Pylearn2).
In the first case, missing 5 years of front-end wars had zero impact. After not doing webdev work at all for 5-years I was tasked with shipping a React app. It took me a week to catch up, and everything was deployed in roughly the same time as someone would have had they spent years keeping up with changes.
In the second case, where I did keep up with many of the developing deep learning frameworks, it didn't really confer any advantage. Coworkers who I worked with who started with Pytorch fresh out of school were just as proficient, if not more so, with building models. Spending energy keeping up offered no value other than feeling "current" at the time.
Can you give me a counter example of where keeping up with a rapidly changing area that's unstable has conferred a benefit to you? Most of FOMO is really just fear. Again, unless you're trying to sell your self specifically as a consultant on the bleeding edge, there's no reason to keep up with all these changes (other than finding it fun).
Right or wrong - doesn’t matter. You typed in a line of text and now your computer is making 3000 word stories, images, even videos based on it
How are you NOT astounded by that? We used to have NONE of this even 4 years ago!
-- Charles Babbage
Literally tried it yesterday. I didn't see a single difference with whatever model Claude Code was using two months ago. Same crippled context window. Same "I'll read 10 irrelevant lines from a file", same random changes etc.
Create a markdown document of your task (or use CLAUDE.md), put it in "plan mode" which allows Claude to use tool calls to ask questions before it generates the plan.
When it finishes one part of the plan, have it create a another markdown document - "progress.md" or whatever with the whole plan and what is completed at that point.
Type /clear (no more context window), tell Claude to read the two documents.
Repeat until even a massive project is complete - with those 2 markdown documents and no context window issues.
... Proceeds to explain how it's crippled and all the workarounds you have to do to make it less crippled.
No - that's not what I did.
You don't need an extra-long context full of irrelevant tokens. Claude doesn't need to see the code it implemented 40 steps ago in a working method from Phase 1 if it is on Phase 3 and not using that method. It doesn't need reasoning traces for things it already "thought" through.
This other information is cluttering, not helpful. It is making signal to noise ratio worse.
If Claude needs to know something it did in Phase 1 for Phase 4 it will put a note on it in the living markdown document to simply find it again when it needs it.
Yup. There's some magical "right context" that will fix all the problems. What is that right context? No idea, I guess I need to read a yet-another 20 000-word post describing magical incantations that you should or shouldn't do in the context.
The "Opus 4.5 is something else/nex tier/just works" claims in my mind means that I wouldn't need to babysit its every decision, or that it would actually read relevant lines from relevant files etc. Nope. Exact same behaviors as whatever the previous model was.
Oh, and that "200k tokens context window"? It's a lie. The quality quickly degrades as soon as Claude reaches somewhere around 50% of the context window. At 80+% it's nearly indistinguishable from a model from two years ago. (BTW, same for Codex/GPT with it's "1 million token window")
1) define problem
2) split problem into small independently verifiable tasks
3) implement tasks one by one, verify with tools
With humans 1) is the spec, 2) is the Jira or whatever tasksWith an LLM usually 1) is just a markdown file, 2) is a markdown checklist, Github issues (which Claude can use with the `gh` cli) and every loop of 3 gets a fresh context, maybe the spec from step 1 and the relevant task information from 2
I haven't ran into context issues in a LONG time, and if I have it's usually been either intentional (it's a problem where compacting wont' hurt) or an error on my part.
It's in your interest to deal with your frustration and figure out how you can leverage the new tools to stay relevant (to the degree that you want to).
Regarding the context window, Claude needs thinking turned up for long context accuracy, it's quite forgetful without thinking.
All I can tell you is that in my own lived experience, I've had some fantastic results from AI, and it comes from telling it "look at this thing here, ok, i want you to chain it to that, please consider this factor, don't forget that... blah blah blah" like how I would have spelled things out to a junior developer, and then it really does stand a really solid chance of turning out what I've asked for. It helps a lot that I know what to ask for; there's no replacing that with AI yet.
So, your own situation must fall into one of these coarse buckets:
- You're doing something way too hard for AI to have a chance at yet, like real science / engineering at the frontier, not just boring software or infra development
- Your prompts aren't specific enough, you're not feeding it context, and you're expecting it to one-shot things perfectly instead of having to spend an afternoon prompting and correcting stuff
- You're not actually using and getting better at the tools, so you're just shouting criticisms from the sidelines, perhaps as sour grape because you're not allowed by policy / company can't afford to have you get into it.
IDK. I hope it's the first one and you're just doing Really Hard Things, but if you're doing normal software developer stuff and not seeing a productivity advantage, it's a fucking skill issue.
2. It's the same workarounds we've been doing forever
3. It's indistinguishable from "clear context and re-feed the entire world of relevant info from scratch" we've had forever, just slightly more automated
That's why I don't understand all the "it's new tier" etc. It's all the same issues with all the same workarounds.