Saving money on tokens isn't something that's rewarded during performance reviews, particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
What they wanted was for them to use both and give feedback on which was better.
The developers voted with their feet and didn’t use Copilot.
What Microsoft were hoping was that the opposite would happen...
Underlying model choice still has no restrictions. Opus 4.6 is by far the most popular. There are still big $$$ bills going Anthropic's way.
I had far more hallucinations with 4.7 than 4.6.
I'll try it again after a few more months for them to get it right, but 4.6 is what changed my mind on LLMs as a tool, and 4.7 felt like a step backwards, so for now I'm sticking with something that has delivered me value, instead of arguing with a model ostensibly better that was making shit up 1 - 2 times a day. It was really disappointing.
I can give examples if needed, I screenshotted the most aggravating ones, but what worries me is which ones I didn't recognise.
4.7 IMO is around 10-20% worse at understanding the intent of your prompt. You need more effort to explain your intention clearly so it doesn't divert.
Although GPT's been acting weird since Thursday...
I've spent the last couple of days building Swift bindings to a monster CPP lib and I've actually had fun.
And you get token-based pricing since June 1.
Personally, I looked into Copilot's prompt and saw things that made me put it down immediately to start working on my own. I'm now using OpenCode for reasons and I like it better than any Big AI tool. Using OC with Qwen3.6-MoE (for context) and generally happy with the results.
This was true in January -- since then, the Copilot CLI team has spent countless hours with engineering leaders and the biggest Claude Code users at the company to understand Copilot's shortcomings, define evals to properly test them head-to-head, and close the gap between the products.
The result? Claude Code usage was organically decreasing and Copilot CLI usage was organically increasing -- when this announcement was made, internal Copilot CLI usage had been greater than Claude Code usage for weeks!
Honestly I find GitHub Copilot CLI (and now also the new GitHub Copilot app) quite decent. I mostly use it with Opus 4.7, or rarely with GPT-5.5. The VSCode extension is ok, but CLI or app are the better experience IMO.
All I can say is I know because I know. There's been some "synergizing" among the corporation about the CLI team running off to do their own thing and adding features to the CLI that amount to trying to force a Terminal to act like a GUI.
https://code.visualstudio.com/docs/copilot/reference/workspa...
At the moment it seems like the way it's been trained has been tightly coupled with grep.
It does feel bizarre though that it doesn't use the symbol servers.
Especially if you want effective results.
MS thinks Copilot is the Clark Griswold of LLMs when it's really Cousin Eddie...
These days I just use Claude Code Desktop or Claude Code in PowerShell. Standalone, not inside an IDE. Honestly, I'm using Desktop more and more as it gets more features.
The IDE is for me. No AI in it at all. If I want to get Claude to do something specific to a file I just @ the file.
If your response is "my prompts don't produce code that needs values flipped, ever," then I would wager you're only touching very simple things with an LLM.
For me, it's less about the token cost and prompt writing than the fact that it's just faster to change 0 to 1 myself, and it leaves me twiddling my thumbs waiting for LLM output less.
Tab completion.
A smart model can cut down the time to write complex firewall YAML dramatically, relying both on the existing file and the ugly draft (e.g. comma-delimited details of the rules I need) I put out. It makes it 5 minutes of lead time and 20 presses of tab instead of writing a shell/Python script full of edge cases or just copying existing rules as a template and laboriously editing them -- a smart model knows what the specific firewall needs.
But I'm not a developer, so I use both - haiku via github for tab completion and CC for cli.
I can also click on a file referenced by the AI and have it open immediately in the IDE so that I can inspect it.
Finally, it is a pain to write long, multi-line prompts in a CLI where you can't easily click around to edit different parts.
The primary weakness I've found in IDE based UI is that it struggles to get through the corporate security in order to run commands.
All of these are valid use cases of the VSCode CC extension for me.
Obviously you want to be aware of what else is on the market, and use the right tool for the job -- but equally if you have a directly competing product, you'd prefer your org's telemetry and suggestions are directed towards improving your own software rather than your competitors'.
Compared to working at other big tech companies, where I was able to direct-message the engineers on the team for internal protobuf or datalake services (in addition to user groups that were generally responsive), it was just strange. Also, Microsoft doesn't have a monorepo, so you can't just commit patches to their service because you don't have access to their repos, which I pretty regularly do elsewhere.
The Copilot CLI has ushered in the beginning of a change in this dogma -- I've helped dozens of Microsoft engineers get access to GitHub source code so they can contribute to Copilot CLI! It's fun to subvert expectations when a Microsoft IC pitches an improvement and I can respond with "submit a PR!"
Technically we're using Copilot and we're paying for it through Microsoft licenses, but it's using Opus 4.7. Even before this, most of our custom agents within M365 Copilot were one of the GPT models.
Or maybe you're right and they want their developers to use the copilot models.
I haven't really used any other Copilot product in a while since they were so bad compared to our other corporate options, but I'm rather impressed with Cowork inside it. Exactly because we can actually use it without breaking any EU laws.
There's a large (and growing!) contingent of people who don't write code these days. (Many don't even use the keyboard.)
I think Kiro might have some “first mover” advantage internally, but CC feels better to use.
GitHub Copilot is in a somewhat similar place as Microsoft's toy but still different -- it was more or less the first coding agent/assistant, and GitHub/VSCode/Microsoft has enough user base and impact to influence individual users and enterprises' choices.
For Amazon's coding agent -- I just never see anyone outside Amazon even mention Kiro or Amazon Q. Maybe a little bit when Kiro was offering tons of free credits. But I don't think it's even remotely relevant these days. I don't see news about companies adopting Kiro.
To me, it's just a matter of time before they are sunset, like Chime or a bunch of AWS products.
For Kiro, I agree with you, it seems like wasted effort and Anthropic / OpenAI are miles ahead in their tooling.
I love AWS at the infrastructure level, but their PaaS tends to be meh, and their end-user directed stuff is usually atrocious.
I've tried throwing unsupervised agentic software factory workflows against the wall, and they burned through my tokens like nobody's business but didn't produce much.
Supervised, human-in-the-loop process on the other hand is much more productive but doesn't consume nearly as much. Maybe that's why everyone's pushing agentic approaches so much.
Dealt with that by going all out and making an agentic parallel code review skill. Basically an infinite TODO list generator. Now I'm definitely getting 100% of the usage I paid for. It really burns tokens like nobody's business, and catches a lot of issues while at it. I've been looping this review/fix process every week. It's dramatically reduced the amount of stuff I need to pay attention to during my human review sessions.
There is this real danger that our thinking, and the things we make, become bloated without constraints.
IMO software has gone to shit since both mobile phones and laptops mostly have massive amounts of compute. We always seem to use it to the limit, just because it's there.
At least it's doing something productive instead of just sinking money into literal gambling simulators. Mercifully, unlike video games, automation is not "cheating".
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
There are many "critics", one for each quality I want reviewed. Correctness, consistency, maintainability, security, testing... Everything I could think of, and I keep adding more.
https://github.com/matheusmoreira/.files/tree/master/~/.clau...
The scrutinize skill is the entry point. The Opus I'm talking to becomes an agent coordinator. He explores and autodiscovers the project's structure, subdivides it into logical sections.
Then he runs a truly absurd critic x section matrix against the entire project. Literally hundreds of these agents running in parallel, each focusing on one area. Ten minutes of this is enough to exhaust my Max 5x five hour window and put a serious dent in the weekly usage numbers.
It literally takes days to run a full agent sweep. I designed it around the rate limiting. The agents do file system style journaling in order to resume cleanly. They commit all of their findings as they go into an orphan branch in the repository. Further review runs can build on it and avoid searching for known issues.
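To make the critic x section matrix and the resumable journaling concrete, here is a minimal sketch of the idea in Python. It is not the actual skill: the critic and section names are made up, `review` is a hypothetical stand-in for launching an agent, and the journal is a local JSONL file rather than commits on an orphan branch.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Hypothetical names; the skill described above autodiscovers the sections.
CRITICS = ["correctness", "consistency", "security"]
SECTIONS = ["parser", "runtime"]


def load_done(journal: Path) -> set[tuple[str, str]]:
    """Read the (critic, section) pairs already recorded in the journal."""
    if not journal.exists():
        return set()
    done = set()
    for line in journal.read_text().splitlines():
        if line.strip():
            entry = json.loads(line)
            done.add((entry["critic"], entry["section"]))
    return done


def pending_jobs(journal: Path) -> list[tuple[str, str]]:
    """Every cell of the critic x section matrix that has not been journaled."""
    done = load_done(journal)
    return [job for job in itertools.product(CRITICS, SECTIONS) if job not in done]


def record(journal: Path, critic: str, section: str, findings: list[str]) -> None:
    """Append one finished job so a later run can skip it."""
    entry = {"critic": critic, "section": section, "findings": findings}
    with journal.open("a") as fh:
        fh.write(json.dumps(entry) + "\n")


def sweep(journal: Path, review, budget: int) -> int:
    """Run at most `budget` pending jobs (a crude stand-in for a rate limit)."""
    ran = 0
    for critic, section in pending_jobs(journal):
        if ran >= budget:
            break
        record(journal, critic, section, review(critic, section))
        ran += 1
    return ran
```

Calling `sweep` repeatedly with the same journal path picks up where the previous call stopped, which is roughly what resuming after a rate-limit window looks like in this toy version.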
The way it works in practice is I just run /scrutinize sweep and then go work on something else, or just go do my actual job, live my life, play video games, write an article for my blog or something. Come back five hours later to either resume the process or check the literally hundreds of issues that have been found by all the agents. Then Claude and myself will go in and evaluate and fix all of those issues one by one. Then review again. Then evaluate/fix again. I'm just gonna keep looping this over and over until zero issues are found. For all of my projects.
Going from solo hobbyist programmer to this was pretty insane. I can only imagine what these corporations with infinite money must be doing.
Isn't this a (mildly exaggerated) description of AWS, which is a very successful service?
So your costs scale with the number of users you have.
That's an opex that you can explain.
For tokens for developers it's maybe closer, cost/outcome wise, to hiring an external consulting company to write your code; money paid scales with work done, no promise of delivery, arbitrary unpredictable external price changes.
It's not quite the same, though similarly lucrative for consultants.
Like the other commenter said: cloud spend can also spin out of control if you don't pay attention, yet we've found ways to keep it under control (training, guardrails, limits, transparency).
Personally, this feels like it's just trying to push the work of managers in allocating resources onto developers so that they have more work to do and can be blamed if anything goes wrong.
Yes, but in a "oops this is gonna take another two months to finish" kind of way, not the "oops this is the 12th time this month 8 developers have burned $2K in tokens in a single day and no one really knows how it happened" kind of way.
People still can't get over the unreasonable effectiveness of algorithms.
I get the anti/skeptic sentiment. I've been called a lot of horrible things by a vocal contingent when they hear that I help train folks to learn software engineering best practices and then apply AI to that.
If this is the "analogy" you go for, you don't seem to be suited to make that comparison.
Colleague used Sonnet 4.6 on some pretty normal agentic coding tasks through AWS Bedrock to keep the data in the EU, 100 EUR usage in a single day. In comparison, the Mistral subscription costs about 20 EUR per month and we tested that for similar tasks it was okay, the usage got to around 10% of that monthly limit in a single day. Or Anthropic's own Max (5x) plan where you get way, way more tokens to do with as you please.
I feel like the sweet spot is having a monthly subscription with any of the providers (you're subsidized a bunch), but if you have to pay per token, now I'd just look in the direction of what tasks DeepSeek would be okay for, sadly probably not in the situation above. For a startup, though...
On the other hand, this feels a bit hypocritical:
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time, and sources tell me that Claude Code has proved very popular inside Microsoft over the past six months.
They're gonna say that the future is all AI... until they get the bill.
I upgraded my plan last night to Mistral Le Chat Teams. This now costs me €60 per month for two users. Limits have been reset, but I have no idea now if my per seat limit is higher than the Pro plan, or if the limit is shared between the seats, it’s really not clear. I guess I will find out next month. The limits reset on the first of the month and I really hope I don’t hit them in the next seven days.
I use Mistral Vibe CLI and I’ve written and implemented a couple of new skills[1]. Caveman, based on an idea I found online somewhere, this skill removes all extraneous response text, including articles. Makes for some fun reading, but supposedly reduces output tokens significantly. Hash-anchors, this one is based on a concept from Dirac[2], reduces search failures and also includes multi-file dispatch. It will be hard to measure, but Vibe tells me these two should result in roughly a 40% reduction in token burn.
The results for a function implementation and test of levenshtein distance in js are pretty similar but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.
Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. The kind of problem where even small/bad models can excel. The golden standard for those tasks is just "use a library" so no wonder the beefy models are expensive: you're chartering a commercial airplane to go grocery shopping.
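For a sense of how small the problem is, here is a minimal Python sketch of the textbook dynamic-programming solution (the benchmark mentioned above was in JS; this is not the code any of the models produced).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using two rows of the DP table."""
    # Keep the shorter string as the row to minimise memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```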
My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.
There's where even frontier models struggle, which makes comparisons meaningful.
I mean, they will continue to say so, they just want to be the ones being paid for the service, not Anthropic :)
I tend to work with the agent, and observe what's going on as well as review/test and work through results/changes. I spend a lot more time planning tasks/features than the execution, even using the agent as part of planning and pre-documentation. It works really well. I don't think people burning through the 5hr allotment in under an hour are actually reviewing/QC/QA the results of what they're doing in any meaningful way, and likely producing as much garbage as good (slop).
I'm really curious as to HOW the MS employees were using the agents as much as what they were doing.
By buying a subscription and dealing with the limits. Using Claude Code and paying per token seems like the fast lane to the poor house.
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Me: Are you sure?
Claude: Well actually there is a bug <more random stuff that looks right this time>
----- Now it is:
Me: We need to do this this that.
Claude: <random stuff that approximates human output>
Claude: Let me consult the advisor on that.
Claude: advisor came up with some advice, adjusting according to that. <more random stuff that looks right this time>
> I understand that Microsoft is planning to remove most of its Claude Code licenses and push many of its developers to use Copilot CLI instead. While Claude Code has been a popular addition, it has also undermined Microsoft’s new GitHub Copilot CLI coding tool — a command line version of GitHub Copilot that runs outside of development apps like Visual Studio Code.
And people here are interpreting this as related mainly to Claude burning too many tokens too quickly and suggesting Microsoft should rather use SomeOtherLLM©?
Is this Hacker News or rather Marketing Wars?
No public forum is naturally immune to the spread of (guerilla) marketing. [1]
[1] Internet Rule #48
That message from Carlos's son
I launched an internal demo of Claude Code and DeepSeek on the same day and we burned through our monthly allowance for Claude in just over a week, with more than half of that budget being spent in one day. With DS people are unable to go through that same amount of money in a month, not even close.
With that Claude feels like an expensive toy, while DS is a shovel, purely because developers do not feel like they are eating into a precious resource while using it. Also it does not feel like there is much of a difference in capability between Claude and DS-pro. DS-pro and flash do feel like sonnet/opus and haiku, but flash is still very-very capable.
After 2 weeks of Claude getting progressively worse and worse, today was the final straw.
I don't care if they have a phone app. The model is COMPLETE garbage after you subscribe long enough and they think they've "got you".
I can't code on my phone if the model literally moves in the wrong direction and does the opposite of what I tell it to. If I wanted to make my code worse, I'd just randomly commit garbage. I don't need a mobile app for that.
Opus has been dumb this week.
Claude was having a lot of capacity problems and downtime and then this week that has been much less obvious... and the model is dumber.
It could also just be luck and my impressions are false... who knows.
When you're on a mature codebase with 500k+ lines of code, I haven't seen anything else be as effective as 4.7.
People heard "Claude is nerfed" and now they see it everywhere, they notice failures a lot more than they would have otherwise.
Doesn't matter that Claude is not, in fact, nerfed. Perception is powerful and most humans are not rational.
Tell it what to do.
Commit, push to origin, review on GitHub.
Tell it to make changes, amend the commit, push --force-with-lease.
I'm attempting to make a memory safe language like Rust but with a substantially lower learning curve and added safety (but non-zero cost abstractions) fully with AI, almost entirely from my phone, commuting, getting coffee, walking the dog, between sets at the gym, replacing doom scrolling before bed and during lunch, etc.
Mostly to test how much LLMs can actually scale development.
Depending on how long it takes them to clean up some architectural slop in the MIR lowering phase, the results could either be very impressive or not.
From a purely cost basis perspective, it's hard to argue they aren't killing it.
But from a multiplier perspective, it's up in the air how great they are.
It's proven to be a really nice experiment, because much of what I wanted to solve with a language is the problems inherent to LLM development.
So at the self hosting phase, I get a great opportunity to see if the language can actually deliver on what I dream for.
This was all supposed to be worked out prior to Cloud Next, but it wasn't. Ironically, they mentioned Claude in a few of their presentations at Next.
And that was our solution. We are a big GCP customer but our whole team is on Claude now and much happier.
1. Right now, usage correlates with experimentation and learning; few, if any, know how to make these things effective on their own over long sessions of activity.
2. Long term, you should be using more than one agent at a time, because they are running in the background based on events (new direct message / something happened in e.g. GitHub).
This would never fly if stock market was rational. But it never is.
I wonder if this will happen before there is some obligatory debloating of the investors' exposure to the company.
With research and hardware near guaranteed to bring the efficiency way up, I'm not scared here of massive price hikes.
There is no moat.
This is, in my opinion, tripe. SWEs are being laid off because of post-Covid over-hiring. The only evidence for labour destruction is in junior hires. But not because anyone is being fired, but because entry-level jobs are being cannibalised.
Nobody can make a profit with AI. Any clever idea can be cloned with AI, competition makes it unprofitable. No moat, no arbitrage opportunity. "During the gold rush, the only people making money were the men selling shovels."
We can definitely do amazing things with AI, and it makes us have superpowers, but so does everyone else. My competition also uses AI. I have to keep up with an AI powered competition now.
So you're getting 2 for the price of 1.5. Scale that up to 500 devs at a big company and it's a big chunk of change saved on payroll.
Keeping your headcount or hiring humans instead, AI would have to start to cost upwards of $15k/month/developer or more before it costs more than hiring. You're looking at about 4 billion tokens per month before humans start to break even or are cheaper.
But even taking a more realistic 1.25x (20% time savings) gain, let's say you drop from 500 to 400 devs, you'd have to hit around $4,000/dev/month in token spend before hiring humans again would break even.
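The break-even arithmetic can be sketched as below. The $16,000/month fully loaded cost per developer is an assumption, chosen because it reproduces the $4,000 figure; it is not stated in the comment, and the function name is made up for illustration.

```python
def break_even_spend_per_dev(
    headcount: int, speedup: float, monthly_cost_per_dev: float
) -> float:
    """Monthly token spend per remaining dev at which the payroll saving is used up."""
    remaining = headcount / speedup                       # devs needed for the same output
    saved_payroll = (headcount - remaining) * monthly_cost_per_dev
    return saved_payroll / remaining


ASSUMED_MONTHLY_COST = 16_000  # assumption: fully loaded cost per dev per month

print(break_even_spend_per_dev(500, 1.25, ASSUMED_MONTHLY_COST))  # 4000.0
print(break_even_spend_per_dev(500, 2.0, ASSUMED_MONTHLY_COST))   # 16000.0
```

Algebraically this reduces to `(speedup - 1) * monthly_cost_per_dev`, so the headcount cancels out and only the speedup and the cost of a developer matter.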
Payroll is just expensive, in most companies it's by far the biggest expense. AI still has to cost drastically more before investors would call it out as being worse than increasing headcount, from a pure dollars perspective.
While LLM Opex is "some future quarter" and very easy to co-mingle with other expenses.
I found Opus 4.7 to be slow and wasteful with token usage. It's shocking how inefficient it is with tasks like bash tool usage and web searching, delegating them to a dozen subagents only to get stuck and never return until you esc and intervene. That, in addition to all of the broken tooling Anthropic built in to limit token usage, like the broken monitoring tool, made managing Claude a chore. I was happy to pay $200/month for Opus 4.5 when they had more capacity, but 4.7 felt like a huge step back and no longer worth the price and inconvenience.
I remember an OpenAI employee comment on the GPT5.5 release post about how they specifically geared it towards long-horizon tasks, and it's been a breath of fresh air in that regard. I have five two-week long sessions going right now and there's been no degradation in performance or efficiency. It's much better at carrying rules/learnings forward even in long-running sessions and grounding/refreshing itself in verified facts when it loses context.
It's funny because in two weeks I've gotten way more done with GPT5.5 with way fewer tokens and way less handholding. I think this goes to show how important tooling and the harness are and how a capable model like Opus 4.7 can be severely handicapped by bad product decisions.
I expect the r/LocalLLaMA guys to be going nuts about this news.
> It was part of an effort to get project managers, designers, and other employees to experiment with coding for the first time.
I suspect they weren't as efficient as they could be with token use either. Sounds like they were trying to encourage non-developers to vibe code stuff.
Between Copilot, Claude, and Gemini, I still actually prefer Gemini. I do a lot of scientific writing in addition to coding and Gemini is the only model I can trust to “just be right”. This trust then transfers over to its code output.
GitHub Copilot offered probably the best value and was IMO underappreciated for a long time; I've been an annual subscriber since day 1.
The changes announced a few days ago completely revoke that value proposition, I doubt I'll continue with it.
Changes to GitHub Copilot individual plans
https://news.ycombinator.com/item?id=47838508
GitHub Copilot is moving to usage-based billing
https://news.ycombinator.com/item?id=47923357
Multipliers for annual subscribers:
https://docs.github.com/en/copilot/reference/copilot-billing...
New pricing model changes that. I will still keep it around for autocompletion (for the rare times when I open up an editor).
Arguably, Copilot is GPT 5? Not sure what the CLI offers behind the covers.
The CLI can swap to whatever model (/models) based on your subscriptions.
The Copilots on desktop or Office apps are likely just GPT5 nano or other tiny models with cheap inference.
It. is. so. bad.
It feels like it's at least 1-2 years behind the current top models.
Speed without judgement always compounds badly.
https://github.blog/news-insights/company-news/github-copilo...
Claude tokens are priced by GitHub at a disproportionately premium price compared to Gemini and OpenAI. I wonder why?
https://docs.github.com/en/copilot/reference/copilot-billing...
Similarly, companies seem to reward high token usage as a sign of someone willing to play ball with AI, and again have forced higher costs on themselves through people reward hacking or using tokens out of spite.
Fun fact, up until you face a consequence for crime, all crime is free! Have fun and go win the competition game against your co-workers.
Also it became very hard to convince management to keep both Claude code and GitHub Copilot enterprise licenses.
This is a warning to any company not building their own AI that AI-assisted development could become really expensive really fast and most likely won't pay off. What Microsoft is suggesting is that the current price is too high, but it's still not high enough for e.g. Anthropic to be profitable, or AI coding tools are only as good as the developers using them. So you can't meaningfully do layoffs by replacing the developers with AIs, because the cost is too high.
How does Microsoft plan to fix Copilot, so that the cost will be so much lower than Claude that budget overruns won't be a problem for their own customers?
Smaller companies will have departments that distill larger models into something more specifically manageable and useful for them. At least, that's my personal prediction :)
I do think your prediction makes sense, because the AI really isn't the product, it needs to be baked into something and licensing the models saves you the R&D and cost of implementing your own.
In order to do that they'd have to make a concrete business case to justify the headcount and compute costs. They'd be facing the same fundamental economic problems Anthropic, OpenAI, MSFT, etc are facing just at a department level instead of a megacorp level. I hope they try it, sunlight is the best disinfectant.
However, when the pressure is turned up and people have to actually show results--and, like, be accountable--instead of just buying a subscription and externalizing the accountability, I don't think we'll see so much enthusiasm about AI coding. Whether or not an engineer is actually more or less productive with AI (not merely whether they feel more productive) will begin to matter a lot more. I don't see how people continue using AI in this hypothetical small company under those adverse conditions.
There may be a spot of “good enough to pay for and make a profit” that exists.
The frontier model space costs 1000x as much to develop as the small language models, and is only 1.5 years ahead.
Factually, the frontier models have not paid for themselves. So, if you're MSFT and Apple, you don't need to run in a race where even the winner loses massively.
You can try to train models 1.5 years behind that are highly likely to be profitable, given your market position.
The average person is lagging behind what AI is capable of by 3+ years anyway...
So you can save 1000x on training and 10x on inference and just use SOTA small models.
Why spend $5B training a model that's for sure not going to make $5B (after inference costs) when you can spend $5M building one that WILL make far more than that after inference costs?
At one point there were rumours that they'd do that. They also have the rights to OpenAI models for a few more years still, so they could always use that, but apparently they're also compute starved (like anyone else).
There are papers describing KV cache precomputation for commonly used documents (e.g. KVLink), but, of course, it's not a priority for model providers: they'd rather sell you more tokens, also they would rather get to AGI/ASI first than optimize usage of existing models...
Normally KV cache works only if your context prefix is identical, but there are papers which demonstrate documents can be cached between different contexts.
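The "identical prefix" constraint can be illustrated with a toy model in Python. This is only a sketch of ordinary prefix caching, not of KVLink or any provider's implementation; the token IDs and the function name are made up.

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Number of leading tokens whose cached KV entries could be reused."""
    n = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break  # everything after the first mismatch must be recomputed
        n += 1
    return n


document = [11, 12, 13, 14]  # hypothetical token IDs for a shared document
cached_request = [1, 2] + document + [7]

# Same system prompt and document, different question: most of the cache is reusable.
print(reusable_prefix_len(cached_request, [1, 2] + document + [8, 9]))  # 6

# Same document placed after a different preamble: nothing is reusable.
print(reusable_prefix_len(cached_request, [5] + document + [7]))  # 0
```

The second case is the one the papers mentioned above try to address: the document's tokens are identical, but because they sit behind a different prefix, a plain prefix cache cannot reuse them.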
If anything, it's forced dogfooding, i.e., forcing their own workforce to beta-test their product.
Side note, it's so frustrating that The Verge puts a paywall at the fold. It makes me feel like the rest of the story is not worth reading. I'm not inclined to pay $2 to read a link that was posted on an aggregator.
call me a luddite, i'll be wearing it as a badge of honor
At least Codex is trying to win validation on merit.