Saving money on tokens isn't something that's rewarded during performance reviews, particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
Churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game-theoretic dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
So it’s a pragmatic metric that got a bit Goodharted.
% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?
Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality it may have churned on 50% of the lines it generated and auto-accepted first.
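To put rough numbers on that gap, here is a minimal sketch. The function names and figures are made up for illustration and are not taken from any vendor's reporting; the point is only that with auto-accept on, the accept rate is 100% by construction, while the share of generated lines that survive follow-up prompts can be much lower.

```python
def accept_rate(suggested: int, accepted: int) -> float:
    """Share of suggestions accepted, as a dashboard would report it."""
    return accepted / suggested if suggested else 0.0


def survival_rate(generated_lines: int, rewritten_lines: int) -> float:
    """Share of generated lines still standing after follow-up edits."""
    if generated_lines == 0:
        return 0.0
    return (generated_lines - rewritten_lines) / generated_lines


# Hypothetical session: auto-accept is on, so every suggestion counts as
# "accepted", but half of the generated lines get rewritten by later prompts.
reported = accept_rate(suggested=40, accepted=40)
surviving = survival_rate(generated_lines=1000, rewritten_lines=500)
print(f"accept rate: {reported:.0%}, lines surviving: {surviving:.0%}")
```

The two numbers answer different questions, and only the second says anything about how much of the output was kept.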
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
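A policy like that could be sketched roughly as below. The user names, token counts, and the `users_to_review` helper are all hypothetical; the only thing taken from the comment is the idea of flagging the top 10% for a human review rather than penalizing anyone automatically.

```python
import math


def users_to_review(tokens_by_user: dict[str, int], top_fraction: float = 0.10) -> list[str]:
    """Return the heaviest token users, to be looked at by a human reviewer."""
    if not tokens_by_user:
        return []
    # Always review at least one person, otherwise small teams are never checked.
    count = max(1, math.ceil(len(tokens_by_user) * top_fraction))
    ranked = sorted(tokens_by_user, key=tokens_by_user.get, reverse=True)
    return ranked[:count]


# Hypothetical monthly token counts for a ten-person team.
usage = {
    "ana": 9_000_000, "bo": 1_200_000, "cy": 1_100_000, "di": 950_000,
    "eli": 900_000, "fay": 870_000, "gus": 800_000, "hal": 640_000,
    "ivy": 500_000, "jo": 120_000,
}
print(users_to_review(usage))
```

The output is a list of people whose usage gets read by someone, which is the "review" half of the policy; whether the usage was justifiable is still a human call.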
this has to be the worst metric.
anytime the llm wants me to read a diff of one file, im just gonna send it forward so i can read the whole diff
It is oddly unprecedented economic behavior.
We may be on the cusp of the AI age's new era of 'measure twice, cut once'.
When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
""" Steve Ballmer: In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand lines of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS/2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less K-LOC. K-LOCs, K-LOCs, that's the methodology. Ugh, anyway, that always makes my back just crinkle up at the thought of the whole thing. """
And the tragedy is that this isn't sustainable, and all of us deeply involved in tech know this. There is eventually going to be a big reality check the companies will have to pay, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us, at least for now and for the foreseeable future. However, when the rope eventually snaps, these executives will at best fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones that will suffer through the next big layoffs.
They call themselves "risk takers" to justify their high pay.
> the companies will have to pay, because you can't force creativity and quality
Most companies do not care about quality.
_users_ who have to interact with that software will pay the price. Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some MCP servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I'd not be sad about the job itself, but about the dev who had a mortgage to pay and is now replaced by a machine churning out crap code while their superiors get sore from patting themselves on the back.
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular older Claude models than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller, answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the more the model is clever rather than clear and simple, the worse the outcome, the more that the model turns into an architecture astronaut, the worse the outcome.
Only cost for effective* outcome matters. And if your lines of code have a cost, you would want fewer lines of code to achieve the outcome, not more.
Are you sure you disagree with that?
* If your place of work starts talking "efficiency"**, run. Find somewhere the conversation is *effectiveness* — at the goal/outcome level.
** Not to mention that "efficiencies" is MBA speak for "right sizing" away effectiveness.
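As a rough sketch of "cost for effective outcome", with entirely hypothetical prices, rates, and task data: the total spend counts both the token bill and the human time, and it is divided by outcomes actually achieved. Diff size is recorded but deliberately left out of the metric.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    token_cost_usd: float  # model bill for the task
    human_hours: float     # time spent prompting, reviewing, fixing
    lines_changed: int     # size of the diff; not used in the metric
    outcome_met: bool      # did it actually achieve the goal?


def cost_per_outcome(attempts: list[Attempt], hourly_rate_usd: float) -> float:
    """Total spend (tokens plus human time) divided by outcomes achieved."""
    total = sum(a.token_cost_usd + a.human_hours * hourly_rate_usd for a in attempts)
    achieved = sum(a.outcome_met for a in attempts)
    return total / achieved if achieved else float("inf")


# Hypothetical numbers: a small diff that worked and a large one that did not.
attempts = [
    Attempt(token_cost_usd=2.0, human_hours=1.0, lines_changed=40, outcome_met=True),
    Attempt(token_cost_usd=12.0, human_hours=3.0, lines_changed=900, outcome_met=False),
]
print(cost_per_outcome(attempts, hourly_rate_usd=100.0))
```

Under this framing the failed 900-line attempt still costs money but adds nothing to the denominator, so more lines only ever hurt unless they buy an outcome.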
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
Exactly.
No more than sitting down and writing code before a product concept or spec or architecture comes out right the first time, or fifth.
Absorb the concept, make a shape of outcome, then a spec, then hold its hand to architect a series of iterations, either component by component or thin vertical slice or whatever combination lets you iterate in working increments...
Your brain, machine leverage. After all, it types faster than you. But it should type what you want.
You know what it should type, right? If you don't, you're gonna have a bad time anyway.
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
Consumer sentiment is in the gutter, certainly. But objective measures of the economy like unemployment and real wages look good to excellent.
Curious what industry that is.
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
I don't buy it. Old models such as GPT-4.1 were faster than newer reasoning models, and their output was as good. Newer models end up spending an ungodly amount of time on chain-of-thought steps, which can be a complete waste if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.