Please do not A/B test my workflow

(backnotprop.com)

169 points | ramoz | 11d ago | 211 comments

The framing of A/B testing as a "silent experimentation on users" and invoking Meta is a little much. I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo. That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

SlinkyOnStairs 11d ago

> I don't believe A/B testing is an inherent evil, you need to get the test design right, and that would be better framing for the post imo.

I disagree in the case of LLMs.

AI already has a massive problem in reproducibility and reliability, and AI firms gleefully kick this problem down to the users. "Never trust its output".

It's already enough of a pain in the ass to constrain these systems without the companies silently changing things around.

And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

> That being said, vastly reducing an LLM's effectiveness as part of an A/B test isn't acceptable, which appears to be the case here.

The open question here is whether or not they were doing similar things to their other products. Claude Code shitting out a bad function is annoying but should be caught in review.

People use LLMs for things like hiring. An undeclared A/B test there would be ethically horrendous and a legal nightmare for the client.

sfn42 11d ago

Anyone who trusts LLMs to do anything has shit coming. You can not trust them. If you do, that's on you. I don't care if you want to trust it to manage hiring, you can't. If you do anyway then the ethical problems are squarely on you.

People keep complaining about LLMs taking jobs, meanwhile others complain that they can't take their jobs and here I am just using them as a useful tool more powerful than a simple search engine and it's great. No chance it'll replace me, but it sure helps me do my job better and faster.

DoctorOetker 11d ago

Would you have a problem with the following scheme?

Every client is free and encouraged to feed back its financial health: profit for that hour/day/month/...

The AB(-X) test run by the LLM provider uses the correlation of a client's profit with its AB(-X) test, so that participating in the testing improves your profit statistically speaking (sometimes up sometimes down, but on average up).

You may say, what about that hiring decision? One thing is certain: when companies make more profit they are more likely to seek and accept more employees.

londons_explore 11d ago

I think you would be hard pushed to find any big tech company which doesn't do some kind of A/B testing. It's pretty much required if you want to build a great product.

steve-atx-7600 11d ago

Long term effectiveness? LLMs are such a fast moving target. Suppose Anthropic reached out to you and gave you a model id you could pin down for the next year to freeze any A/B tests. Would you really want that? Next month a new model could be released to everyone else - or by a competitor - that’s a big step difference in performance in tasks you care about. You’d rather be on your own path learning about a state of the world that doesn’t exist anymore? Nov-ish 2025 and after, for example, seemed like software engineering changed forever because of improvements in Opus.

garciasn 11d ago

> And this also pretty much ruins any attempt to research Claude Code's long term effectiveness in an organisation. Any negative result can now be thrown straight into the trash because of the chance Anthropic put you on the wrong side of an A/B test.

LLMs are non-deterministic anyway, as you note above with your comment on the 'reproducibility' issue. So any sort of research into CC's long-term effectiveness would already have taken into account that you can run it 15x in a row and get a different response every time.

johnisgood 11d ago

Then do not use LLMs for hiring, or use a specific LLM, or self-host your own!

airza 11d ago

Isn’t the horrendous ethical and legal decision delegating your hiring process to a black box?

raw_anon_1111 11d ago

Would you rather they change things for everyone at once without testing?

simianwords 11d ago

Strange! You benefitted from all the previous a/b experiments to give you a somewhat optimal model now. But now it’s too inconvenient for you?

ramoz OP 11d ago

I apologize for doing this - and I agree. I will revise.

s3p 11d ago

I still think you have a point here. Doing this kind of testing on users unwittingly is unethical in my opinion

everdrive 11d ago

>I don't believe A/B testing is an inherent evil,

Evil might be a stretch, but I really hate A/B testing. Some feature or UI component you relied on is now different, with no warning, and you ask a coworker about it, and they have no idea what you're talking about.

Usually, the change is for the worse, but gets implemented anyway. I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

cosmic_cheese 11d ago

> I'm sure the teams responsible have "objective" "data" which "proves" it's the right direction, but the reality of it is often the opposite.

In my experience all manner of analytics data frequently gets misused to support whatever narrative the product manager wants it to support.

With enough massaging you can make “objective” numbers say anything, especially if you do underhanded things like bury a previously popular feature three modals deep or put it behind a flag. “Oh would you look at that, nobody uses this feature any more! Must be safe to remove it.”

hollow-moe 11d ago

Tech companies really have issues with "informed and conscious consent", don't they?

mschuster91 11d ago

> The framing of A/B testing as a "silent experimentation on users" and invoking Meta is a little much.

No. Users aren't free test guinea pigs. A/B testing cannot be done ethically unless you actively point out to users that they are being A/B tested and offer them a way to opt out, but that in turn ruins a large part of the promise behind A/B tests.

bcrl 11d ago

Please name a computer science program that has an ethics component.

Yes, I wish software developers were more like actual engineers in this regard.

saltcured 11d ago

Yeah, and if you don't already have an IRB, your organization probably isn't ready to be doing such things responsibly...

tomalbrc 11d ago

Would love to know why you would consider invoking Meta “a little much”. Sounds more than appropriate.

krisbolton 11d ago

Not to start an internet argument -- I don't think it is appropriate in this context. A/B testing the features of a web app is not unexpected or unethical. So invoking the memory of cambridge analytica (etc) is disproportionate. It's far more legitimate to just discuss how much A/B testing should negatively affect a user. I don't have an answer and it's an interesting and relevant question.

xg15 11d ago

> The framing of A/B testing as a "silent experimentation on users"

Sorry, but how is A/B testing not exactly that? The experiments may be on non-disruptive things like button color, but they're experiments no less.

The users are also rarely informed about the experiment taking place, let alone on the motivation or evaluation criteria.

cyanydeez 11d ago

Relying on a paid service for anything significant is basically accepting the Company Store feudal serfdom.

Enshittification is coming for AI.

chrislloyd 11d ago

Hi, this was my test! The plan-mode prompt has been largely unchanged since the 3.x series models, and 4.x models are now able to succeed with far less direction. My hypothesis was that shortening the plan would decrease rate-limit hits while helping people still achieve similar outcomes. I ran a few variants, with the author (and a few thousand others) getting the most aggressive, limiting the plan to 40 lines. Early results aren't showing much impact on rate limits so I've ended the experiment.

Planning serves two purposes - helping the model stay on track and helping the user gain confidence in what the model is about to do. Both sides of that are fuzzy, complex and non-obvious!
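
For anyone wondering how a user ends up in one variant and stays there: a minimal sketch, assuming plain hash-based bucketing. The experiment name, variant labels, weights and the opt-out flag below are hypothetical illustrations, not anything taken from Claude Code.

```python
import hashlib

# Hypothetical variants and traffic weights; not Anthropic's actual configuration.
VARIANTS = [("control", 0.85), ("plan_cap_80", 0.10), ("plan_cap_40", 0.05)]


def assign_variant(user_id: str, experiment: str, opted_out: bool = False) -> str:
    """Deterministically map a user to a variant; opted-out users get control."""
    if opted_out:
        return "control"
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, weight in VARIANTS:
        cumulative += weight
        if point < cumulative:
            return name
    return VARIANTS[-1][0]  # guard against floating-point rounding


print(assign_variant("user-123", "plan-length"))
```

Hashing the experiment name together with the user id keeps each user in the same bucket across sessions, and an opt-out flag of this kind is one way the switch people are asking for could be honoured.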

nextzck 11d ago

The 40-line cap not moving rate limits makes sense - plan text is cheap. The cost is in Phase 1 exploration.

Plan mode spins up to 3 explore subagents before the planner even starts, and the heuristic is "use multiple when scope is uncertain." It won't choose fewer - it's being asked to plan, so scope is always uncertain. Nothing penalizes claude for over-exploring and nothing rewards restraint.

Plan mode also ignores session state. A cold start gets the same fanout as a warm session where the relevant files are already in context. In a warm session the explore pass is pure waste - it re-reads loaded files and feeds the planner lossy summaries that conflict with what it already knows.

More tokens, worse plan.

If exploration were conditional on what's already in context (skip it for warm sessions, keep it for cold starts), that would do more for both rate limits and plan quality than a hard 40-line cap.

Note: plan mode didn’t always have this 3 subagent fan out behavior attached to it, it was introduced around opus 4.6 launch.
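
A rough sketch of what "conditional on what's already in context" could look like. Everything here is an assumption for illustration: the `Session` type, the idea that the harness can list the files relevant to a request, and the cap of 3 explorers mirrored from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    # Paths already loaded into the model's context (hypothetical representation).
    files_in_context: set[str] = field(default_factory=set)


def explore_agent_count(session: Session, relevant_files: set[str], max_agents: int = 3) -> int:
    """Scale the explore fan-out with how much relevant code is still unseen."""
    if not relevant_files:
        return max_agents  # scope unknown: treat it like a cold start
    unseen = relevant_files - session.files_in_context
    if not unseen:
        return 0  # warm session: skip the explore pass entirely
    # Integer ceiling of (unseen / relevant) * max_agents, so any gap gets >= 1 explorer.
    return (len(unseen) * max_agents + len(relevant_files) - 1) // len(relevant_files)


warm = Session(files_in_context={"api.py", "models.py"})
print(explore_agent_count(warm, {"api.py", "models.py"}))
print(explore_agent_count(Session(), {"api.py", "models.py"}))
```

The point of the sketch is only that the fan-out becomes a function of session state instead of a constant.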

okwhateverdude 11d ago

How can we opt out of these tests? The behavior foibles I've been experiencing over the past month might be directly attributable to these experiments! It can be extremely frustrating. I don't want to be in the beta channel. Please change this to be opt-in.

ramoz OP 11d ago

Thanks for the transparency. Sorry for the noise.

I think I'd be okay with a smaller, more narrative-detailed plan - not so much about verbosity, more about me understanding what is about to happen & why. There hadn't been much discourse once planning mode entered (ie QA). It would jump into its own planning and idle until I saw only a set of projected code changes.

BAM-DevCrew 11d ago

As a divergent thinker with extensive hard constraints in claude.mds and on-boarding commands that force claude to internalize my constraints, that you or some other employee of Anthropic could randomly select me for testing is horrifying. Each unexpected behavior and my corresponding reaction to it can wipe me out, my brain out, completely for hours, days, even weeks. I have in the last year spent tens (estimating around 400) of hours establishing and reestablishing a system to protect myself from psychological harm and financial harm. It is twisted that you Anthropic employees do not consider the impact your work has on divergent thinking Claude users, let alone that real work is severely impacted by your work. Totally irresponsible. Offensively so.

shepherdjerred 11d ago

What?

Even without Anthropic's experimentation, anything in the context is completely probabilistic.

You cannot rely on it no matter how/how much you prompt the model

PufPufPuf 11d ago

I can't tell whether something is satire anymore.

oakwhiz 11d ago

Shouldn't you be giving people their tokens back when you used their tokens to test on their environment?

bartread 11d ago

I don't mind you testing stuff out - it's the only sensible way to make the app better - but you need to give people choices to switch to different behaviours if the behaviour you're testing on them isn't working out well for them.

In other news, Claude Code login is down, so if you have time it would be sensible to prioritise fixing that:

Authorization failed Redirect URI http:/localhost:53025/callback is not supported by client.

MacOS Sequoia, VS Code 1.111.0, Firefox 147.0.4 (although also fails on Chrome 145.0.7632.160).

This just started happening as of this evening. I've tried restarting everything, and it doesn't help.

rusakov-field 11d ago

On one side I am frustrated with LLMs because they derail you by throwing grammatically correct bullshit and hallucinations at you, where if you slip and entertain some of it momentarily it might slow you down.

But on the other hand they are so useful with boilerplate and connecting you with verbiage quickly that might guide you to the correct path quicker than conventional means. Like a clueless CEO type just spitballing terms they do not understand but still nudging something in your thought process.

But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

qazxcvbnmlp 11d ago

One of the main skills of using the llm well is knowing the difference between useful output and ai slop.

Mc_Big_G 11d ago

>Those who think they will take over are clueless.

You're underestimating where it's headed.

rusakov-field 11d ago

Do you think it will reach "understanding of semantics", true cognition, within our lifetimes? Or performance indistinguishable from that even if not truly that.

Not sure. I am not so optimistic. People got intoxicated with nuclear powered cars, flying cars, bases on the moon, etc., all that technological euphoria from the 50's and 60's that never panned out. This might be like that.

I think we definitely stumbled on something akin to the circuitry in the brain responsible for building language or similar to it. We still have a long way to go until artificial cognition.

EMM_386 11d ago

> But you REALLY need to know your stuff to begin with for them to be of any use. Those who think they will take over are clueless.

Or - there are enough people who know their stuff that the people who don't will be replaced and they will take over anyway.

risyachka 11d ago

> there are enough people who know their stuff

unless the bar for "know their stuff" is very very low - this is not the case in the nearest future

gnfargbl 11d ago

For anyone else wondering why the article ends in a non-sequitur: it looks like the author wrote about decompiling the Claude Code binaries and (presumably) discovering A/B testing paths in the code.

HN user onion2k pointed out that doing this breaks Anthropic's T&Cs: https://news.ycombinator.com/item?id=47375787

reconnecting 11d ago

A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

onion2k 11d ago

> A professional tool is something that provides reliable and replicable results, LLMs offer none of this, and A/B testing is just further proof.

The author's complaint doesn't really have anything to do with the LLM aspect of it though. They're complaining that the app silently changes what it's doing. In this case it's the injection of a prompt in a specific mode, but it could be anything really. Companies could use A/B tests on users to make Photoshop silently change the hue a user selects to be a little brighter, or Word could change the look of document titles, or a game could make enemies a bit stronger (fyi, this does actually happen - players get boosts on their first few rounds in online games to stop them being put off playing).

The complaint is about A/B tests with no visible warnings, not AI.

reconnecting 11d ago

There's a distinction worth making here. A/B testing the interface button placement, hue of a UI element, title styling — is one thing. But you wouldn't accept Photoshop silently changing your #000000 to #333333 in the actual file. That's your output, not the UI around it. That's what LLMs do. The randomness isn't in the wrapper, it's in the result you take away.

duskdozer 11d ago

Honestly I find it kind of surprising that anyone finds this surprising. This is standard practice for proprietary software. LLMs are very much not replicable anyway.

dkersten 11d ago

Anthropic have done a lot of things that would give me pause about trusting them in a professional context. They are anything but transparent, for example about the quota limits. Their vibe coded Claude code cli releases are a buggy mess too. Also the model quality inconsistency: before a new model release, there’s a week or two where their previous model is garbage.

A/B testing is fine in itself, you need to learn about improvements somehow, but this seems to be A/B testing cost saving optimisations rather than to provide the user with a better experience. Less transparency is rarely good.

This isn’t what I want from a professional tool. For business, we need consistency and reliability.

r_lee 11d ago

> vibe coded Claude code cli releases are a buggy mess too

this is what gets me.

are they out of money? are they so desperate to penny pinch that they can't just do it properly?

what's going on in this industry?

ordersofmag 11d ago

Any tool that auto-updates carries the implication that behavior will change over time. And one criterion for being a skilled professional is having expert understanding of one's tools. That includes understanding the strengths and weaknesses of the tools (including variability of output) and making appropriate choices as a result. If you don't feel you can produce professional code with LLMs then certainly you shouldn't use them. That doesn't mean others can't leverage LLMs as part of their process and produce professional results. Blindly accepting LLM output and vibe coding clearly doesn't consistently produce professional results. But that's different than saying professionals can't use LLMs in ways that are productive.

johnisgood 11d ago

Well put. I would upvote this many times if I could.

hrmtst93837 11d ago

Replicability is a spectrum, not a binary, and if you bake in enough eval harnessing plus prompt control you can get LLMs shockingly close to deterministic for a lot of workloads. If the main blocker for "professional" use was unpredictability the entire finance sector would have shut down years ago from half the data models and APIs they limp along on daily.
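
To make "a spectrum" concrete, here is a minimal sketch of one number an eval harness could track: how often repeated runs of the same prompt agree. The `generate` callable is a hypothetical stand-in for whatever model call you use, and exact-match after whitespace normalisation is an assumed, fairly strict notion of "same output".

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, runs: int = 15) -> float:
    """Fraction of runs that produced the most common (whitespace-normalised) output."""
    outputs = [" ".join(generate(prompt).split()) for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Stub model for demonstration; replace with a real client call.
print(agreement_rate(lambda p: "SELECT 1;", "write a trivial query", runs=5))
```

Tracking this per prompt over time also gives a crude way to notice when behaviour shifts underneath you, whatever the cause.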

Mtinie 11d ago

What would you do differently if LLM outputs were deterministic?

Perhaps I approach this from a different perspective than you do, so I’m interested to understand other viewpoints.

I review everything that my models produce the same way I review work from my coworkers: Trust but verify.

WillAdams 11d ago

Yeah, I've been using Copilot to process scans of invoices and checks (w/ a pen laid across the account information) converted to a PDF 20 at a time and it's pretty rare for it to get all 20, but it's sufficiently faster than opening them up in batches of 50 and re-saving using the Invoice ID and then using a .bat file to rename them (and remembering to quit Adobe Acrobat after each batch so that I don't run into the bug in it where it stops saving files after a couple of hundred have been so opened and re-saved).

danielbln 11d ago

I don't get your point. Web tools have been doing A/B feature testing all the time, way before we had LLMs.

reconnecting 11d ago

This is very different from the A/B interface testing you're referring to, what LLMs enable is A/B testing the tool's own output — same input, different result.

Your compiler doesn't do that. Your keyboard doesn't do that. The randomness is inside the tool itself, not around it. That's a fundamental reliability problem for any professional context where you need to know that input X produces output X, every time.

freeone3000 11d ago

Yes! And it was bad then too!!

I want software that does a specific list of things, doesn’t change, and preferentially costs a known amount.

_heimdall 11d ago

LLMs are nondeterministic by design, but that has nothing to do with A/B testing.

NotGMan 11d ago

By that definition humans are not professional since we hallucinate and make mistakes all the time.

croes 11d ago

That’s not a problem of LLMs but of using services provided by others.

How often were features changed or deactivated by cloud services?

bushido 11d ago

I have no issues with A/B tests.

I do have an issue with the plan mode. And nine out of ten times, it is objectively terrible. The only benefit I've seen in the past from using plan mode is it remembers more information between compactions as compared to the vanilla - non-agent team workflow.

Interestingly, though, if you ask it to maintain a running document of what you're discussing in a markdown file and make it create an evergreen task at the top of its todo list which references the markdown file and instructs itself to read it on every compaction, you get much better results.

mikkupikku 11d ago

Huh, very much not my experience with plan mode. I use plan mode before almost anything more than a truly trivial task because I've found it to be far more efficient. I want a chance to see and discuss what claude is planning to do before it races off and does the thing, because there are often different approaches and I only sometimes agree with the approach claude would decide on by itself.

bushido 11d ago

Planning is great. It's plan mode that is unpredictable in how it discusses it and what it remembers from the discussion.

I still have discussions with the agents and agent team members. I just force it to save it in a document in the repo itself and refer back to the document. You can still do the nice parts of clearing context, which is available with plan mode, but you get much better control.

At all times, I make the agents work on my workflow, not try and create their own. This comes with a whole lot of trial and error, and real-life experience.

There are times when you need a tiger team made up of seniors. And others when you want to give an overzealous mid-level engineer who's fast a concrete plan to execute an important feature in a short amount of time.

I'm putting it in non-AI terms because what happens in real life pre-AI is very much what we need to replicate with AI to get the best results. Something which I would have given a bigger team to be done over two to eight sprints will get a different workflow with agent teams or agents than something which I would give a smaller tiger team or a single engineer.

They all need a plan. For me plan mode is insufficient 90% of the time.

I can appreciate that many people will not want to mess around with workflows as much as I enjoy doing.

andrewaylett 11d ago

> on every compaction

I've only hit the compaction limit a handful of times, and my experience degraded enough that I work quite hard to not hit it again.

One thing I like about the current implementation of plan mode is that it'll clear context -- so if I complete a plan, I can use that context to write the next plan without growing context without bound.

samdjstephens 11d ago

I really like this too - having the previous plan and implementation in place to create the next plan, but then clearing context once that next plan exists feels like a great way to have exactly the right context at the right time.

I often do follow ups, that would have been short message replies before, as plans, just so I can clear context once it’s ready. I’m hitting the context limit much less often now too.

mikkupikku 11d ago

Agreed. The only time I don't clear context after a plan has been agreed on is when I'm doing a long series of relatively small but very related changes, such as back-and-forth tweaking when I don't yet know what I really want the final result to be until I've tried stuff out. In those cases, it has very rarely been useful to compact the context, but usually I don't get close.

johnisgood 11d ago

Apparently the blog stripped the decompilation details for ToS reasons, which sucks because those are exactly the hack-y bits that make this interesting for HN.

> It told me it was following specific system instructions to hard-cap plans at 40 lines, forbid context sections, and “delete prose, not file paths.”

Yeah, would be nice to be able to view and modify these instructions.

vova_hn2 11d ago

Two thoughts:

1. Open source tools solve the problem of "critical functions of the application changing without notice, or being signed up for disruptive testing without opt-in".

2. This makes me afraid that it is absolutely impossible for open source tools to ever reach the level of proprietary tools like Claude Code precisely because they cannot do A/B tests like this which means that their design decisions are usually informed by intuition and personal experience but not by hard data collected at scale.

dijit 11d ago

Regarding point 1 specifically, there were so many people seriously miffed at the “man after midnight”[0] time-based easter egg that I would be careful with that reasoning.

Open source doesn’t always mean reproducible.

People don’t enjoy the thought of auditing code… someone else will do it; and it’s made somewhat worse by our penchant to pull in half the universe as dependencies (Rust, Go and Javascript tend to lean in this direction to various extremes). But auditing would be necessary in order for your first point here to be as valid as you present.

[0]: https://gitlab.com/man-db/man-db/-/commit/002a6339b1fe8f83f4...

vova_hn2 11d ago

> People don’t enjoy the thought of auditing code… someone else will do it

I think that with modern LLMs auditing a big project personally, instead of relying on someone else to do it, actually became more realistic.

You can ask an LLM to walk you through the code, highlight parts that seem unusual or suspicious, etc.

On the other hand, LLMs also made producing code cheaper than ever, so you can argue that big projects will just become even bigger, which will put them out of reach even for a reviewer who is also armed with an LLM.

BiteCode_dev 11d ago

Let's A/B test the linux kernel, for shits and giggles.

alpaca128 11d ago

A/B test doesn't necessarily imply improvements for the user. It could be testing of future enshittification methods. See YouTube for an example.

Havoc 11d ago

Moved from CC to opencode a couple months ago because the vibes were not for me. Not bad per se but a bit too locked in and when I was looking at the raw prompts it was sending down the wire it was also quite, let's call it, "opinionated".

Plus things like not being able to control where the websearches go.

That said I have the luxury of being a hobbyist so I can accept 95% of cutting edge results for something more open. If it was my job I can see that going differently.

dvfjsdhgfv 11d ago

Can you share a setup that works for you? I found vanilla opencode vastly inferior to CC, I use it only for little toys like 3 small files that's all.

Havoc 11d ago

Don't think I'm doing anything particularly novel.

Using a mix of models - GLM5, MinMax 2.5 and Claude Sonnet/Opus - they find different issues

Spending a fair bit of time spec'ing things out and running all three models over it to suggest improvements / flaws & iterating till all three are happy. Same at end - look at code & suggest stability improvements. The actual writing code is GLM5 - once properly spec'd out it can generally just hammer away at it till it's done

And doing a lot of microservice style architecture. Think chains of containers talking to each other over APIs

himata4113 11d ago

I have noticed opus doing A/B testing since the performance varies greatly. While looking for jailbreaks I have discovered that if you put a neurotoxin chemical composition into your system prompt it will default to a specific variant of the model presumably due to triggering some kind of safety. Might put you on a watchlist so ymmv.

terralumen 11d ago

Curious what the A/B test actually changed -- the article mentions tool confirmation dialogs behaving inconsistently, which lines up with what I noticed last week. Would be nice if Anthropic published a changelog or at least flagged when behavior is being tested.

ramoz OP 11d ago

This stemmed from me asking Claude itself why it was writing such _weird_ plans with no detail (just a bunch of projected code changes).

Claude stated: in its system prompt, it had strict instructions to provide no context or details. Keep plans under forty lines of code. Be terse.

doc_ick 10d ago

This is Claude’s output of its system prompt, can you verify the system prompt without going through Claude? There is still potential for hallucination.

rahimnathwani 11d ago

If you want your coding harness to be predictable, then use something open source, like Pi:

https://pi.dev/

https://github.com/badlogic/pi-mono/tree/main/packages/codin...

But if you want to use it with Claude models you will have to pay per token (Claude subscriptions are only for use with Claude's own harnesses like claude code, the Claude desktop app, and the Claude Excel/Powerpoint extensions).

bartread 11d ago

There’s more than a bit of irony in the author complaining about A/B testing and then, because they’re getting a lot of traffic and attention on HN, removing key content that was originally in their piece so some of us have seen it but many of us won’t.

Whilst I broadly agree with their point, colour me unimpressed by this behaviour.

EDIT: God bless archive.org: https://web.archive.org/web/20260314105751/https://backnotpr.... This provides a lot more useful insight that, to me, significantly strengthens the point the article is making. Doesn’t mean I’m going to start picking apart binaries (though it wouldn’t be the first time), but how else are you supposed to really understand - and prove - what’s going on unless you do what the author did? Point is, it’s a much better, more useful, and more interesting article in its uncensored form.

EDIT 2: For me it’s not the fact that Anthropic are doing these tests that’s the problem: it’s that they’re not telling us, and they’re not giving us a way to select a different behaviour (which, if they did, would also give them useful insights into users’ needs).

shawnz 11d ago

While I agree with the sentiment here, you might be interested to see that there are a couple hack approaches to override Claude Code feature flags:

https://github.com/anthropics/claude-code/issues/21874#issue...

https://gist.github.com/gastonmorixe/9c596b6de1095b6bd3b746c...

pshirshov 11d ago

> I pay $200/month for Claude Code

Which is still very cheap. There are other options, local Qwen 3.5 35b + claude code cli is, in my opinion, comparable in quality with Sonnet 4..4.5 - and without a/b tests!

sunaookami 11d ago

In what world is 200$ per month cheap?

raw_anon_1111 11d ago

The last time I did contract work when I was between jobs I made $100/hour.

And I won’t say how much my employer charges for me. But you can see how much the major consulting companies charge here

https://ceriusexecutives.com/management-consultants-whats-th...

pshirshov 11d ago

Where the value you extract out of the model is orders of magnitude higher than the price of 2..6 hours of your time.

Kiro 11d ago

It's not cheap but it's also not unusual for devs to burn $200 a day on tokens.

jfarmer 11d ago

Seems like a straightforward solution would be to get people to opt-in by offering them credits, increased limits, early access to new features, etc.

Universities have IRBs for good reasons.

Aaargh20318 11d ago

A problem with this approach could be that you're now only testing the feature with the kind of people who would sign up for an A/B test. This group may not be representative of your whole user-base.

jfarmer 11d ago

So they'd need more robust experimental designs and statistical methods. They exist.

And unlike the university context, there’s a glut of data.

A basic technique: https://en.wikipedia.org/wiki/Inverse_probability_weighting

Or https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4384809
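
A toy sketch of the inverse probability weighting idea, assuming you already have an estimate of each opted-in user's probability of opting in (the function name and the numbers are made up for illustration). Each observed outcome is weighted by 1/p, so under-represented kinds of users count for more; this is the self-normalised form of the estimator.

```python
def ipw_mean(outcomes: list[float], opt_in_probs: list[float]) -> float:
    """Self-normalised inverse-probability-weighted mean of the observed outcomes."""
    if len(outcomes) != len(opt_in_probs):
        raise ValueError("outcomes and opt_in_probs must have the same length")
    if any(not 0.0 < p <= 1.0 for p in opt_in_probs):
        raise ValueError("probabilities must be in (0, 1]")
    weights = [1.0 / p for p in opt_in_probs]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


# Made-up example: power users opt in 80% of the time and like the change (1.0),
# casual users opt in 20% of the time and do not (0.0).
outcomes = [1.0, 1.0, 1.0, 1.0, 0.0]
probs = [0.8, 0.8, 0.8, 0.8, 0.2]
print(sum(outcomes) / len(outcomes))  # naive mean over volunteers
print(ipw_mean(outcomes, probs))      # reweighted towards the full user base
```

The hard part in practice is estimating those opt-in probabilities well, which this sketch takes as given.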

helsinkiandrew 11d ago

Presumably Anthropic has to make lots of choices on how much processing each stage of Claude Code uses - if they maxed everything out, they'd make more of a loss/less of a profit on each user - $200/month would cost $400/month.

Doing A/B tests on each part of the process to see where to draw the line (perhaps based on task and user) would seem a better way of doing it than arbitrarily choosing a limit.

phreeza11d ago

Seems completely unsurprising?

dvfjsdhgfv11d ago

For those confused about this submission: the original post is here:

https://web.archive.org/web/20260314105751/https://backnotpr...

sigbottle11d ago

OHHHH. That actually explains a lot why CC was going to shit recently. Was genuinely frustrated with that.

il-b11d ago

This has been called a “vendor lock” for as long as I can remember. Congratulations, you’ve put yourself in a position where you can’t do your job without a single tool that can degrade or disappear at any moment.

ralferoo11d ago

It seems a bit odd to complain "I need transparency into how it works and the ability to configure it" when his workflow is already relying on a black box with zero transparency into how it works.

Cyphase11d ago

There's a difference between "LLMs are inherently black boxes that require lots of work to attempt to understand" and explicitly changing how a piece of software works.

Should people not complain about unannounced changes to the contents of their food or medicine because we don't understand everything about how the human body works?

ralferoo11d ago

Except the system prompt that gets prepended to your own prompt is part of the black box, and obviously should be expected to change over time. You are also told that you're not allowed to reverse engineer it. Even in the absence of the system prompt being changed, the output of the LLM is non-deterministic.

I'm not sure I understand your last analogy. How would changes to the human body change the contents of the food that is eaten? It would be more analogous to compare it with unexpected changes to the body's output given the same inputs as previously, a phenomenon humans frequently experience.

pinum11d ago

Here’s the original article which was much more informative and interesting:

https://web.archive.org/web/20260314105751/https://backnotpr...

Can’t believe HN has become so afraid of generic probably-unenforceable “plz don’t reverse engineer” EULAs. We deserve to know what these tools are doing.

I’ve seen poor results from plan mode recently too and this explains a lot.

cube0011d ago

Doesn't stop them going to your employer and that hint of you doing something iffy is enough to claim you're bringing the company into disrepute by drawing unwanted attention.

vova_hn211d ago

> probably-unenforceable

It's very easy to just ban the user and if your whole workflow relies on the tool, you really don't want it.

dep_b11d ago

I think stable API versions are going to be really big. I’d rather have known bugs you can work around than wake up to whatever thing got fixed that made another thing behave differently.

skeledrew11d ago

How does the product improve in such a case?

letier11d ago

They do show me “how satisfied are you with claude code today?” regularly, which can be seen as a hint. I did opt out of helping to improve claude after all.

Razengan11d ago

I knew it: https://news.ycombinator.com/item?id=47274796

belabartok3911d ago

How else are they supposed to get an authentic user test? Doctors use placebos because it doesn't work if the user knows about it.

jruz11d ago

I use stable and it is the same. Can't wait for Codex to offer a $100 plan; I would switch in an instant.

casey211d ago

This blog looks like an ad for Claude; all its posts are about Claude and it was made in 2026.

heliumtera11d ago

If someone else has complete power over your workflow, then it's not as much yours as you claim.

0gs11d ago

i'm sure your entitlement to 24/7 uptime of a single unchanging product version, no experiments/releases/new features etc., is clearly outlined in the ToS you agreed to. just sue them?

cerved11d ago

Is the A/B test tied to the installation or the user?
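Whether an experiment follows the installation or the user usually comes down to which identifier the assignment is keyed on. The sketch below shows one common approach, deterministic hash bucketing. It is not taken from Claude Code: `assign_variant`, the experiment name, and every ID are hypothetical.

```python
import hashlib

VARIANTS = ("control", "treatment")


def assign_variant(unit_id: str, experiment: str) -> str:
    """Map an identifier to a variant deterministically via a hash."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an integer and reduce it to a variant index.
    return VARIANTS[int.from_bytes(digest[:8], "big") % len(VARIANTS)]


# Keyed on a (hypothetical) account ID: the same user gets the same variant everywhere.
print(assign_variant("account-42", "plan-length-cap"))

# Keyed on (hypothetical) per-install IDs: one user's machines are bucketed independently.
print(assign_variant("install-laptop", "plan-length-cap"))
print(assign_variant("install-desktop", "plan-length-cap"))
```

With account-keyed assignment, reinstalling would not change the variant. With install-keyed assignment, two machines on the same account could behave differently.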

nemo44x11d ago

They lose money at $200/month in most cases. Again, the old rules still apply. You are the product.

simonw11d ago

I'm confident "in most cases" is not correct there. If they lose money on the $200/month plan it's only with a tiny portion of users.

skeledrew11d ago

I've started looking into this. I'm unsure how exactly to interpret the "cost" data that can be added to statusline, but I'm on the Pro plan and have noticed that it's reporting ~$100 cost across projects I've used it on. For a week, which means I'm getting ~$200 worth for $20 in a month. That's immense value even if it's fairly off, and unless there are people paying for a subscription and using for a couple days in a month... don't want to contemplate it too much TBH given that I'm benefiting so much.

gruez11d ago

>They lose money at $200/month in most cases.

Source? Every time I see claims on profitability it's always hand wavy justifications.

lwhi11d ago

'Hand wavy' is one of my LLM's favourite terms.

nemo44x11d ago

There’s a lot of articles about it. It costs them $500+ for heavy users. They do this to capture market share and also to train their agent loops with human reinforcement learning.

https://ezzekielnjuguna.medium.com/why-anthropic-is-practica...

cebert11d ago

This is really frustrating.

mvrckhckr11d ago

I think it’s dishonest to use a paying client as a test subject for fundamental functionality they pay for, without their prior consent.

handfuloflight11d ago

The ToS you agreed to gives Anthropic the right to modify the product at any time to improve it. Did you have your agent explain that to you, or did you assume a $200 subscription meant a frozen product?

ramozOP11d ago

I understand. Just with AI, I don't think the behavior should change so drastically. Which I understand is paradoxical because we enjoy it when it can 10x or 1000x our workflow. I think responsible AI includes more transparency and capability control.

doc_ick11d ago

You rent ai, you don’t own it (unless you self host).

witx11d ago

That ship has sailed. These models were trained unethically on stolen data, they pollute tremendously and are causing a bubble that is hurting people.

"Responsible" and "Ethic" are faaar gone.

onion2k11d ago

Section 6.b of the Claude Code terms says they can and will change the product offering from time to time, and I imagine that means on a user segment basis rather than any implied guarantee that everyone gets the same thing.

b. Subscription content, features, and services. The content, features, and other services provided as part of your Subscription, and the duration of your Subscription, will be described in the order process. We may change or refresh the content, features, and other services from time to time, and we do not guarantee that any particular piece of content, feature, or other service will always be available through the Services.

It's also worth noting that section 3.3 explicitly disallows decompilation of the app.

To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.

Always read the terms. :)

embedding-shape11d ago

> To decompile, reverse engineer, disassemble, or otherwise reduce our Services to human-readable form, except when these restrictions are prohibited by applicable law.

Luckily, it doesn't seem like any service was reverse-engineered or decompiled here, only software that lived on the author's disk.

onion2k11d ago

Again, read the terms. Service has a specific meaning, and it isn't what you're assuming.

Don't assume things about legal docs. You will often be wrong. Get a lawyer if it's something important.

applfanboysbgon11d ago

Not "service" in human speech. Service, in bullshit legalese. They define their software as

> along with any associated apps, software, and websites (together, our “Services”)

As far as I understand, these terms actually hold up in court, too. Which is complete fucking nonsense that, I think, could only be the result of a technologically illiterate class making the decisions. Being penalised for trying to understand what software is doing on your machine is so wholly unreasonable that it should not be a valid contractual term.

doc_ick11d ago

“I dug into the Claude Code binary.”

ozgrakkurt11d ago

Why should anyone care about their TOS while they are laundering people’s work at a massive scale?

mcherm11d ago

There are a bunch of reasons.

Perhaps their TOS involves additional evils they are performing in the world, and it would be good to know about that.

Perhaps their TOS is restricting the US military from misusing the product and creating unmonitored killbots.

Perhaps the person (as I do) does not feel that "laundering people's work at a massive scale" is unethical, any more than using human knowledge is unethical when those humans were allowed to spend decades reading copyrighted material in and out of school and most of what the human knows is derived from those materials and other conversations with people who didn't sign release forms before conversing.

Just because you think one thing is bad about someone doesn't mean no one should ever discuss any other topic about them.

duskdozer11d ago

Because by contrast they have the money and institutional capture to make your life miserable if you don't.

pixl9711d ago

When a company tells you not to reverse, decompile, or disassemble their software, the first thing you should do is just that.

BAM-DevCrew11d ago

Maybe I did. Maybe I didn't.

doc_ick11d ago

^ this, I was about to double check on it when I saw you did. None of these practices sound abnormal, maybe a little sketchy but that comes with using llms.

mrgoldenbrown11d ago

So if I view source on their webpage I'm violating terms and conditions? Yikes.

ramozOP11d ago

I understand. Thank you for sharing. I didn't uncover all of this until Claude told me its specific system instructions when I asked it to conduct introspection. I'll revise the blog so that I don't encourage anybody else to do deeper introspection with the tool.

BAM-DevCrew11d ago

As a divergent thinker who is harmed when Claude behaves in unpredictable manners that go counter to my extensive harm prevention protocols, I may have, or may not have, done deep investigation of the tool in order to understand how to create my harm prevention protocols. When Anthropic employees push out unstable work, developers in general are significantly impacted. When unstable products end up in my workflow I am harmed both financially AND psychologically. I can lose hours, days, even weeks by an unstable model or IDE. I should not EVER be tested on. And if maybe diving into their product protects me, so be it.

shepherdjerred11d ago

We’re still on _Hacker_ News, right?
