https://gist.github.com/rlueder/a3e7b1eb40d90c29f587a4a8cb7c...
An average of $0.04/review (200+ PRs with roughly two rounds each), for a total of $19.50 using Opus 4.6 over February.
It fills in a gap of working on a solo project and not having another set of eyes to look at changes.
So the take would be that 84% of heavily Claude-driven PRs are riddled with ~7.5 bug-worthy issues each.
Not a great ad for the quality of agent-based development.
It will _always_ find about 8 issues. The number doesn’t change, but it gets a bit … weird if it can’t really find a defect. Part of the art of using the tool is recognizing this is happening, and understanding it’s scraping the bottom of its barrel.
However, if there _are_ defects, it’s quite good at finding and surfacing them prominently.
For comparison, Greptile charges $30 per month for 50 reviews, with $1 per additional review.
At an average of $15–25 per review, this is way more expensive.
I run a PR review via Claude on my own code before I push. It’s exceptionally good. $20 becomes an incredibly easy sell when I can review a PR in 10 minutes instead of an hour.
Also, the examples are weird IMO. Unless it was an edge/corner case, the authentication bug would be caught by even a smoke test. And for the ZFS encryption refactor, I'd expect a statically typed language to catch type errors unless they're casting from `void*` or something. Seems like they picked examples by how important/newsworthy the areas were rather than by the technicality of the finds.
Product A: "AI looks at your diff and leaves comments." This is getting commoditized fast. GitHub Copilot does it, Anthropic just launched it, and every startup in the space does some version of it. The value ceiling is low because it's fundamentally a linting++ problem with better language models.
Product B: "AI-native governance for when AI agents write production code." This is a different problem. When 41% of commits are AI-assisted and Dependabot, Renovate, Copilot, Cursor, Claude Code, and Devin are all opening PRs on the same repo, the question isn't "are there bugs in this diff." It's: what's the risk profile of this change? Which agent wrote it and do we trust that agent? Should this auto-merge or need human review? What's the approval workflow?
The bubble is in Product A. Product B barely exists yet, and it's where the actual value is for engineering teams trying to ship at AI speed without losing governance. The Jane Street example in this thread is interesting - their model works because they have an extremely homogeneous engineering culture. Most teams don't. They need policy-driven automation that encodes their standards.
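No such tool exists off the shelf yet, but the kind of policy-driven automation described above can be sketched as a rules engine over PR metadata. Everything here is hypothetical (field names, trust threshold, path prefixes); it just illustrates "which agent wrote it, do we trust it, should it auto-merge":

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    author: str            # e.g. "dependabot", "claude-code", "human:alice"
    paths: list[str]       # files touched by the diff
    agent_trust: float     # historical merge-without-incident rate, 0..1

# Hypothetical policy: anything touching these paths always escalates.
HIGH_RISK_PREFIXES = ("auth/", "payments/", "infra/")

def decide(pr: PullRequest) -> str:
    """Route a PR to auto-merge, human review, or escalation."""
    if any(p.startswith(HIGH_RISK_PREFIXES) for p in pr.paths):
        return "escalate"                      # a human, regardless of author
    if pr.author.startswith("human:"):
        return "human-review"                  # keep the existing review culture
    # Trusted agents on low-risk paths can skip straight to merge.
    return "auto-merge" if pr.agent_trust >= 0.95 else "human-review"

print(decide(PullRequest("dependabot", ["deps/lockfile"], 0.99)))   # auto-merge
print(decide(PullRequest("claude-code", ["auth/login.py"], 0.99)))  # escalate
```

The point of the sketch is that the decision keys off change risk and agent track record, not off the contents of the diff; the diff-level bug hunt (Product A) plugs in as just one input to a policy like this.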
Setup: 30 artifacts (code, docs, scripts), 150 injected errors, 4 review conditions, 360 total reviews using Claude Opus 4.6.
Results:
- Cross-Context Review (artifact only, no production history): F1 28.6%
- Same-session self-review: F1 24.6% (p=0.008 vs CCR)
- Same-session repeated review (SR2): F1 21.7%
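For readers unfamiliar with the metric: F1 here is the harmonic mean of precision and recall over the injected errors. A minimal sketch of the computation; the counts below are hypothetical, not from the study:

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for error-detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a review run flags 60 findings, 30 of which match
# injected errors, against 150 injected errors total.
tp, fp, fn = 30, 30, 150 - 30
print(round(f1_score(tp, fp, fn) * 100, 1))  # → 28.6
```

Note that flagging more findings only raises F1 if the extra findings are real hits; otherwise false positives drag precision down, which is exactly the "more noise, not more signal" failure mode described below.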
The SR2 result is the key finding — reviewing twice in the same session doesn't help (p=0.11 vs single review). The model generates more noise, not more signal. This rules out "two looks are better than one" as an explanation. It's the context separation itself that matters.
The gap is widest on critical errors: 40% detection for CCR vs 29% for same-session review.
Mechanism: production context introduces anchoring bias + sycophancy + context rot. A fresh session eliminates all three simultaneously by removing the conditioning tokens.
What Anthropic is doing here — dispatching independent agents that never saw the production context — is essentially this principle at industrial scale. I'm working on a paper on this, but it's not published yet.
"Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions"
If you're a big shop pushing, say, 2,000 PRs a week and reviews average $15–25, that's on the order of $30k–$50k a week in AI review spend, or roughly $1.5–2.6M a year. That is quite a line item to justify.
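The back-of-the-envelope math above, spelled out (the PR volume and per-review prices are the hypotheticals from the comment):

```python
prs_per_week = 2_000
cost_low, cost_high = 15, 25  # $/review, the quoted price range

weekly_low, weekly_high = prs_per_week * cost_low, prs_per_week * cost_high
annual_low, annual_high = weekly_low * 52, weekly_high * 52

print(f"${weekly_low:,}-${weekly_high:,}/week")                    # $30,000-$50,000/week
print(f"${annual_low / 1e6:.1f}M-${annual_high / 1e6:.1f}M/year")  # $1.6M-$2.6M/year
```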
"It's $20 cheaper than a senior engineer’s hourly rate,"... so what are you actually doing with your human reviewers once you add this on?
If you keep your existing review culture and just bolt this on, then you've effectively said "we’re willing to add $1–2M+ a year to the budget." That might be fine, but then you should be able to point to fewer incidents, shorter lead times, higher coverage, something like that.
Either this is a replacement story (fewer humans, different risk profile) or it's an augmentation story (same humans, bigger bill, hopefully better outcomes). "It’s cheaper than a great engineer" by itself skips over the fact that, at scale, you’re stacking this cost on top of the engineers you already have in the org.
* https://github.com/anthropics/claude-plugins-official/tree/m...
* https://github.com/anthropics/claude-plugins-official/tree/m...
You've got to be completely insane to use AI coding tools at this point.
This is the subsidised cost to get users to use it; it could trivially end up ten times this amount. Plus, you've got the ultimate perverse incentive: the company selling you the model time to create the PRs is also selling you the review of those same PRs.
The relevant comparison for most enterprises isn't whether $15/PR is subsidised; it's whether it beats the alternative. For most shops that's cheap offshore labour plus the principal-engineer time spent reviewing it, managing it, and fixing what got merged anyway. Most enterprise code is trivial CRUD; if the LLM generates it and reviews it to an equivalent standard, you're already ahead.
If it takes 17 rounds of review from 5 different models/harnesses – I don't care. Just spit out the right code the first time. Otherwise I'm wasting my time clicking "review this" over and over until the PR is worth actually having a human look at.
I've had multiple situations where things "just worked", and other times where you have to steer it in the right direction a few times. Having another agent do the review works really well (with the right guardrails); it's like having someone with no other intent or bias review your code.
Unless you're talking about "vibe coding", in which case "correct" doesn't really matter, since you're not even looking at the output, just letting it go back and forth until something that works comes out. I haven't had much success with that way of working, nor enjoyed it as much. It took me a couple of months to find the sweet spot (my sweet spot; I think it'll be different for everyone).
All code review tools - Anthropic's, CodeRabbit, Copilot, Greptile - answer the same question: "are there bugs in this diff?" That's important but it's table stakes. The harder question teams actually need answered is: given the risk profile of this change, the identity and track record of what generated it, and my organization's policies, should this auto-merge, go to a human, or get escalated?
At $15-25/review, Anthropic is pricing for teams that treat every PR the same. But in practice, maybe 70% of PRs from established AI agents touching low-risk code paths don't need a $20 deep review. They need risk triage and an auto-merge decision. The remaining 30% - high-risk file paths, unfamiliar agent output, auth/payment code - those are where deep review pays for itself.
The signal-to-noise problem several people mention ("always finds 8 issues") is a symptom of the same thing. Without risk context, a review tool treats a CSS tweak and a payment flow refactor with equal intensity. The solution isn't better review - it's smarter routing of what gets reviewed by what, at what depth.
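Using the 70/30 split and the $15–25 price point from above, the economics of triage-before-deep-review can be sketched. The triage cost is a made-up figure for a cheap risk-classification pass, not a real product price:

```python
DEEP_REVIEW_COST = 20.0  # $/review, mid-range of the quoted $15-25
TRIAGE_COST = 0.50       # hypothetical cost of a cheap risk-classification pass

def review_cost(is_high_risk: bool) -> float:
    """Every PR gets triaged; only high-risk PRs get the deep review."""
    return TRIAGE_COST + (DEEP_REVIEW_COST if is_high_risk else 0.0)

# A batch of 100 PRs with the 30% high-risk / 70% low-risk split.
batch = [True] * 30 + [False] * 70
triaged = sum(review_cost(r) for r in batch)
blanket = DEEP_REVIEW_COST * len(batch)  # deep-reviewing everything
print(triaged, blanket)  # 650.0 2000.0
```

Under these assumptions, routing cuts the review bill by about two thirds while still putting the expensive pass exactly where it pays for itself.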
https://finance.yahoo.com/news/claude-just-killed-startup-sf...
The real competition for both Claude and the platforms is a skill running locally against the very same code.
It's totally worth it.
So part of the workflow becomes filtering signal vs noise.
(They mention their GitHub Action, which seems more like a system prompt.)