The current hype around autonomous agents, and what actually works in production

(utkarshkanwat.com)

427 points | Dachande663 | 10mo ago | 257 comments

I spoke with an Amazon AI production engineer who’s talking with prospective clients about implementing AI in our business. When a colleague asked about using generative AI in customer facing chats the engineer said he knows of zero companies who don’t have a human in the loop. All the automatic replies are non-generative “old” tech. Gen AI is just not reliable enough for anyone to stake their reputation on it.

PaulHoule10mo ago

Years ago I was interested in agents that used "old AI" symbolic techniques backed up with classical machine learning. I kept getting hired though by people who were working on pre-transformer neural nets for texts.

Something I knew all along was that you build the system that lets you do it with the human in the loop, collect evaluation and training data [1] and then build a system which can do some of the work and possibly improve the quality of the rest of it.

[1] in that order because for any 'subjective' task you will need to evaluate the symbolic system even if you don't need to train it -- if you need to train the system, on the other hand, you'll still need to eval

throwehshdhdy10mo ago

Plenty of tech companies have started using gen AI for live chat support. Off the top of my head I know of sonder.com and wealthsimple.com.

If the LLM can’t answer a query it usually forwards the chat to a human support agent.

actinium22610mo ago

Air Canada did this a bit ago, and their AI gave the customer a fake process for submitting claims for some sort of discount on airfare due to bereavement (the flight was for a funeral). The customer sued and Air Canada's defense was that he shouldn't have trusted the Air Canada AI chatbot. Air Canada lost.

medbrane10mo ago

That was in 2022, before LLMs, and they "lost" as in they had to pay back $482 USD.

nsonha10mo ago

Of course it can, but I think the issue is that people may try to jailbreak it or do something funny to get a weird response, then post on x.com against the company. There must be techniques to turn LLMs into a FAQ forwarding bot, but then what's the point of having an LLM?

raxxorraxor10mo ago

That is for selling support. A drone for a consumer drone. Nothing more than a little more sophisticated advertising banner.

This is not being part of a defined workflow that requires structured output.

Arn_Thor10mo ago

Perhaps. But it’s telling that someone whose job is selling those kinds of services wasn’t aware of any personally.

nominallyfree10mo ago

"This tech works fine as long as you have a back up for when it frequently fails"

raxxorraxor10mo ago

Gen AI can only support people. In our case it scans incoming mails for patterns of order or article numbers, whether the customer is already known, etc.

That isn't reliable either, but it supports the person who gets the mail on his desk in the end.

We sometimes get handwritten service protocols, and the model we are using is very proficient at reading handwritten notes that you would have difficulty parsing yourself.

It works most of the time, but not often enough that AI could give autogenerated answers. For service quality reasons we don't want to impose any chatbot or AI on a customer.

Also data protection issues arise if you use most AI services today, so parsing customer contact info is a problem as well. We also rely on service partners to tell the truth about not using any data...

alpha_squared10mo ago

One thing I'll add that isn't touched on here is about context windows. While not "infinite", humans have a very large context window for problems they're specialized in solving. Models can often overcome their context window limitations by having larger and more diverse training sets, but that still isn't really a solution to context windows.

Yes, I get the context window increases over time and that for many purposes it's already sufficient enough, but the current paradigm forces you to compress your personal context into a prompt to produce a meaningful result. In a language as malleable as English, this doesn't feel like engineering so much as it feels like incantations and guessing. We're losing so, so much by skipping determinism.

lxgr10mo ago

Humans don't have this fixed split into "context" and "weights", at least not over non-trivial time spans.

For better or worse, everything we see and do ends up modifying our "weights", which is something current LLMs just architecturally can't do since the weights are read-only.

globular-toast10mo ago

This is why I actually argue that LLMs don't use natural language. Natural language isn't just what's spoken by speakers right now. It's a living thing. Every day in conversation with fellow humans your very own natural language model changes. You'll hear some things for the first time, you'll hear others less, you'll say things that get your point across effectively first time, and you'll say some things that require a second or even third try. All of this is feedback to your model.

All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.

I've found LLMs to be quite useful for language stuff like "rename this service across my whole Kubernetes cluster". But when it comes to specific things like "sort this API endpoint alphabetically" I find the amount of time to learn to construct an appropriate prompt is the same as if I'd just learnt to program, which I already have done. And then there's the energy used by the LLM to do its thing, which is enormously wasteful.

HumblyTossed10mo ago

> All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.

This right here is the nail on the head. When you use (a) language to ask a computer to return you a response, there's a word for that and it's "programming". You're programming the computer to return data. This is just programming at a higher level, but we've always been increasing the level at which we program. This is just a continuation of that. These systems are not magical, nor will they ever be.

alpha_squared10mo ago

I agree, I'm mostly trying to illustrate how difficult it is to fit our working model of the world into the LLM paradigm. A lot of comments here keep comparing the accuracy of LLMs with humans and I feel that glosses over so much of how different the two are.

daveguy10mo ago

Honestly we have no idea what the human split is between "context" and "weights" aside from a superficial understanding that there are long term and short term memories. The long term memory/experience seems a lot closer to context than it is to dynamic weights. We don't suddenly forget how to do a math problem when we pick up an instrument (ie our "weights" don't seem to update as easily and quickly as context does for an LLM).

antisthenes10mo ago

> humans have a very large context window for problems they're specialized in solving

Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.

Can you provide some examples of problems where humans have such large context windows?

lelanthran10mo ago

> Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.

Human context windows are not linear. They have "holes" in them which are quickly filled with extrapolation that is frequently correct.

It's why you can give a human an entire novel, say "Christine" by Stephen King, then ask them questions about some other novel until their "context window" is filled, then switch to questions about "Christine" and they'll "remember" that they read the book (even if they get some of the details wrong).

> Can you provide some examples of problems where humans have such large context windows?

See above.

The reason is because humans don't just have a "context window", they have a working memory that is also their primary source of information.

IOW, if we change LLMs so that each query modifies the weights (i.e. each query is also another training data-point), then you wouldn't need a context window.

With humans, each new problem effectively retrains the weights to incorporate the new information. With current LLMs the architecture does not allow this.

gf00010mo ago

It's a very large context window, but it is compressed down a lot. I don't know every line of [insert your PL of choice]'s standard library, but I do know a lot of it, with many different excerpts from the documentation, relevant experiences where I used this over that, or edge cases/bugs that one might fall into. Add to it all the domain knowledge for the given project, with explicit knowledge of how the clients will use the product, etc., and even stuff like how your colleague might react to this approach vs another.

And all this can be novelly combined and reasoned with to come up with new stuff to put into the "context window", and it can be dynamically extended at any point (e.g. you recall something similar during a thought train and "bring it into context").

And all this was only the current task-specific window, which lives inside the sum total of your human experience window.

vntok10mo ago

If you're 50 years old, your personality is a product of 50-ish years. Another way to say this is that humans have a very large context window (that can span multiple decades) for solving the problem of presenting a "face" to the world (socializing, which is something humans in general are specifically good at).

KoolKat2310mo ago

Human multi-step workflows tend to have checkpoints where the work is validated before proceeding further, as humans generally aren't 99%+ accurate either.

I'd imagine future agents will include training to design these checks into any output, validating against the checks before proceeding further. They may even include some minor risk assessment beforehand, such as "this aspect is crucial and needs to be 99% correct before proceeding further".
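The checkpoint idea can be sketched in a few lines. This is only an illustration under stated assumptions: the step and check functions are hypothetical stand-ins for real agent actions and validators, and `run_with_checkpoints` is a name made up here.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, check): the action transforms the working value,
# the check validates the result before the next step is allowed to run.
Step = Tuple[str, Callable[[int], int], Callable[[int], bool]]


def run_with_checkpoints(value: int, steps: List[Step]) -> int:
    """Run steps in order, stopping at the first failed checkpoint."""
    for name, action, check in steps:
        value = action(value)
        if not check(value):
            # Halt early so a bad intermediate result cannot propagate.
            raise ValueError(f"checkpoint failed after step '{name}': {value!r}")
    return value


# Hypothetical two-step workflow used only for demonstration.
steps: List[Step] = [
    ("double", lambda x: x * 2, lambda x: x % 2 == 0),
    ("subtract", lambda x: x - 100, lambda x: x >= 0),
]

print(run_with_checkpoints(60, steps))  # 60 -> 120 -> 20, prints 20
```

Failing fast keeps the cost of a mistake bounded to one step instead of the whole chain.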

a_bonobo10mo ago

That's what Claude Code does - it constantly stops and asks you whether you want to proceed, including showing you the suggested changes before they're implemented. Helps with avoiding token waste and 'bad' work.

taurath10mo ago

Except when it decides it doesn’t need to do that anymore or forgets

KoolKat2310mo ago

that's good to hear, they're on their way there!

on a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely due to my fear of the cost.

Filligree10mo ago

The standard way to use Claude Code is with a constant-cost subscription; one of their standard website accounts. It’s rate-limited but still generous.

You can also use API tokens, yes, but that’s 5-10x more expensive. So I wouldn’t.

queenkjuul10mo ago

My work has a corporate subscription and on the one hand it's very impressive and on the other i don't actually find it useful.

csomar10mo ago

Lots of applications have to be redesigned around that. My guess is that micro-services architecture will see a renaissance since it plays well with LLMs.

lxgr10mo ago

Somebody will still need to have the entire context, i.e. the full end-to-end use case and corresponding cross-service call stack. That's the biggest disadvantage of microservices, in my experience, especially if service boundaries are aligned with team boundaries.

On the other hand, if LLMs are doing the actual service development, that's something software engineers could be doing :)

jvanderbot10mo ago

My AI tool use has been a net positive experience at work. It can take over small tasks when I need a break, clean up or start momentum, and generally provide a good helping hand. But even if it could do my job, the costs pile up really quickly. Claude Code can easily burn $25 per 1-2 hrs on a large codebase, and that's creeping along at a net positive rate assuming I can keep it on task and provide corrections. If you automate the corrections we are up to $50/hr or some tradeoff of speed, accuracy, and cost.

Same as it's always been.

For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.

torginus10mo ago

My somewhat cynical 2 cents: these thinking LLMs that constantly re-prompt themselves in a loop to fix their own mistakes, combined with the 'you don't need RAG, just dump all the code into the 1M-token context window' approach, align well with the 'we charge per token' business model.

stillsut10mo ago

One of the ideas I'm playing with is having AI generate several rough drafts of a commit at the outset, then filtering these both manually and with some automation before refining the survivors by hand.

_Knowing how way leads to way_, the larger the task, the more chance there is for an early deviation to doom the viability of the solution in total. Thus for even the SOTA right now, agents that can work in parallel to generate several different solutions can reduce your time of manually refactoring the generation. I wrote a little about that process here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...

swader99910mo ago

Subscription?

jvanderbot10mo ago

I have one, and upgrades don't have unlimited access as far as I can tell. Correct me if I'm wrong.

This cost scaling will be an issue for this whole AI employee thing, especially because I imagine these providers are heavily discounting.

13zebras10mo ago

Re: discounting… Given that OpenAI is burning billions and making trivial revenue in comparison, the cost per token is probably going to skyrocket when Sam runs out of BS to con the next investor. I’m guessing the only way that token cost doesn’t explode is if Claude ends up in Amazon’s hands and OpenAI is Microsoft’s. Then Amazon, Google, and MS can subsidize if they want. But as standalone businesses, they can’t make it at current token prices. IMHO

joshvm10mo ago

There are usage limits, but the argument is that unless you're writing and modifying large swathes of code in YOLO mode, you don't hit them. At least for what I would call a small and tedious task. I'm thinking "write a docstring", "add type annotations", "write a single unit test for this case", "fill in this function". For a good prompt these are often solved in <10 interactions. Especially when combined with scoped rules that are pulled in on demand to guide output.

neom10mo ago

"The real challenge isn't AI capabilities, it's designing tools and feedback systems that agents can actually use effectively." - this part I agree with - I'd been sitting the AI stuff out because I was unclear where I thought the dust would settle or what the market would accept, but recently joined a very small startup focused on building an agent.

I've gone from skeptical to willing to humor to "yeah this is probably right" in about 5 months, basically I believe: if you scope the subject matter very very well, and then focus on the tooling that the model will require to do its task, you get a high completion rate. There is a reluctance to lean into the non-deterministic nature of the models, but actually if you provide really excellent tooling and scope super narrowly, it's generally acceptably good.

This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.

mritchie71210mo ago

> I've built 12+ production AI agent systems across development, DevOps, and data operations

It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (as seemingly a solo dev?) and you're surprised?

we've been working on Definite[0] for 2 years with a small team and it only started getting really good in the past 6 months.

0 - data stack + AI agent: https://www.definite.app/

Rexxar10mo ago

He didn't say he made 12 independent saleable products, he says he built 12 tools that fill a need at his job and are used in production. They are probably quite simple and do a very specific task as the whole article is telling us that we have to keep it simple to have something useable.

mritchie71210mo ago

that's my point. He's "Betting Against AI Agents" without having taken a serious attempt at building one.

> agents that technically make successful API calls but can't actually accomplish complex workflows because they don't understand what happened.

It takes a long time to get these things right.

AstroBen10mo ago

They've built 12+ products with a full time job for the last 3 years

Something seems off about that...

senko10mo ago

His full time job is building AI systems for others (and the article is a well written promo piece).

If most of these are one-shot deterministic workflows (as opposed to the input-llm-tool loop usually meant by the current use of the term "ai agent"), it's not hard to assume you can build, test and deploy one in a month on average.

RamblingCTO10mo ago

I also build agents/ai automation for a living. Coding agents or anything open-ended is just a stupid idea. It's best to have human validated checkpoints, small search spaces and very specific questions/prompts (does this email contain an invoice? YES/NO).
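A narrow YES/NO question only helps if the reply is parsed strictly. A minimal sketch, assuming the model's reply arrives as a plain string; `parse_yes_no` is a hypothetical helper, not part of any library:

```python
from typing import Optional


def parse_yes_no(raw_reply: str) -> Optional[bool]:
    """Map a model reply to True/False, or None if it is not a clean YES/NO."""
    normalized = raw_reply.strip().strip(".!").upper()
    if normalized == "YES":
        return True
    if normalized == "NO":
        return False
    # Anything else (hedging, explanations, empty text) is left for a human.
    return None


print(parse_yes_no(" yes. "))        # prints True
print(parse_yes_no("Probably yes"))  # prints None
```

Returning `None` instead of guessing keeps the search space small and gives the human checkpoint something explicit to act on.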

Just because we'd love to have fully intelligent, automatic agents, doesn't mean the tech is here. I don't work on anything that generates content (text, images, code). It's just slop and will bite you in the ass in the long run anyhow.

murukesh_s10mo ago

I am also building an agent framework and also used chat coding (not vibe coding) to generate work - I was easily able to save 50% of my time just by asking GPT.

But it generates mistakes like say 1 in 10 times and I do not see it getting fixed unless we drastically change the LLM architecture. In future I am sure we will have much more robust systems if the current hype cycle doesn't ruin its trust with devs.

But the hit is real, I mean I would hire a lot less if I were to hire now, as I can clearly see the dev productivity boost. The learning curve for most topics is also drastically reduced, as the loss in Google search result quality is now compensated for by LLMs.

But the thing I can vouch for is automation and more streamlined workflows. I mean having normal human tasks being augmented by an LLM in a workflow orchestration framework. The LLM can return its confidence % along with the task results, and for anything less than ideal confidence % the workflow framework can fall back on a human. But if done correctly with proper testing, guardrails and all, I can see LLMs replacing human agents in several non-critical tasks within such workflows.
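A rough sketch of that fallback rule, assuming the model returns a self-reported confidence between 0 and 1. `TaskResult`, `route`, and the 0.9 threshold are illustrative choices, and a self-reported score may need calibrating against labeled data before it is trusted:

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    output: str
    confidence: float  # self-reported by the model, expected in [0.0, 1.0]


def route(result: TaskResult, threshold: float = 0.9) -> str:
    """Return 'auto' to accept the result, or 'human' to escalate it."""
    if not 0.0 <= result.confidence <= 1.0:
        # A malformed score is treated as untrusted.
        return "human"
    return "auto" if result.confidence >= threshold else "human"


print(route(TaskResult("refund approved", 0.72)))  # prints human
```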

The point is not replacing humans but automating most of the work so the team size would reduce. For e.g. large e-commerce firms have 100s of employees manually verifying product description, images etc, scanning for anything from typos to image mismatch to name a few. I can see LLMs going to do their job in future.

RamblingCTO10mo ago

I just left my CTO job for personal reasons. we tried coding agents, agentic coding, LLM-driven coding whatever. the code any of these generate is subpar (a junior would get the PR rejected for what it produces) and you just waste so much time prompting and not thinking. people don't know the code anymore, don't check the code and it's all just gotten very sloppy. so not hiring coders because of AI is a dangerous thing to do and I'd advise heavily against it. your risk just got way higher because of hype. maintainability is out of the window, people don't know the code and there are so many hidden deviations to the spec that it's just not worth it.

the truth is that we stop thinking when we code like that.

murukesh_s10mo ago

You are misunderstanding coding vs logic. Coding is making our logic (which is creativity) fit into someone else's syntax. If a machine is able to translate your logic expressed in your language into running code, what's wrong? Come on, that's our dream, isn't it? It's like not using a calculator because you are worried kids won't learn how to divide. Is assembly language better than C++? Yes. Did that prevent high-level languages from taking over the world? No.

If done right we all code through spec written in English, not code.

la_fayette10mo ago

In general I would agree, however the resulting systems of such an approach tend to be "just" expensive workflow systems, which could be done with old tech as well... Where is the real need for anything LLM here?

barbazoo10mo ago

Extracting structured data from unstructured text comes to mind. We’ve built workflows that we couldn’t before by bridging a non deterministic gap. It’s a business SaaS but the folks using our software seem to be really happy with the result.

RamblingCTO10mo ago

you are 100% right. LLMs are perfect for anything that requires heuristics. "is that project a good fit for client A given the following specifications ... rate it from 1-10". stuff like that. I use it as a solution for an undefined search space/problem essentially.

anon19192810mo ago

it would take months with old tech to create a bot that can check multiple websites for specific data or information? so LLM reduces the time a lot? am I wrong?

dlisboa10mo ago

Months? Scraping wasn’t a hard problem then. Classifying information is a different and more complex thing, which is what these models are very good at. Then again we had other means of classification before LLMs without having to go through chat bots.

stillsut10mo ago

Yes I agree: highly-focused-scope + low-stakes + high-chorelike-task is the sweet spot for agents currently.

I wrote a little about one such task, getting agents to supplement my markdown dev-log here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...

lxgr10mo ago

Human validation is certainly the most reliable way of introducing checkpoints, but there are others: running unit tests, doing ad-hoc validations of the entire system, etc.

RamblingCTO10mo ago

that goes without saying. but I'd argue: HITL is more of a workflow pattern, the rest are engineering patterns

danieltanfh9510mo ago

Same. https://danieltan.weblog.lol/2025/06/agentic-ai-is-a-bubble-...

The fundamental difference is we need HITL to reduce errors instead of HOTL which leads to the errors you mentioned

Retr0id10mo ago

> Each new interaction requires processing ALL previous context

I was under the impression that some kind of caching mechanism existed to mitigate this

blackbear_10mo ago

You have to compute attention between all pairs of tokens at each step, making the naive implementation O(N^3). This is optimized by caching the previous attention values, so that for each step you only need to compute attention between the new token and all previous ones. That's much better but still O(N^2) to generate a sequence of N tokens.
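The scaling can be checked with a simplified counting model. This sketch only counts token-pair attention computations and ignores head dimensions, layers, and memory traffic; both function names are made up for illustration.

```python
def pair_ops_without_cache(n: int) -> int:
    """Recompute attention between all pairs of the t-token prefix at every step."""
    # Step t handles a prefix of t tokens, so it touches t * t pairs.
    return sum(t * t for t in range(1, n + 1))


def pair_ops_with_cache(n: int) -> int:
    """With cached earlier values, only the newest token attends to the t tokens so far."""
    return sum(t for t in range(1, n + 1))


print(pair_ops_without_cache(4))  # 1 + 4 + 9 + 16, prints 30
print(pair_ops_with_cache(4))     # 1 + 2 + 3 + 4, prints 10
```

The first sum grows like N^3/3 and the second like N^2/2, which matches the cubic versus quadratic behavior described above.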

_heimdall10mo ago

Caching would only help to keep the context around, but caching would only be needed if it still ultimately needs to read and process that cached context again.

Retr0id10mo ago

You can cache the whole inference state, no?

They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...

_heimdall10mo ago

But that just avoids having to send the full context for follow-up requests, right? My understanding is that caching helps to keep the context around but can't avoid the need to process that context over and over during inference.

Too10mo ago

When inference requires maxing out the memory of a gpu, where are you planning to keep this cache? Unless there is a way to compress the context into a more manageable snapshot, the cloud provider surely won’t keep a gpu idling just for holding a conversation in memory.

ilaksh10mo ago

Yes, prompt caching helps a lot with the cost. It still adds up if you have some tool outputs with long text. I have found that breaking those out into subtasks makes the overall cost much more reasonable.

csomar10mo ago

My understanding is that caching reduces computation but the whole input is still processed. I don’t think is fully disclosing how their cache works.

LLMs degrade with long input regardless of caching.

stpedgwdgfhgdd10mo ago

Compact the conversation (CC)

vntok10mo ago

> Production systems need 99.9%+ reliability

This is not remotely true. Think of any business process around your company. 99.9% availability would mean only 1min26 per day allowed for instability/errors/downtime. Surely your human collaborators aren't hitting this SLA. A single coffee break immediately breaks this (per collaborator!).
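The 1min26 figure follows directly from the percentage. A small sketch of the arithmetic, where `allowed_downtime_seconds` is just an illustrative helper:

```python
def allowed_downtime_seconds(availability: float, period_seconds: int = 24 * 60 * 60) -> float:
    """Seconds of downtime permitted per period at the given availability."""
    return (1.0 - availability) * period_seconds


# 0.1% of an 86,400-second day is 86.4 seconds, i.e. about 1 min 26 s.
print(f"{allowed_downtime_seconds(0.999):.1f} s per day")  # prints 86.4 s per day
```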

Business Process Automation via AI doesn't need to be perfect. It simply needs to be sufficiently better than the status quo to pay for itself.

navane10mo ago

It's not just about up time. If the bridge collapses people die. Some of us aren't selling ads.

vntok10mo ago

If "the bridge collapses and people die" because the team has a 1min26 "downtime" on a specific day, which is what you are arguing, then you have much bigger problems to solve than the performance of AI agents.

GeneralMayhem10mo ago

Uptime and reliability are not the same thing. Designing a bridge doesn't require that the engineer be working 99.9% of minutes in a day, but it does require that they be right in 99.9% of the decisions they make.

Pasorrijer10mo ago

I think you're crossing reliability and availability.

Reliability means 99.9% of the time when I hand something off to someone else it's what they want.

Availability means I'm at my desk and not at the coffee machine.

Humans very much are 99.9% accurate, and my deliverable even comes with a list of things I'm not confident about

stavros10mo ago

An interesting comment I read in another post here is that humans aren't even 99.9% accurate in breathing, as around 1 in 1000 breaths requires coughing or otherwise cleaning the airways.

seadan8310mo ago

I would say reliability is availability times accuracy.

(Your point remains largely the same, just more precise with the updated definition replacing 'reliable' with 'accurate'.)

vntok10mo ago

> Humans very much are 99.9% accurate

This is an extraordinary claim, which would require extraordinary evidence to prove. Meanwhile, anyone who spends a few hours with colleagues in a predominantly typing/data entry/data manipulation service (accounting, invoicing, presales, etc.) KNOWS the rate of minor errors is humongous.

satyrun10mo ago

Yea exactly.

99.99% is just absurd.

The biggest variable though with all this is that agents don't have to one shot everything like a human because no one is going to pay a human to do the work 5 times over to make sure the results are the same each time. At some point that will be trivial for agents to always be checking the work and looking for errors in the process 24/7.
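Repeating the work and comparing results can be sketched as a simple majority vote. The assumptions here are that the task can be called repeatedly, that answers are comparable strings, and that `runs` is at least 1; `majority_answer` and the stubbed replies are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def majority_answer(task: Callable[[], str], runs: int = 5) -> Optional[str]:
    """Run the task several times; return the answer a strict majority agrees on."""
    counts = Counter(task() for _ in range(runs))
    answer, votes = counts.most_common(1)[0]
    # Without a strict majority, report no answer so the caller can escalate.
    return answer if votes > runs / 2 else None


# Stand-in for five agent runs that mostly, but not always, agree.
replies = iter(["42", "42", "41", "42", "40"])
print(majority_answer(lambda: next(replies)))  # prints 42
```

Agreement across runs is not proof of correctness, since the same mistake can repeat, but disagreement is a cheap signal that something needs a closer look.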

seadan8310mo ago

I wouldn't take the claim to mean that humans universally have an attribute called "accuracy" that is uniformly set to the value 99.9%.

The claim is pretty clearly 'can' achieve (humans) vs 'do' achieve (LLM). Therefore one example of a human building a system at 99.9% reliability is sufficient to support the claim. That we can compute and prove reliability is really the point.

For example, the function "return 3" 100% reliably counts the Rs in strawberry. We can see the answer never changes, if it is correct once therefore, it will always be correct because the answer is always the same correct answer. A LLM can't do that, and infamously gave inaccurate results to that problem, not even reaching 80% accuracy.

For the sake of discussion, I'll define reliability to be the product of availability and accuracy and will assume accuracy (the right answer) and availability (able to get any answer) to be independent variables. In my example I held availability at a fixed 100% to illustrate why being able to achieve high accuracy is required for high reliability.

So, two points: humans can achieve 100% accuracy in the systems they build because we can prove correctness and do error checking. Because LLM cannot do 100%, frankly, there is going to be a problem that shows a distinction between max capabilities. While difficult, humans can build highly reliable complex systems. The computer is an example, that all the hardware interfaces together so well and works so often is remarkable.

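Second, if every step along a pipeline is 99% reliable, then after 20 steps the whole thing succeeds only about 82% of the time (0.99^20), and at 95% per step that falls to about 36%, so we are no longer talking about a system that usually works, but one that _rarely_ works. For a 20 step system to work above 50%, each step needs to be better than about 96.6% reliable, and realistically it needs some steps that are effectively at 100%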
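The compounding can be checked numerically. A minimal sketch, assuming every step succeeds independently with the same probability; both function names are illustrative:

```python
def pipeline_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed for the whole chain to reach the target."""
    return target ** (1.0 / steps)


for p in (0.99, 0.95):
    print(f"{p:.2f} per step over 20 steps: {pipeline_success(p, 20):.3f}")
print(f"per-step reliability for 50% overall: {required_per_step(0.5, 20):.4f}")
```

Real pipelines rarely have independent, identical steps, so this is a rough bound, not a prediction.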

vrighter10mo ago

This comment makes the assumption that the software is cloud based and all that matters is uptime.

I used to work on a backup application, it ran locally on our clients' machines. We had over 10000 clients. A 99.9% reliability would mean that there are 10 of our customers, at any one point, having a problem. It's not a question of uptime. It's a question of data integrity in this case. So 99.9% reliability could even leave us open to, potentially, 10 lawsuits. Also, about 10 support calls per day.

Now we only had about 10k customers at the time. Imagine if it were millions.

lexicality10mo ago

Currently I'm thinking about how furious the developers get any time Jenkins has any kind of hiccough, even if the solution is just "re-run the workflow" - and that's just network timeouts! I don't want to imagine the tickets if the CI system started spitting out hallucinations...

hansmayer10mo ago

This may not be about internal business processes. In e-commerce 90 sec can be a lot of revenue lost, and in mission-critical applications such as telecommunications or air traffic control it would be a downright disaster (ever heard of five nines availability?).

lerchmo10mo ago

A lot of deterministic systems externalize their edge cases to the user. The software design doesn’t fully match the reality of how it gets used. AI can be far more flexible in the face of dynamic and variable requirements.

lukaslalinsky10mo ago

I was one of the early adopters of GitHub Copilot and generally a proponent of AI assisted coding. I've recently tried "vibe coding" and oh my god, the experience couldn't be more different to what I was used to. It feels like a super expensive machine, making all kinds of mistakes and charging me for all of them. So many trial and error attempts. I ask it to do X, it conveniently does a lot of work around it, but in the end X does not work, so it just comments it out as a minor issue. It requires so much micro management, that I don't really see the purpose. Much easier and faster to just write the code myself and let it help me in that process. With agents, I feel like I'm the one helping it get a job done. I honestly can't imagine trusting this with any kind of production process.

actinium22610mo ago

Very nice article. The point about mathematical reliability is interesting. I generally agree with it, but humans aren't 100% reliable, or even 99% reliable, so how do we manage to create things like the Linux kernel or the Mars landers without AI? Clearly we have some sort of goal-based self-correction mechanism. I wonder if there's research into AI on that thread?

an0malous10mo ago

> Clearly we have some sort of goal-based self-correction mechanism.

Humans can try things, learn, and iterate. LLMs still can't really do the second thing, you can feed back an error message into the prompt but the learning isn't being added to its weights so its knowledge doesn't compound with experience like it does for us.

I think there are still a few theoretical breakthroughs needed for LLMs to achieve AGI and one of them is "active learning" like this.

cosmic_cheese10mo ago

Additionally, LLMs still don’t truly understand anything, which is why they flounder so badly with e.g. writing code for a programming language or framework that they haven’t seen a large enough set of training data for. Humans on the other hand do understand and generalize shared knowledge well, which is why we’re much better at handling that type of scenario.

More specific to agents, humans can also figure out how to use tools on the fly (even in the absence of documentation) where LLMs need human-built MCPs. This is also a significant limiting factor.

tkz131210mo ago

I’ve found claude to be very helpful when both writing and debugging code written in a language i’m currently building. I just make sure to load the spec into its context first and that seems to be enough for it to get a general understanding.

Vetch10mo ago

Compounding with learn and iterate, humans also build abstractions which significantly shorten the number of steps required. These are more expressive programming languages, compilers and toolchains. We also build engines, libraries, DSLs and invent appropriate data-structures to simplify the landscape or reuse existing work. Besides abstractions, we build tools like better type systems, error testing and borrow checkers to help eliminate certain classes of errors. Finally, after all is said and done, we still have QA teams and major bugs.

airstrike10mo ago

100% and it seems like we need a whole new architecture to get there, because right now training a model takes so much time.

At the risk of making a terrible analogy, right now we're able to "give birth" to these machines after months of training, but once they're born, they can't really learn. Whereas animals learn something new every day, go to sleep, clean up their memories a bit, deleting some, solidifying others, and wake up with an improved understanding of the world.

bot40310mo ago

Maybe you're on to something. We need AI lions which will eat the models which don't learn or adapt enough.

psadri10mo ago

You could instruct the LLM to formulate a “lesson” based on the error and add this to the tool instructions for future runs.

ch4s310mo ago

This isn’t practical at scale. You’ll run into too many novel lessons and burn through too many tokens setting up context.
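A rough sketch of the "lesson" idea with a hard cap, which is one way to respond to the token-cost objection. Everything here is hypothetical: the `LessonBook` class, its limits, and the assumption that the lesson text itself comes from an LLM call that is not shown.

```python
from collections import OrderedDict


class LessonBook:
    """Keeps a bounded set of lessons to prepend to tool instructions."""

    def __init__(self, max_lessons: int = 20, max_chars: int = 2000):
        self.max_lessons = max_lessons
        self.max_chars = max_chars
        self._lessons = OrderedDict()  # lesson text -> times seen

    def add(self, lesson: str) -> None:
        lesson = lesson.strip()
        if not lesson:
            return
        # Re-seeing a lesson bumps its count and marks it as most recent.
        self._lessons[lesson] = self._lessons.get(lesson, 0) + 1
        self._lessons.move_to_end(lesson)
        # Evict the least recently seen lessons once over the cap.
        while len(self._lessons) > self.max_lessons:
            self._lessons.popitem(last=False)

    def render(self) -> str:
        """Newest lessons first, cut off at the character budget."""
        lines, used = [], 0
        for lesson in reversed(self._lessons):
            line = f"- {lesson}"
            if used + len(line) + 1 > self.max_chars:
                break
            lines.append(line)
            used += len(line) + 1
        return "\n".join(lines)
```

The cap keeps the context cost bounded, at the price of forgetting older lessons, so it does not answer the objection that novel lessons keep arriving.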

bwfan12310mo ago

Humans build theories of how things work. LLMs don't. Theories are deterministic and symbolic. Take the turing machine for example as a theory of computation in general, euclidean geometry as a theory for space, and newtonian mechanics as a theory for motion.

Even for software applications like the Linux kernel, there would have been a theory in Linus' head - for example of what an operating system is, and how it should work.

A theory gives 100% correct predictions. Although the theory itself may not model the world accurately. Such feedback between the theory, and its application in the world causes iterations to the theory. From newtonian mechanics to relativity etc. From euclidean geometry to geometry of curved spaces etc.

Long story short, the LLM is a long way away from any of this. And to be fair to LLMs, the average human is not creating theories, it takes some genius to create them (newton, turing, etc). The average human is trading memes on social media.

chubot10mo ago

I believe there was an article/paper in the last few months about that exact issue

Someone was saying that with an increasing number of attempts, or increasing context length, LLMs are less and less likely to solve a problem

(I searched for it but can't find it)

That matches my experience -- the corrections in long context can just as easily be anti-corrections, e.g. turning something that works into something that doesn't work

---

Actually it might have been this one, but there are probably multiple sources saying the same thing, because it's true:

Context Rot: How Increasing Input Tokens Impacts LLM Performance - https://news.ycombinator.com/item?id=44564248

In this report, we evaluate 18 LLMs, including the state-of-the-art GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 models. Our results reveal that models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.

---

As for this question: how do we manage to create things like the Linux kernel or the Mars landers without AI

It's because human intelligence is a totally different thing than LLMs (contrary to what interested people will tell you)

Carmack said there are at least 5 or 6 big breakthroughs left before "AGI", and I think even that is a misleading framing. It's certainly possible that "AGI" will not be reached - there could be hardware bottlenecks, software/algorithmic questions, or other obstacles we haven't thought of

That is, I would not expect AI to create anything like the Linux kernel. The burden of proof is on the people who claim that, not the other way around !!!

actinium22610mo ago

I saw your edit with the paper, but when you first mentioned it I thought you might have been referring to the Apple paper that more or less said the same thing.

Speaking of Apple, I just want to get it out there that I'm impressed that they're exhibiting self restraint in this AI era. I know they get bashed for not being "up to speed" with "the rest of the industry," but I believe they're doing this on purpose because they see what slop it is and they'd prefer to scope it down to release something more useful.

corimaith10mo ago

Humans aren't 100% reliable but we can build tools that are 100% reliable to verify our predictions.

YeGoblynQueenne10mo ago

We don't generate chains of tokens with a constant error rate so errors don't pile up. Don't ask me what we do instead for I have no clue but whatever it is, it works better than next token prediction.

Hey, maybe humans aren't just like LLMs after all.

Dachande663OP10mo ago

OP here. I posted this this morning and then promptly forgot about it. How come the title has been changed from the blog post's own?

wrp10mo ago

That was annoying. Saw the post, then later had a hard time finding it again.

hoverbot10mo ago

Great practical takes - agentic chatbots work only with real data and feedback loops. HoverBot is built this way: it automates RAG pipelines, includes confidence thresholds, and supports human-in-loop override. You can run vertical assistants (support, sales, HR) from a single dashboard with real-world reliability.

nextworddev10mo ago

Actually, author should be bullish on autonomous agents considering 90% of what he's even able to do now wasn't even possible in early 2024, so you shouldn't bet against the slope of progress

infecto10mo ago

Link does not work for me but as someone who does a lot of work with LLMs I am also betting against agents.

Agents have captivated the minds of groups of people in each large engineering org. I have no idea what their goal is other than that they work on “GenAI”. For over a year now they have been working on agents with the promise that the next framework that MSFT or Alphabet publishes will solve their woes. They don’t actually know what they are solving for except everything involves agents.

I have yet to see agents solve anything, but for some reason there is this idea that having an agent you can send anything and everything to will solve all problems for the company. LLMs have a ton of interesting applications, but agents have yet to grab me as interesting, and I also don’t understand why so many large companies have focused time on them. They are not going to be cracking the code ahead of a commercial tool or open source project. In the time spent toying around with agents, a lot of interesting applications could have been built, some of which may technically be agents, but without so much focus and effort on trying to solve for all use cases.

Edit: after rereading my post wanted to clarify that I do think there is a place for tool call chains and the like but so many folks I have talked to first hand are trying to create something that works for everything and anything.

globular-toast10mo ago

I think in general if everyone is talking about a solution and nobody is talking about problems then it's a sign we're in a bubble.

For me the only problem I have is I find typing slow and laborious. I've always said if I could find a way to type less I would take it. That's why I've been using tab completion and refactoring tools etc for years now. So I'm kind of excited about being able to get my thoughts into the computer more quickly.

But having it think for me? That's not a problem I have. Reading and assimilating information? Again, not a problem I have. Too much of this is about trying to apply a solution where there is no problem.

georgeplusplus10mo ago

Maybe you are in a job where it’s not a good use case, but there are fields that handle massive amounts of data or spend a huge amount of time waiting on data processing before moving to the next step. There, I think handing the work off to an AI agent, with a human then putting the pieces together based on their own logic and experience, would work quite nicely.

satyrun10mo ago

The HN fallacy that is a large % of the posts on AI

"AI is not good for what I do, therefore AI is useless"

apwell2310mo ago

not quite sure what you are proposing here. what exactly is AI agent solving in this example?

I keep hearing vague stuff exactly like your comment at work from management. It's so infuriating.

A4ET8a8uTh0_v210mo ago

<< I also don’t understand why so many large companies have focused time around it. They are not going to be cracking the code ahead of a commercial tool or open source project.

I think it is a mix of fomo and the 'upside' potential of being able to minimize ( ideally remove ) the expensive "human component". Note, I am merely trying to portray a specific world model.

<< In the time spent toying around with agents there are a lot of interesting applications that could have built, some of which may be technically an agent but without so much focus and effort on trying to solve for all use cases.

Preaching to the choir man. We just got custom AI tool ( which manages to have all my industry specific restrictions rendering it kinda pointless, low context making it annoying, and slower than normal, because it now has to go through several layers of approval including 'bias' ).

At the same time, committee bickers over minute change to a process that has effectively no impact on anything of value.

Bonkers.

dickersnoodle10mo ago

>I think it is a mix of fomo and the 'upside' potential of being able to minimize ( ideally remove ) the expensive "human component". Note, I am merely trying to portray a specific world model.

IOW, it's a case of C-suite "monkey see, monkey do" kicked off by management consultants with crap to sell for very high prices...

johnisgood10mo ago

I have no idea what agents are for, could be my own ignorance.

That said, I have been using LLMs for a while now with great benefit. I did not notice anything missing, and I am not sure what agents bring to the table. Do you know?

ivape10mo ago

You are a manual agent to LLMs when you use things like ChatGPT. You go through a workflow loop when you try to investigate and consult with an LLM. Agents are just trying to automate your workflow against an LLM. It's basically just scripting. Scripting these LLMs is where we all want to go, but the context window length is a limiting factor, as well as inferencing on any notable sized window.

I'll manage my whiney emotions over the term Agents, but you'll have to hold a gun to my head before I embrace "Agentic", which is a thoroughly stupid word. "Scripted workflow" is what it is, but I know there are some true "visionaries" out there ready to call it "Sentient workflow".

johnisgood10mo ago

Exactly, thank you.

What I am doing is definitely manual, it is the old-fashioned prompt-copy-paste-test-repeat cycle, but it has been educational.

stavros10mo ago

I will join you in the fight against "agentic". Ridiculous.

mhog_hn10mo ago

An agent is an LLM + a tool call loop - it is quite a step up in terms of value in my experience
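For anyone unsure what a tool call loop looks like, here is a minimal sketch. The model is a hard-coded stand-in (`fake_model`), and the `CALL`/`FINAL` text protocol and the two tools are invented for illustration; a real setup would call an LLM API and use its structured tool-call format instead.

```python
from typing import Callable, Dict, List

# Tools are plain functions from a string argument to a string result.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "upper": lambda arg: arg.upper(),
}


def fake_model(history: List[str]) -> str:
    """Stand-in for an LLM call (hypothetical): returns either
    'CALL <tool> <arg>' or 'FINAL <answer>' based on the history so far."""
    last = history[-1]
    if last.startswith("USER"):
        return "CALL add 2+3"
    if last.startswith("TOOL"):
        return "FINAL the sum is " + last.split(" ", 1)[1]
    return "FINAL (no answer)"


def run_agent(question: str, model=fake_model, max_steps: int = 5) -> str:
    history = [f"USER {question}"]
    for _ in range(max_steps):
        reply = model(history)
        history.append(f"MODEL {reply}")
        if reply.startswith("FINAL "):
            return reply[len("FINAL "):]
        # Otherwise run the requested tool and feed its output back in.
        _, name, arg = reply.split(" ", 2)
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"unknown tool {name}"
        history.append(f"TOOL {result}")
    return "gave up: step limit reached"
```

The loop is the whole trick: the model decides, the harness executes, the result goes back into the history, and a step limit keeps it from running forever.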

jsemrau10mo ago

Agents are more than that.

Agents, besides tool use, also have memory, can plan work towards a goal, and can, through an iterative process (Reflect - Act), validate if they are on the right track.

infecto10mo ago

Not a disagreement with you but wanted to further clarify.

I do think it’s a step up when done correctly. Thinking of tools like Cursor. Most of my concern and issue comes from the number of folks I have seen trying to create a system that solves everything. I know in my org people were working on Agents without even a problem they were solving for. They are effectively trying to recreate ChatGPT, which to me is a fool’s errand.

johnisgood10mo ago

What is the use case? What does it solve exactly, or what practical value does it give you? I am not sure what a tool call loop is.

JKCalhoun10mo ago

Link is working for me — perhaps it was not 30 minutes ago? (Safari, MacOS)

wooque10mo ago

[flagged]

infecto10mo ago

That’s a bit reductive and misses the core issue. Of course companies want to reduce headcount or boost productivity, but many are pursuing these initiatives without a clear problem in mind. If the mandate were, say, “we’re building X to reduce customer support staff by 20%,” that would be a different story. Instead, it often feels like solution-first thinking without a clear target.

Edit: not even going to reply to comments below as they continue down a singular path of oh you ought to know what they are trying to do. The only point I was making is orgs are going solution-first without a real problem they are trying to solve and I don’t think that is the right approach.

exe3410mo ago

> “we’re building X to reduce customer support staff by 20%,”

I've never understood the "do X to increase/decrease Y by Z%". I remember working at McDonalds and the managers worked themselves up into a frenzy to increase "sale of McSlurry by 10%". All it meant was that they nagged people more and sold less of something else. It's not like people's stomachs got 10% larger.

figassis10mo ago

That is not a goal that can be shared without alienating the current workforce. So you can bet that goal was clearly stated at CXO level, and is being communicated/translated piecewise as "let’s find out how much more productive we can get with AI". You’re going to find out about the goal once you reach it.

That is not to say you should work against your company, but bear in mind this is a goal and you should consider where you can add value outside of general code factory productivity and how for example you can become a force multiplier for the company.

sfink10mo ago

I agree, and would like to hear examples of where this has not been the case. I'm sure they're out there. But pretty much everything has been "how can we use LLMs" and "it doesn't matter if it was a problem that we had that needed to be solved; we need to gain experience now because AI is The Future and we can't be left behind".

Occasionally it works and people stumble across a problem worth solving as they go about applying their solution to everything. But that's not planning or top-down direction. That's not identifying a target in advance.

apwell2310mo ago

yes my organization head at my employer has asked us to submit: "Generative AI Agent" proposals for upcoming planning session. Apparently those ideas will get the big seat at the planning table. I've been trying to think of many ideas but they all end up being some sort of workflow automation that was possible without agent stuff.

Agreed with your annoyance at "they are replacing you" comments. like duh. That's what they've been doing forever.

oceanparkway10mo ago

I think the "math" on reliability-over-steps will end up differently than described here in the long term because getting new factual input from the real world should improve the reliability of the end state, and we have all observed agentic systems at this point producing that behavior at least sometimes (e.g., a test failure prompts claude code to refactor correctly).

Whether or not one term in this equation currently compounds faster is a good question, or under what circumstances, etc., but presenting agentic abilities as always flawed thinking resulting in impossible long term task execution isn't right. Humans are flawed and require long, drawn out multi task thinking to get correct answers, and interacting with and getting feedback from the world outside the mind during a task execution process typically raises the chance of the correct answer being spit out in the end.

I'd agree that the agentic math isn't great at the moment, but if it's possible to reduce hallucinations or raise the strength and frequency effect of real world feedback on the model, you could see this playing out differently perhaps quite soon. There's at least a couple of examples of "we're already there".

afro8810mo ago

> Error rates compound exponentially in multi-step workflows. 95% reliability per step = 36% success over 20 steps. Production needs 99.9%+.

This misses a key feature of agents though. They get feedback from linters, build logs, test runs and even screenshots. And they collect this feedback themselves. This means they can error correct some mistakes along the way.

The math works out differently, depending on how well it can collect automated feedback it is doing what you want.
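A sketch of how the arithmetic could shift once failed steps are detected and retried. The model is an assumption, not the article's: each attempt succeeds independently with probability `p`, a failure is caught with probability `detect` (linter, build log, test run), and an uncaught failure is final.

```python
def step_success_with_retries(p: float, attempts: int, detect: float = 1.0) -> float:
    """Chance a step ends up correct when a failed attempt is noticed with
    probability `detect` and then retried, up to `attempts` tries in total."""
    success = 0.0
    reach = 1.0  # probability that we reach this attempt at all
    for _ in range(attempts):
        success += reach * p
        # Another try happens only if this attempt failed AND the failure was caught.
        reach *= (1 - p) * detect
    return success


def chain_success(p_step: float, steps: int) -> float:
    """All `steps` independent steps must succeed."""
    return p_step ** steps


if __name__ == "__main__":
    for detect in (0.0, 0.5, 0.9, 1.0):
        p_step = step_success_with_retries(0.95, attempts=3, detect=detect)
        print(f"detect={detect:.1f}: step={p_step:.4f}, 20 steps={chain_success(p_step, 20):.1%}")
```

With no detection this reduces to the quoted 36% over 20 steps; with perfect detection and three attempts per step the chain comes out above 99% under these assumptions. How close real feedback gets to perfect detection is the open question.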

whazor10mo ago

Correct, I think it is better to see it as multiple stages. Investigation stage might spin off tasks to read files, perform searches online, summarise the request. Then ‘main stage’ where it performs changes. Afterwards indeed the testing+fixing stage where it verifies the results and potentially performs a couple fixes. These plans are predictable and the models learn which steps are relevant for particular projects.

For context, relevant information from steps can be cherrypicked to next stage.

The math works differently because AI (mostly) ignores irrelevant results. So steps actually increase reliability overall.

ankit21910mo ago

These are all solvable problems. The issue is given the race to get to a certain ARR quickly, many startups end up not focusing on these. There is some truth to AI agents being not as useful as their promise, but the problems mentioned are engineering problems, and once we start seeing them with a different lens, they would start working. (This is not to say I believe orchestration or multi step agents are a way to go, I personally lean heavily towards RL. Just that the criticisms here assume the state would remain the same even without AI advancement).

Eg: you need good verifiers (to understand whether a task is done successfully or not). Many tasks have easier verifications than doing the task. If you have five parallel generations with 80% accuracy, the probability of getting at least one right (given a verifier which can pick it) goes to 99.96%. With multi step too, the math changes in a similar manner. It just needs a different approach than how we have built software till date. He even hints at a paradigm with 3-5 discrete step workflow which works superbly well. We need to build more in that way.
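The five-generation figure can be checked with a short sketch. It assumes the generations are independent; the second function adds a hypothetical `verifier_recall` parameter for a verifier that sometimes fails to recognise a correct candidate, and it ignores false accepts.

```python
def at_least_one_correct(p: float, n: int) -> float:
    """P(at least one of n independent generations is correct)."""
    return 1 - (1 - p) ** n


def best_of_n(p: float, n: int, verifier_recall: float = 1.0) -> float:
    """Same, but each correct candidate is only recognised with
    probability `verifier_recall` (false accepts are ignored here)."""
    return 1 - (1 - p * verifier_recall) ** n


if __name__ == "__main__":
    print(f"perfect verifier: {at_least_one_correct(0.8, 5):.5f}")
    print(f"90% recall:       {best_of_n(0.8, 5, verifier_recall=0.9):.5f}")
```

With a perfect verifier the result is 1 - 0.2^5 = 0.99968, in line with the figure above. The number only holds as long as the verifier is reliable and the generations really are independent.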

majormajor10mo ago

> Many tasks have easier verifications than doing the task.

In the software world (like the article is talking about) this is the logic that has ruthlessly cut software QA teams over the years. I think quality has declined as a result.

Verifiers are hard because the possible states of the internal system + of the external world multiply rapidly as you start going up the component chain towards external-facing interfaces.

That coordination is the sort of thing that really looks appealing for LLMs - do all the tedious stuff to mock a dependency, or pre-fill a database, etc - but they have an unfortunate tendency to need to be 100% correct in order for the verification test that depends on them to be worth anything. So you can go further down the rabbit hole, and build verifiers for each of those pre-conditions. This might recurse a few times. Now you end up with the math working against you - if you need 20 things to all be 100%, then even high chances of each individual one starts to degrade cumulatively.

A human generally wouldn't bother with perfect verification of every case, it's too expensive. A human would make some judgement calls of which specific things to test in which ways based on their intimate knowledge of the code. White box testing is far more common than black box testing. Test a bunch of specific internals instead of 100% permutations of every external interface + every possible state of the world.

But if you let enough of the code to solve the task be LLM-generated, you stop being in a position to do white-box testing unless you take the time to internalize all the code the machine wrote for you. Now your time savings have shrunk dramatically. And in the current state of the world, I find myself having to correct it more often than not, further reducing my confidence and taking up more time. In some places you can try to work around this by adjusting your interfaces to match what the LLM predicts, but this isn't universal.

---

In the non-software world the situation is even more dire. Often verification is impossible without doing the task. Consider "generate a report on the five most promising gaming startups" - there's no canonical source to reference. Yet these are things people are starting to blindly hand off to machines. If you're an investor doing that to pick companies, you won't even find out if you're wrong until it's too late.

ankit21910mo ago

This is not an NxM verifier hell. I explicitly talked about one way which is parallel generation + classifier. You can also use majority voting here. Both would give you the right answer at each step without having to write code or test cases, just a simple prompt. There are more ways to do the same, eg: verifier blocks, layering, backtracking search (end to end assertions and then see which step went wrong), simple generative verifiers with simpler prompts and so on.

For non software world, people use majority voting most of the time.
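For the majority-voting variant, a sketch of the odds under simple assumptions: `n` independent voters, each correct with probability `p`, and every wrong answer counted as the same wrong answer (the worst case for voting, so this is a conservative estimate).

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(a strict majority of n independent voters is correct), n odd."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for p in (0.4, 0.6, 0.8, 0.95):
        print(f"p={p:.2f}: majority of 5 correct with probability {majority_correct(p, 5):.4f}")
```

Voting helps only when each voter is right more often than wrong: at 80% per voter, five votes give about 94%, while at 40% per voter voting makes the result worse.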

eska10mo ago

That’s a common fallacy. I suggest you make a plot of failure rate vs amount of components that can fail, any one of them failing leading to a total failure. You’ll be shocked by how quickly you get terrible numbers.
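In place of a plot, a small table makes the same point. This sketch assumes the components fail independently and that any single failure means total failure.

```python
def total_failure_rate(p_component: float, n: int) -> float:
    """Failure rate of a chain where any one of n independent
    components failing means total failure."""
    return 1 - p_component ** n


if __name__ == "__main__":
    reliabilities = (0.95, 0.99, 0.999)
    print("components  " + "  ".join(f"p={p}" for p in reliabilities))
    for n in (1, 5, 10, 20, 50, 100):
        row = "  ".join(f"{total_failure_rate(p, n):6.1%}" for p in reliabilities)
        print(f"{n:10d}  {row}")
```

Even at 99% per component, a hundred components in series fail more often than they succeed under these assumptions.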

ankit21910mo ago

I talk about it from experience. How else do you think people are training RL agents if not based on verifiers? You don't have to verify every output at every step, you just need enough to course correct the agent and catch early when it's going wrong. That is the exact fallacy I was trying to address. The optimization comes from realizing the critical checks and then what passes to the next step. Requires letting go of the previous thinking and changing paradigms.

The failure rate is high because you view it in series. At test time you need to know what is correct from the options (including nothing correct), you don't need to know why it failed. You can debug later. The challenge is how easily can you return to the right track.

throwaway42334210mo ago

Is it reasonable to assume the five generations are independent?

ankit21910mo ago

They are not completely independent. It's a good assumption though. If a model encounters something out of distribution then all five of the generations will fail. If the model knows and went in a wrong direction (due to lack of reliability), within five generations, it can be corrected. You need evals, runtime verifiers as basic harness for AI systems.
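One way to make the partial dependence concrete is a toy model, which is an assumption for illustration rather than anything measured: with probability `q_ood` the input is out of distribution and every generation fails together; otherwise the generations are independent.

```python
def best_of_n_with_shared_failures(p: float, n: int, q_ood: float) -> float:
    """With probability q_ood the input is out of distribution and every
    generation fails; otherwise the n generations are independent."""
    return (1 - q_ood) * (1 - (1 - p) ** n)


if __name__ == "__main__":
    for q in (0.0, 0.05, 0.1):
        for n in (1, 5, 50):
            print(f"q_ood={q:.2f}, n={n:2d}: {best_of_n_with_shared_failures(0.8, n, q):.4f}")
```

In this model the shared failure mode sets a ceiling of 1 - q_ood that no number of extra generations can lift, which is where evals and runtime verifiers would have to do the work.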

jackblemming10mo ago

This is correct. Multiple different agents trying, multiple retries, and many other different solutions can help with this. I have seen agents try one method, get negative feedback, and then try another working method.

hannofcart10mo ago

> Let's do the math. If each step in an agent workflow has 95% reliability, which is optimistic for current LLMs, then: 5 steps = 77% success rate 10 steps = 59% success rate 20 steps = 36% success rate Production systems need 99.9%+ reliability.

(End quote)

Isn't this just wrong? Isn't the author conflating accuracy of LLM output in each step to accuracy of final artifact which is a reproducible deterministic piece of code?

And they're completely missing that a person in the middle is going to intervene at some point to test it and at that point the output artifact's accuracy either goes to 100% or the person running the agent would backtrack.

Either I am missing something or this does not seem well thought through.

vrighter10mo ago

How is it that the final result is a reproducible deterministic piece of code, when the prompts become the "source code" itself, and the underlying model used is constantly changing (being updated), which is equivalent to your programming language changing its semantics every other day and refusing to tell you exactly what has changed (because they can't). Not to mention the nondeterminism that a lot of times is present due to nondeterministic order of evaluation when parallelizing?

coliveira10mo ago

He's not wrong. The numbers are too pessimistic, however when building software the numbers don't need to be as high for a complete disaster to happen. Even if just 1% of the code is bad, it is still very difficult to make this work.

And you mention testing, which certainly can be done. But when you have a large product and the code generator is unreliable (which LLMs always are), then you have to spend most of your time testing.

hungryhobbit10mo ago

Did you even finish the article? The end is all about the trade-off of when "a person in the middle is going to intervene".

In fact, the point of the whole article isn't that AI doesn't work; to the contrary, it's that long chains of (20+) actions with no human intervention (which many agentic companies promise) don't work.

lmeyerov10mo ago

I used to believe the error rate fallacy, but:

1. Multi-turn agents can correct themselves with more steps, so the reductive error cascade thinking here is more wrong than right in my experience

2. The 99.9% production requirement is so contextual and misleading, when the real comparison is often something like "outage", "dead air", "active incident", "nobody on it", "prework before/around human work", "proactive task no one had time for before", etc.

Similar to infra as code, CI, and many other automation processes, there's mountains of work that isn't being done and LLMs can do entirely or large swathes of

vrighter10mo ago

How about those large swaths are done with LLMs, but instead of spending all that time reviewing that code (really reviewing it, not just a brief LGTM), which would make the time savings moot, you just decide to personally assume responsibility for that code being dead wrong sometimes and the consequences it causes (you cannot blame the AI. As far as anyone is concerned, you wrote the code and signed off on it). As in, legal liability. Would you take the deal?

thedudeabides510mo ago

See "Campbell's Completeness Conjecture" https://www.campbellramble.ai/p/dont-trust-machines

constantcrying10mo ago

No, it is not "mathematically impossible". It is empirically implausible. There is no statement in mathematics that says that agents can not have a 99.999% reliability rate.

Also, if you look at any human process you will realize that none of them have a 100% reliability rate. Yet, even without that we can manufacture e.g. a plane, something which takes millions of steps, each without a 100% success rate.

I actually think the article makes some good points, but especially when you are making good points it is unnecessary to stretch credibility by exaggerating your arguments.

macleginn10mo ago

This is a good point, but it seems, empirically, that most parts of a standard passenger airplane have reliability approximating 100% in a predefined time window with proper inspection and maintenance, otherwise passenger transit would be impossible. When the system does start to degrade, e.g. because replacement parts and maintenance become unavailable or too costly (cf. the use of imported planes by Russian airlines after the sanctions hit), incidents quickly start piling up.

constantcrying10mo ago

It's about what you do with errors. If you let them compound they lead to destruction, if instead you inspect, maintain, reinspect, replace, etc. you can manage them.

My point was that something extremely complex, like a plane, works, because the system tries hard to prevent compounding errors.

sarchertech10mo ago

That works because each plane is (nearly) exactly the same as the one before it and we have exact specifications for the plane.

You can do maintenance, inspections, and replacement because of those specifications.

In software the equivalent of blueprints is code. The room for variation outside software “specifications” is infinite.

Human reliability when it comes to assembling planes is also much higher than 99%, and LLM reliability creating code is much, much lower than 99%.

john_minsk10mo ago

Valid point, however the promise of AI is that it will be able to manufacture a metaphorical “plane” for each and every prompt the user inputs, i.e. give 100% overall reliability by using all kinds of techniques (testing, decomposing etc) that intelligence can come up with.

So until these techniques are baked into the model by OpenAI, you have to come up with these ideas yourself.

deadbabe10mo ago

I just want someone to give me one legit use case where an AI Agent now enables them to do something that couldn’t be done before, and actually makes an impact on overall profit.

stavros10mo ago

I can write code I wouldn't have been bothered to before, and make money from it.

Xmd5a10mo ago

>A database query might return 10,000 rows, but the agent only needs to know "query succeeded, 10k results, here are the first 5." Designing these abstractions is an art.

It seems the author never used prompt/workflow optimization techniques.

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow https://arxiv.org/pdf/2501.16673
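The abstraction in the quoted passage (report success, a count, and the first few rows rather than the full result) could be sketched like this. The function name and the shape of the returned dictionary are illustrative assumptions, not an existing API.

```python
from typing import Any, Dict, Sequence


def summarize_rows(rows: Sequence[Dict[str, Any]], preview: int = 5) -> Dict[str, Any]:
    """Compact view of a query result for the agent's context."""
    return {
        "status": "ok",
        "row_count": len(rows),
        "columns": sorted(rows[0].keys()) if rows else [],
        "preview": list(rows[:preview]),
        "truncated": len(rows) > preview,
    }


if __name__ == "__main__":
    rows = [{"id": i, "name": f"user{i}"} for i in range(10_000)]
    summary = summarize_rows(rows)
    print(summary["row_count"], summary["columns"], len(summary["preview"]))
```

The agent sees a few hundred characters instead of 10,000 rows; the full result would stay on the tool side and be fetched in pages only if a later step needs it.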

esac10mo ago

This is exactly right! I'm happy people are starting to care about compounding errors, we use the sigma terminology: https://www.silverstream.ai/blog-news/2sigma

Agents are digital manufacturing machines and benefit from the same processes we identified for reliability in the real world

snappr02110mo ago

The alternative is building Functional Intelligence process flows from the ground up on a foundation of established truth?

If 50% of training data is not factually accurate, this needs to be weeded out.

Some industries require a first principles approach, and there are optimal process flows that lead to accurate and predictable results. These need research and implementation by man and machine.

wrp10mo ago

TFA is a bit rambling and readers are getting distracted by specific claims, like the bit about 99.9%+ reliability. TFA's main point is that productive use of AI agents requires tightly specified context and frequent human intervention, which is what folks have been saying for a while.

hemantv10mo ago

LLMs are great reflections. One issue I have come across: too large a context confuses the LLM.

Second, since LLMs are non-deterministic in nature, how do you know if the quality went from 90% to 30%? There is no test you can write. What if the model provider degrades quality? You have no test for it.

rudderdev10mo ago

Same experience. I started in mid 2023 building agents. The agents that are still working in production do one specific thing, and the only coordination/integration I allow is via an automation framework based on deterministic APIs only.

a_bonobo10mo ago

>Enterprise systems aren't clean APIs waiting for AI agents to orchestrate them. They're legacy systems with quirks, partial failure modes, authentication flows that change without notice, rate limits that vary by time of day, and compliance requirements that don't fit neatly into prompt templates.

Perhaps that's why MCP as a protocol is so interesting to people - MCP servers are a chance at a 'blank slate' in front of the enterprise system. You pull out only the parts you're interested in, you get to define clear boundaries when you build the MCP server, the LLM sees only what you want it to see and you hide the messiness of the enterprise system.

eska10mo ago

It’s on the wrong layer for that. MCP is akin to putting GraphQL over an old and crufty SOAP interface. There’s some “intelligence” in the GraphQL layer, but it doesn’t fix flaws in the lower layer such as side effects that shouldn’t be there.

physicsguy10mo ago

That's not really much different to writing a REST-ful interface over the top, and sticking an OpenAPI compliant schema on the front.

dmezzetti10mo ago

It's clear that what we currently call AI is best suited for augmentation not automation. There are a lot of productivity gains available if you're willing to accept that.

andrekandre10mo ago

  > AI is best suited for augmentation not automation.

i agree with this sentiment, but with the caveat of "when it's not lying to you".

the most frustrating part of these interactive ai assistants is when it sends me down a rabbit hole of an api that doesn't exist (but looks almost right)

johndhi10mo ago

From what I understand customer support chatbots have had some pretty good outcomes from ai agents. Or does that not count?

nsypteras10mo ago

I think that would be one of the success cases described in the article because HITL is an integral part of good customer support chatbots. Support chats can be escalated to a human whenever the agent is unable to provide a satisfactory answer to the user.

yunyu10mo ago

"Your fancy AI scaffolds will be washed away by scale." - Noam Brown

arendtio10mo ago

The compounding error rate in long-running processes is just one side of the coin. You can also use models to catch errors, and those success rates compound as well. So, it's not like you have no options to fight against a giant failure rate monster...
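A rough way to see how this plays out numerically, as a Python sketch under loudly simplified assumptions: steps are independent, a checker catches a fixed fraction of failures, the checker never rejects a correct result, and each caught failure gets exactly one independent retry. The 95% per-step rate and 20 steps mirror the article's example; the 90% catch rate is a made-up number for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in an independent chain succeeds."""
    return per_step ** steps


def step_with_checker(per_step: float, catch_rate: float) -> float:
    """Per-step success when a checker flags failures and allows one retry.

    Assumes the checker never rejects a correct result and that the
    retry is independent of the first attempt (both are simplifications).
    """
    return per_step + (1 - per_step) * catch_rate * per_step


baseline = chain_success(0.95, 20)
with_checker = chain_success(step_with_checker(0.95, 0.90), 20)
print(f"no checker:   {baseline:.3f}")
print(f"with checker: {with_checker:.3f}")
```

Under these assumptions the checked chain succeeds far more often than the unchecked one, though real checkers have false positives and correlated failures that this sketch ignores.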

Apocryphon10mo ago

Building shovels during a fool’s gold rush, nice

swyx10mo ago

> The Mathematical Reality No One Talks About

literally everybody talks about this lmao what are you on about https://www.youtube.com/watch?v=d5EltXhbcfA

arisAlexis10mo ago

"forever"?

rco878610mo ago

I still don’t even know what an agent is. Everyone seems to have their own definition. And invariably it’s generic vagaries about architecture, responsibilities of the LLM, sub-agents, comparisons to workflows, etc.

But still not once have I seen an actual agent in the wild doing concrete work.

A “No True Agent” problem if you will.

iamjackg10mo ago

Technically speaking, Claude Code is an agent, for example. It's just a fancy term for an LLM that can call tools in a loop until it thinks it's done with whatever it was tasked to do.

ChatGPT's Deep Research mode is also an agent: it will keep crawling the web and refining things until it feels it has enough material to write a good response.
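To make "an LLM that can call tools in a loop" concrete, here is a minimal Python sketch. Everything in it is hypothetical: `fake_model` is a scripted stand-in for a real LLM call, the `CALL`/`DONE` text protocol is invented for this example, and real products such as Claude Code are far more elaborate.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking a string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


def fake_model(history: list[str]) -> str:
    """Stand-in for an LLM call: request a tool once, then finish."""
    if not any(line.startswith("TOOL_RESULT") for line in history):
        return "CALL word_count: agents call tools in a loop"
    return "DONE: the phrase has " + history[-1].split(": ", 1)[1] + " words"


def run_agent(task: str, model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Ask the model what to do, run the requested tool, repeat until DONE."""
    history = [f"TASK: {task}"]
    for _ in range(max_steps):  # hard cap so the loop always terminates
        reply = model(history)
        if reply.startswith("DONE"):
            return reply
        name, arg = reply.removeprefix("CALL ").split(": ", 1)
        history.append(f"TOOL_RESULT: {TOOLS[name](arg)}")
    return "DONE: step limit reached"


print(run_agent("count the words", fake_model))
```

The step cap is a design choice of this sketch: without it, a model that never decides it is done would loop forever.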

tomhow10mo ago

[stub for offtopicness]

Simon_O_Rourke10mo ago

Don't tell management about this, as they're all betting the house on AI agents next year.

pmg10110mo ago

Only one of these outcomes will be correct, so worth putting money on it if you think they're wrong a la The Big Short.

DavidPiper10mo ago

Not OP, but I've been thinking about this and concluded it's not quite so clear-cut. If I was going to go down this path, I think I would bet on competitors, rather than against incumbents.

My thinking: In a financial system collapse (a la The Big Short), the assets under analysis are themselves the things of value. Whereas betting on AI to collapse a technology business is at least one step removed from actual valuation, even assuming:

1. AI Agents do deliver just enough, and stay around long enough, for big corporations to lay off large numbers of employees

2. After doing so, AI quickly becomes prohibitively expensive for the business

3. The combination of the above factors tanks business productivity

In the event of a perfect black swan, the trouble is that it's not actually clear that this combination of factors would result in concrete valuation drops. The business just "doesn't ship as much" or "ships more slowly". This is bad, but it's only really bad if you have competitors that can genuinely capitalise on that stall.

An example immediately on-hand: for non-AI reasons, the latest rumors are that Apple's next round of MacBook Pros will be delayed. This sucks, but it isn't particularly damaging to the company's stock price because there isn't really a competitor in the market that can capitalise on that delay in a meaningful way.

Similarly, I couldn't really tell you what the most recent non-AI software features shipped by Netflix or Facebook or X actually were. How would I know if they're struggling internally and have stopped shipping features because AI is too expensive and all their devs were laid off?

I guess if you're looking for a severe black swan to bet against AI Agents in general, you'd need to find a company that was so entrenched and so completely committed to and dependent on AI that they could not financially survive a shock like that AND they're in a space where competitors will immediately seize advantage.

Don't get me wrong though, even if there's no opportunity to actually bet against that situation, it will still suck for literally everyone if it eventuates.

2 more replies

Quarrelsome10mo ago

Shorting only works if people realise it when you do. The C-suite will run out of makeup before admitting it's a pig because the payoff is huge for them. I reckon agentic dev can function "just enough" to allow them to delay the reality for a bit while they fire more of their engineering team.

I don't think this one is worth shorting because there's no specific event to trigger the mindshare to start moving and validating your position. You'd have to wait for very big public failures before the herd starts to move.

exe3410mo ago

Do you have suggestions on how one would go about doing this? Do you just approach a betting company and make some prediction against some wager?

ptero10mo ago

While true, the world doesn't end in 2025. While I would also agree that big financial benefits from agents to companies appear unlikely to arrive this year (and the title specifically mentions 2025) I would bet on agents becoming a disruptive technology in the next 5-10 years. My 2c.

1 more reply

immibis10mo ago

Shorting is rarely worth it without detailed information, because you also have to get the timing right. If you short AI now but it crashes in two years, chances are good that you lost a lot of money.

trentnix10mo ago

They're just following the herd.

paradite10mo ago

This is obviously AI generated, if that matters.

And I have an AI workflow that generates much better posts than this.

Retr0id10mo ago

I think it's just written by someone who reads a lot of LLM output - lots of lists with bolded prefixes. Maybe there was some AI-assistance (or a lot), but I didn't get the impression that it was AI-generated as a whole.

paradite10mo ago

"Hard truth" and "reality check" in the same post is dead giveaway.

I read and generate hundreds of posts every month. I have to read books on writing to keep myself sane and not sound like an AI.

2 more replies

delis-thumbs-7e10mo ago

I wonder why a person from Bombay India might use AI to aid with an English language blog post…

Perhaps more interesting is whether their argument is valid and whether their math is correct.

jrexilius10mo ago

The thing that sucks about it is maybe his English is bad (not his native language) so he relies on LLM output for his posts. I'm inclined to cut people slack for this. But the rub is that it is indistinguishable from spam/slop generated for marketing/ads/whatever.

Or it's possible that he is one of those people that _really_ adopted LLMs into _all_ their workflows, I guess, and he thinks the output is good enough as is, because it captured his general points?

LLMs have certainly damaged trust in general internet reading now, that's for sure.

paradite10mo ago

I am not pro or against AI-generated posts. I was just making an observation and testing my AI classifier.

fleebee10mo ago

The graphs don't line up. I'm inclined to believe they were hallucinated by an LLM and the author either didn't check them or didn't care.

Judging by the other comments this is clearly low-effort AI slop.

> LLMs have certainly damaged trust in general internet reading now, that's for sure.

I hate that this is what we have to deal with now.

1 more reply

kerkeslager10mo ago

Real question: what's the best way to short AI right now?

arealaccount10mo ago

Just short any of the publicly traded companies with AI based valuations? Nvidia, Meta? Seems like an awful idea but I'm often wrong.

kerkeslager10mo ago

Nvidia and Meta are both involved in a lot more than AI. There are maybe other reasons to short Meta, but either is definitely not a pure AI play.

stavros10mo ago

I mean, I wouldn't bet against AI, but I'm also not certain the current AI company valuations are realistic.

raincole10mo ago

> In a Nutshell

> AI tools aren't perfect yet. They sometimes make mistakes, and they can't always understand what you are trying to do. But they're getting better all the time, In the future, they will be more powerful and helpful. They'll be able to understand your code even better, and they'll be able to generate even more creative ideas.

From another post on the same site. [0]

Yup, slop.

[0]: https://utkarshkanwat.com/writing/review-of-coding-tools/

d4rkn0d3z10mo ago

"Let's do the math. "

This phrase is usually followed by some, you know...Math?

Gigachad10mo ago

The article is slop. That’s just a phrase ChatGPT uses a lot.

cmsefton10mo ago

2015? The title should be 2025.

RustyRussell10mo ago

2015? Title is correct, this is a typo

tomhow10mo ago

Sorry about that, my fault, moderating from my phone.

rvz10mo ago

Let's get a timer to watch this fall off the front page of HN in minutes.

"We can't allow this post to create FUD about the current hype on AI agents and we need the scam to continue as long as possible".

saadatq10mo ago

we need a flag button for “written by AI”.

I’m at this stage where I’m fine with AI generated content. Sure, the verbosity sucks, and there’s an interesting idea here, but make it clear that you’ve used AI, and show your prompts.

vntok10mo ago

Generally speaking, low quality posts don't spend too much time on the front page, regardless of their topic.

rvz10mo ago

... and it's gone. Stopped the timer on 2 hours and 38 mins.

roschdal10mo ago

AI is for people without natural intelligence.

satyrun10mo ago

Yea just average IQ like Terence Tao.

All you are really saying with this comment is you have an incredibly narrow set of interests and absolutely no intellectual curiosity.

bboygravity10mo ago

So it's for 90+ percent of society?

Sounds like good business to me.

block_dagger10mo ago

Downvotes are for comments like yours

digitcatphd10mo ago

I’m sure most of the problems cited in this article will be easily solved within the next five years or so. Waiting for perfection and doing nothing won’t pay dividends.

atomon10mo ago

Is the main point “let me mathematically prove that it’s impossible to do what I’ve already done 12 times this year?”

Yes, very long workflows with no checks in between will have high error rates. This is true of human workflows too (which also have <100% accuracy at each step). Workflows rarely have this many steps in practice and you can add review points to combat the problem (as evidenced by the author building 12 of these things and not running into this problem)



raxxorraxor10mo ago

That is for selling support. A drone for a consumer drone. Nothing more than a little more sophisticated advertising banner.

This is not being part of a defined workflow that requires structured output.

Arn_Thor10mo ago

Perhaps. But it’s telling that someone whose job is selling those kinds of services wasn’t aware of any personally.

nominallyfree10mo ago

"This tech works fine as long as you have a back up for when it frequently fails"

raxxorraxor10mo ago

Gen AI can only support people. In our case it scans incoming mails for patterns of order or article numbers, whether the customer is already known, etc.

That isn't reliable either, but it supports the person who gets the mail on his desk in the end.

We sometimes get handwritten service protocols, and the model we are using is very proficient in reading handwritten notes that you would have difficulty parsing yourself.

It works most of the time, but not often enough that AI could give autogenerated answers. For service quality reasons we don't want to impose any chatbot or AI on a customer.

alpha_squared10mo ago

lxgr10mo ago

Humans don't have this fixed split into "context" and "weights", at least not over non-trivial time spans.

For better or worse, everything we see and do ends up modifying our "weights", which is something current LLMs just architecturally can't do since the weights are read-only.

globular-toast10mo ago

All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.

HumblyTossed10mo ago

> All I hear from LLM people is "you're just not using it right" or "it's all in the prompt" etc. That's not natural language. That's no different from programming any computer system.

alpha_squared10mo ago

daveguy10mo ago

antisthenes10mo ago

> humans have a very large context window for problems they're specialized in solving

Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.

Can you provide some examples of problems where humans have such large context windows?

lelanthran10mo ago

> Do they? I certainly don't. I don't know if it's my memory deficiency, but I frequently hit my "context window" when solving problems of sufficient complexity.

Human context windows are not linear. They have "holes" in them which are quickly filled with extrapolation that is frequently correct.

> Can you provide some examples of problems where humans have such large context windows?

See above.

The reason is because humans don't just have a "context window", they have a working memory that is also their primary source of information.

IOW, if we change LLMs so that each query modifies the weights (i.e. each query is also another training data-point), then you wouldn't need a context window.

With humans, each new problem effectively retrains the weights to incorporate the new information. With current LLMs the architecture does not allow this.

gf00010mo ago

And all this was only the current task-specific window, which lives inside the sum total of your human experience window.

vntok10mo ago

KoolKat2310mo ago

Human multi-step workflows tend to have checkpoints where the work is validated before proceeding further, as humans generally aren't 99%+ accurate either.

a_bonobo10mo ago

taurath10mo ago

Except when it decides it doesn’t need to do that anymore or forgets

KoolKat2310mo ago

that's good to hear, they're on their way there!

on a personal note, I'm happy to hear that. I've been apprehensive and haven't tried it, purely due to my fear of the cost.

Filligree10mo ago

The standard way to use Claude Code is with a constant-cost subscription; one of their standard website accounts. It’s rate-limited but still generous.

You can also use API tokens, yes, but that’s 5-10x more expensive. So I wouldn’t.

3 more replies

queenkjuul10mo ago

My work has a corporate subscription and on the one hand it's very impressive and on the other i don't actually find it useful.

1 more reply

csomar10mo ago

Lots of applications have to be redesigned around that. My guess is that micro-services architecture will see a renaissance since it plays well with LLMs.

lxgr10mo ago

On the other hand, if LLMs are doing the actual service development, that's something software engineers could be doing :)

jvanderbot10mo ago

Same as it's always been.

For agents, that triangle is not very well quantified at the moment, which makes all these investigations interesting but still risky.

torginus10mo ago

stillsut10mo ago

One of the ideas I'm playing with is producing several AI-generated rough drafts of a commit at the outset, then filtering these, both manually and with some automation, down to candidates for manual refinement.

swader99910mo ago

Subscription?

jvanderbot10mo ago

I have one, and upgrades don't have unlimited access as far as I can tell. Correct me if I'm wrong.

This cost scaling will be an issue for this whole AI employee thing, especially because I imagine these providers are heavily discounting.

13zebras10mo ago

joshvm10mo ago

neom10mo ago

This blog post really makes the tooling part seem hard, and, well... it is, but not that hard - we'll see where this all goes, but I remain optimistic.

mritchie71210mo ago

> I've built 12+ production AI agent systems across development, DevOps, and data operations

It's hard to make *one* good product (see startup failure rates). You couldn't make 12 (as seemingly a solo dev?) and you're surprised?

we've been working on Definite[0] for 2 years with a small team and it only started getting really good in the past 6 months.

0 - data stack + AI agent: https://www.definite.app/

Rexxar10mo ago

mritchie71210mo ago

that's my point. He's "Betting Against AI Agents" without having taken a serious attempt at building one.

> agents that technically make successful API calls but can't actually accomplish complex workflows because they don't understand what happened.

It takes a long time to get these things right.

AstroBen10mo ago

They've built 12+ products with a full time job for the last 3 years

Something seems off about that...

senko10mo ago

His full time job is building AI systems for others (and the article is a well written promo piece).

RamblingCTO10mo ago

murukesh_s10mo ago

I am also building an agent framework and also used chat coding (not vibe coding) to generate work - I was easily able to save 50% of my time just by asking GPT.

RamblingCTO10mo ago

the truth is that we stop thinking when we code like that.

murukesh_s10mo ago

If done right we all code through spec written in English, not code.

1 more reply

la_fayette10mo ago

barbazoo10mo ago

RamblingCTO10mo ago

anon19192810mo ago

it would take months with old tech to create a bot that can check multiple websites for specific data or information? so LLM reduces the time a lot? am I wrong?

dlisboa10mo ago

1 more reply

stillsut10mo ago

Yes I agree: highly-focused-scope + low-stakes + high-chorelike-task is the sweet spot for agents currently.

I wrote a little about one such task, getting agents to supplement my markdown dev-log here: https://github.com/sutt/agro/blob/master/docs/case-studies/a...

lxgr10mo ago

Human validation is certainly the most reliable way of introducing checkpoints, but there are others: running unit tests, doing ad-hoc validations of the entire system, etc.

RamblingCTO10mo ago

that goes without saying. but I'd argue: HITL is more of a workflow pattern, the rest are engineering patterns

danieltanfh9510mo ago

Same. https://danieltan.weblog.lol/2025/06/agentic-ai-is-a-bubble-...

The fundamental difference is we need HITL to reduce errors instead of HOTL which leads to the errors you mentioned

Retr0id10mo ago

> Each new interaction requires processing ALL previous context

I was under the impression that some kind of caching mechanism existed to mitigate this

blackbear_10mo ago

_heimdall10mo ago

Caching would only help to keep the context around, but caching would only be needed if it still ultimately needs to read and process that cached context again.

Retr0id10mo ago

You can cache the whole inference state, no?

They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...

_heimdall10mo ago

2 more replies

Too10mo ago

ilaksh10mo ago

csomar10mo ago

My understanding is that caching reduces computation but the whole input is still processed. I don’t think they are fully disclosing how their cache works.

LLMs degrade with long input regardless of caching.

stpedgwdgfhgdd10mo ago

Compact the conversation (CC)

vntok10mo ago

> Production systems need 99.9%+ reliability

Business Process Automation via AI doesn't need to be perfect. It simply needs to be sufficiently better than the status quo to pay for itself.

navane10mo ago

It's not just about up time. If the bridge collapses people die. Some of us aren't selling ads.

vntok10mo ago

GeneralMayhem10mo ago

1 more reply

Pasorrijer10mo ago

I think you're crossing reliability and availability.

Reliability means 99.9% of the time when I hand something off to someone else it's what they want.

Availability means I'm at my desk and not at the coffee machine.

Humans very much are 99.9% accurate, and my deliverable even comes with a list of things I'm not confident about

stavros10mo ago

An interesting comment I read in another post here is that humans aren't even 99.9% accurate in breathing, as around 1 in 1000 breaths requires coughing or otherwise cleaning the airways.

seadan8310mo ago

I would say reliability is availability times accuracy.

(Your point remains largely the same, just more precise with the updated definition replacing 'reliable' with 'accurate'.)

vntok10mo ago

> Humans very much are 99.9% accurate

satyrun10mo ago

Yea exactly.

99.99% is just absurd.

seadan8310mo ago

I wouldn't take the claim to mean that humans universally have an attribute called "accuracy" that is uniformly set to the value 99.9%.

vrighter10mo ago

This comment makes the assumption that the software is cloud based and all that matters is uptime.

Now we only had about 10k customers at the time. Imagine if it were millions.

lexicality10mo ago

hansmayer10mo ago

lerchmo10mo ago

lukaslalinsky10mo ago

actinium22610mo ago

an0malous10mo ago

> Clearly we have some sort of goal-based self-correction mechanism.

I think there are still a few theoretical breakthroughs needed for LLMs to achieve AGI and one of them is "active learning" like this.

cosmic_cheese10mo ago

More specific to agents, humans can also figure out how to use tools on the fly (even in the absence of documentation) where LLMs need human-built MCPs. This is also a significant limiting factor.

tkz131210mo ago

1 more reply

Vetch10mo ago

airstrike10mo ago

100% and it seems like we need a whole new architecture to get there, because right now training a model takes so much time.

bot40310mo ago

Maybe you're on to something. We need AI lions which will eat the models which don't learn or adapt enough.

1 more reply

psadri10mo ago

You could instruct the LLM to formulate a “lesson” based on the error and add this to the tool instructions for future runs.

ch4s310mo ago

This isn’t practical at scale. You’ll run into too many novel lessons and burn through too many tokens setting up context.

1 more reply

bwfan12310mo ago

Even for software applications like the Linux kernel, there would have been a theory in Linus' head - for example of what an operating system is, and how it should work.

chubot10mo ago

I believe there was an article/paper in the last few months about that exact issue

Someone was saying that with an increasing number of attempts, or increasing context length, LLMs are less and less likely to solve a problem

(I searched for it but can't find it)

That matches my experience -- the corrections in long context can just as easily be anti-corrections, e.g. turning something that works into something that doesn't work

---

Actually it might have been this one, but there are probably multiple sources saying the same thing, because it's true:

Context Rot: How Increasing Input Tokens Impacts LLM Performance - https://news.ycombinator.com/item?id=44564248

---

As for this question: how do we manage to create things like the Linux kernel or the Mars landers without AI?

It's because human intelligence is a totally different thing than LLMs (contrary to what interested people will tell you)

That is, I would not expect AI to create anything like the Linux kernel. The burden of proof is on the people who claim that, not the other way around !!!

actinium22610mo ago

I saw your edit with the paper, but when you first mentioned it I thought you might have been referring to the Apple paper that more or less said the same thing.

corimaith10mo ago

Humans aren't 100% reliable but we can build tools that are 100% reliable to verify our predictions.

YeGoblynQueenne10mo ago

Hey, maybe humans aren't just like LLMs after all.

Dachande663OP10mo ago

OP here. I posted this this morning and then promptly forgot about it. How come the title has been changed from the blog post's own?

wrp10mo ago

That was annoying. Saw the post, then later had a hard time finding it again.

hoverbot10mo ago

nextworddev10mo ago

Actually, author should be bullish on autonomous agents considering 90% of what he's even able to do now wasn't even possible in early 2024, so you shouldn't bet against the slope of progress

infecto10mo ago

Link does not work for me but as someone who does a lot of work with LLMs I am also betting against agents.

globular-toast10mo ago

I think in general if everyone is talking about a solution and nobody is talking about problems then it's a sign we're in a bubble.

georgeplusplus10mo ago

satyrun10mo ago

The HN fallacy that is a large % of the posts on AI

"AI is not good for what I do, therefore AI is useless"

apwell2310mo ago

not quite sure what you are proposing here. what exactly is AI agent solving in this example?

I keep hearing vague stuff exactly like your comment at work from management. It's so infuriating.

1 more reply

A4ET8a8uTh0_v210mo ago

<< I also don’t understand why so many large companies have focused time around it. They are not going to be cracking the code ahead of a commercial tool or open source project.

I think it is a mix of fomo and the 'upside' potential of being able to minimize ( ideally remove ) the expensive "human component". Note, I am merely trying to portray a specific world model.

At the same time, committee bickers over minute change to a process that has effectively no impact on anything of value.

Bonkers.

dickersnoodle10mo ago

>I think it is a mix of fomo and the 'upside' potential of being able to minimize ( ideally remove ) the expensive "human component". Note, I am merely trying to portray a specific world model.

IOW, it's a case of C-suite "monkey see, monkey do" kicked off by management consultants with crap to sell for very high prices...

johnisgood10mo ago

I have no idea what agents are for, could be my own ignorance.

That said, I have been using LLMs for a while now with great benefit. I did not notice anything missing, and I am not sure what agents bring to the table. Do you know?

ivape10mo ago

johnisgood10mo ago

Exactly, thank you.

What I am doing is definitely manual, it is the old-fashioned prompt-copy-paste-test-repeat cycle, but it has been educational.

stavros10mo ago

I will join you in the fight against "agentic". Ridiculous.

mhog_hn10mo ago

An agent is an LLM + a tool call loop - it is quite a step up in terms of value in my experience

jsemrau10mo ago

Agents are more than that.

Agents, besides tool use, also have memory, can plan work towards a goal, and can, through an iterative process (Reflect - Act), validate if they are on the right track.

1 more reply

infecto10mo ago

Not a disagreement with you but wanted to further clarify.

1 more reply

johnisgood10mo ago

What is the use case? What does it solve exactly, or what practical value does it give you? I am not sure what a tool call loop is.

5 more replies

JKCalhoun10mo ago

Link is working for me — perhaps it was not 30 minutes ago? (Safari, MacOS)

wooque10mo ago

[flagged]

infecto10mo ago

exe3410mo ago

> “we’re building X to reduce customer support staff by 20%,”

1 more reply

figassis10mo ago

sfink10mo ago

apwell2310mo ago

Agreed with your annoyance at "they are replacing you" comments. Like, duh. That's what they've been doing forever.

oceanparkway10mo ago

afro8810mo ago

> Error rates compound exponentially in multi-step workflows. 95% reliability per step = 36% success over 20 steps. Production needs 99.9%+.

The math works out differently, depending on how well it can collect automated feedback that it is doing what you want.
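For reference, the quoted arithmetic can be checked with a short Python sketch. It assumes each step fails independently, and it reads "99.9%+" as an end-to-end target, which is an interpretation rather than something the article states; the function names are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success when every step must succeed independently."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit an end-to-end target."""
    return target ** (1 / steps)


end_to_end = chain_success(0.95, 20)
needed = required_per_step(0.999, 20)
print(f"95% per step over 20 steps: {end_to_end:.1%}")
print(f"per-step rate for 99.9% end to end: {needed:.5f}")
```

The first figure reproduces the roughly 36% in the quote; the second shows how close to perfect each step would have to be if nothing checks or corrects intermediate results.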

whazor10mo ago

For context, relevant information from steps can be cherrypicked to next stage.

The math works differently because AI (mostly) ignores irrelevant results. So steps actually increase reliability overall.

ankit21910mo ago

majormajor10mo ago

> Many tasks have easier verifications than doing the task.

In the software world (like the article is talking about) this is the logic that has ruthlessly cut software QA teams over the years. I think quality has declined as a result.

Verifiers are hard because the possible states of the internal system + of the external world multiply rapidly as you start going up the component chain towards external-facing interfaces.

---

ankit21910mo ago

For non software world, people use majority voting most of the time.

eska10mo ago

ankit21910mo ago

throwaway42334210mo ago

Is it reasonable to assume the five generations are independent?

ankit21910mo ago

jackblemming10mo ago

hannofcart10mo ago

(End quote)

Isn't this just wrong? Isn't the author conflating accuracy of LLM output in each step to accuracy of final artifact which is a reproducible deterministic piece of code?

Either I am missing something or this does not seem well thought through.

vrighter10mo ago

coliveira10mo ago

And you mention testing, which certainly can be done. But when you have a large product and the code generator is unreliable (which LLMs always are), then you have to spend most of your time testing.

hungryhobbit10mo ago

Did you even finish the article? The end is all about the trade-off of when "a person in the middle is going to intervene".

lmeyerov10mo ago

I used to believe the error rate fallacy, but:

1. Multi-turn agents can correct themselves with more steps, so the reductive error cascade thinking here is more wrong than right in my experience

Similar to infra as code, CI, and many other automation processes, there's mountains of work that isn't being done and LLMs can do entirely or large swathes of

vrighter10mo ago

thedudeabides510mo ago

See "Campbell's Completeness Conjecture" https://www.campbellramble.ai/p/dont-trust-machines

constantcrying10mo ago

No, it is not "mathematically impossible". It is empirically implausible. There is no statement in mathematics that says that agents can not have a 99.999% reliability rate.

I actually think the article makes some good points, but especially when you are making good points it is unnecessary to stretch credibility by exaggerating your arguments.

macleginn10mo ago

constantcrying10mo ago

It's about what you do with errors. If you let them compound they lead to destruction, if instead you inspect, maintain, reinspect, replace, etc. you can manage them.

My point was that something extremely complex, like a plane, works, because the system tries hard to prevent compounding errors.
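The arithmetic behind this disagreement can be sketched as follows. The numbers (95% per-step reliability, 20 steps, 3 attempts) are illustrative assumptions, and the retry model assumes independent errors plus a perfect verifier after each step, which is an idealization rather than a claim about any real system.

```python
def chain_success(p: float, steps: int) -> float:
    """Success probability when every step must succeed and errors compound."""
    return p**steps


def chain_success_with_retries(p: float, steps: int, attempts: int) -> float:
    """Same chain, but each step is inspected by an (assumed perfect) verifier
    and retried up to `attempts` times before the chain fails."""
    per_step = 1 - (1 - p) ** attempts
    return per_step**steps


unchecked = chain_success(0.95, 20)
checked = chain_success_with_retries(0.95, 20, attempts=3)
print(f"unchecked: {unchecked:.4f}, inspected and retried: {checked:.4f}")
```

Under these assumptions the unchecked chain succeeds roughly 36% of the time, while the inspected chain succeeds more than 99% of the time. How close a real verifier gets to "perfect" is exactly the open question raised elsewhere in the thread.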

sarchertech10mo ago

That works because each plane is (nearly) exactly the same as the one before it and we have exact specifications for the plane.

You can do maintenance, inspections, and replacement because of those specifications.

In software the equivalent of blueprints is code. The room for variation outside software “specifications” is infinite.

Human reliability when it comes to assembling planes is also much higher than 99%, and LLM reliability creating code is much, much lower than 99%.

john_minsk10mo ago

So until these techniques are baked into the model by OpenAI, you have to come up with these ideas yourself.

deadbabe10mo ago

I just want someone to give me one legit use case where an AI Agent now enables them to do something that couldn’t be done before, and actually makes an impact on overall profit.

stavros10mo ago

I can write code I wouldn't have been bothered to before, and make money from it.

Xmd5a10mo ago

>A database query might return 10,000 rows, but the agent only needs to know "query succeeded, 10k results, here are the first 5." Designing these abstractions is an art.

It seems the author never used prompt/workflow optimization techniques.

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow https://arxiv.org/pdf/2501.16673
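A minimal sketch of the abstraction described in the quoted passage, where a tool returns a compact summary to the agent instead of the full result set. `summarize_rows` and its field names are hypothetical and chosen only for illustration; they are not taken from the article.

```python
from typing import Any


def summarize_rows(rows: list[dict[str, Any]], preview: int = 5) -> dict[str, Any]:
    """Reduce a query result to what the agent needs: status, count, a small preview."""
    return {
        "status": "ok",
        "row_count": len(rows),
        "preview": rows[:preview],
    }


# Stand-in for a database query that returned 10,000 rows.
rows = [{"id": i} for i in range(10_000)]
summary = summarize_rows(rows)
print(summary["status"], summary["row_count"], summary["preview"])
```

Only `summary` would be placed in the model's context; the full `rows` list stays on the tool side.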

esac10mo ago

This is exactly right! I'm happy people are starting to care about compounding errors, we use the sigma terminology: https://www.silverstream.ai/blog-news/2sigma

Agents are digital manufacturing machines and benefit from the same processes we identified for reliability in the real world

snappr02110mo ago

The alternative is building Functional Intelligence process flows from the ground up on a foundation of established truth?

If 50% of training data is not factually accurate, this needs to be weeded out.

Some industries require a first principles approach, and there are optimal process flows that lead to accurate and predictable results. These need research and implementation by man and machine.

wrp10mo ago

hemantv10mo ago

LLMs are great reflections. One issue I have come across: too large a context confuses the LLM.

Second, since LLMs are non-deterministic in nature, how do you know if the quality went from 90% to 30%? There is no test you can write. What if the model provider degrades quality? You have no test for it.
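One partial mitigation, sketched here under stated assumptions, is a statistical check instead of a single deterministic assertion: sample the model many times over a fixed evaluation set and compare the pass rate with a recorded baseline. The stub model, the `check` function, and the 0.05 tolerance are all hypothetical; a real setup would call the actual model and would need a task-specific grader.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              check: Callable[[str, str], bool],
              prompts: list[str],
              samples_per_prompt: int = 5) -> float:
    """Fraction of sampled outputs that pass the check across the evaluation set."""
    total = passed = 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            total += 1
            passed += check(prompt, generate(prompt))
    return passed / total


def has_regressed(rate: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a regression when the pass rate drops below baseline minus tolerance."""
    return rate < baseline - tolerance


def make_stub_model(accuracy: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: correct with the given probability."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        return prompt.upper() if rng.random() < accuracy else "wrong"

    return generate


def check(prompt: str, output: str) -> bool:
    return output == prompt.upper()


prompts = [f"case {i}" for i in range(200)]
good = pass_rate(make_stub_model(0.9, seed=0), check, prompts)
bad = pass_rate(make_stub_model(0.3, seed=0), check, prompts)
print(good, has_regressed(good, baseline=0.9))
print(bad, has_regressed(bad, baseline=0.9))
```

This only detects drops that are large relative to sampling noise, and it does nothing for tasks where no automatic `check` exists, which is arguably the harder part of the commenter's concern.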

rudderdev10mo ago

a_bonobo10mo ago

eska10mo ago

physicsguy10mo ago

That's not really much different to writing a REST-ful interface over the top, and sticking an OpenAPI compliant schema on the front.

dmezzetti10mo ago

It's clear that what we currently call AI is best suited for augmentation not automation. There are a lot of productivity gains available if you're willing to accept that.

andrekandre10mo ago

  > AI is best suited for augmentation not automation.

i agree with this sentiment, but with the caveat of "when it's not lying to you".

the most frustrating part of these interactive ai assistants is when it sends me down a rabbit hole of an api that doesn't exist (but looks almost right)

johndhi10mo ago

From what I understand, customer support chatbots have had some pretty good outcomes from AI agents. Or does that not count?

nsypteras10mo ago

yunyu10mo ago

"Your fancy AI scaffolds will be washed away by scale." - Noam Brown

arendtio10mo ago

Apocryphon10mo ago

Building shovels during a fool’s gold rush, nice

swyx10mo ago

> The Mathematical Reality No One Talks About

literally everybody talks about this lmao what are you on about https://www.youtube.com/watch?v=d5EltXhbcfA

arisAlexis10mo ago

"forever"?

rco878610mo ago

But still not once have I seen an actual agent in the wild doing concrete work.

A “No True Agent” problem if you will.

iamjackg10mo ago

Technically speaking, Claude Code is an agent, for example. It's just a fancy term for an LLM that can call tools in a loop until it thinks it's done with whatever it was tasked to do.

ChatGPT's Deep Research mode is also an agent: it will keep crawling the web and refining things until it feels it has enough material to write a good response.
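The "LLM that calls tools in a loop until it thinks it's done" definition can be sketched in a few lines. Everything here is illustrative: `scripted_model` is a hypothetical stand-in for a real LLM call, and the action dictionary format is an assumption, not the protocol used by Claude Code or Deep Research.

```python
from typing import Any, Callable

Action = dict[str, Any]
History = list[tuple[str, Any]]


def run_agent(model: Callable[[History], Action],
              tools: dict[str, Callable[[Any], Any]],
              task: str,
              max_steps: int = 10) -> Any:
    """Ask the model for an action, run the named tool, feed the result back,
    and stop when the model returns a 'finish' action."""
    history: History = [("task", task)]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "finish":
            return action["answer"]
        result = tools[action["tool"]](action["input"])
        history.append((action["tool"], result))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_model(history: History) -> Action:
    """Hypothetical model: calls the 'add' tool once, then finishes."""
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "input": (2, 3)}
    return {"type": "finish", "answer": f"sum is {history[-1][1]}"}


tools = {"add": lambda args: args[0] + args[1]}
answer = run_agent(scripted_model, tools, "add 2 and 3")
print(answer)
```

The `max_steps` cap is a design choice in this sketch: without it, a model that never decides it is done would loop forever.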

tomhow10mo ago

[stub for offtopicness]

Simon_O_Rourke10mo ago

Don't tell management about this, as they're all betting the house on AI agents next year.

pmg10110mo ago

Only one of these outcomes will be correct, so worth putting money on it if you think they're wrong a la The Big Short.

DavidPiper10mo ago

Not OP, but I've been thinking about this and concluded it's not quite so clear-cut. If I was going to go down this path, I think I would bet on competitors, rather than against incumbents.

1. AI Agents do deliver just enough, and stay around long enough, for big corporations to lay off large numbers of employees

2. After doing so, AI quickly becomes prohibitively expensive for the business

3. The combination of the above factors tanks business productivity

Don't get me wrong though, even if there's no opportunity to actually bet against that situation, it will still suck for literally everyone if it eventuates.

Quarrelsome10mo ago

exe3410mo ago

Do you have suggestions on how one would go about doing this? Do you just approach a betting company and make some prediction against some wager?

ptero10mo ago

immibis10mo ago

trentnix10mo ago

They're just following the herd.

paradite10mo ago

This is obviously AI generated, if that matters.

And I have an AI workflow that generates much better posts than this.

Retr0id10mo ago

paradite10mo ago

"Hard truth" and "reality check" in the same post is a dead giveaway.

I read and generate hundreds of posts every month. I have to read books on writing to keep myself sane and not sound like an AI.

delis-thumbs-7e10mo ago

I wonder why a person from Bombay India might use AI to aid with an English language blog post…

Perhaps more interesting is whether their argument is valid and whether their math is correct.

jrexilius10mo ago

Or it's possible that he is one of those people who _really_ adopted LLMs into _all_ their workflows, I guess, and he thinks the output is good enough as is, because it captured his general points?

LLMs have certainly damaged trust in general internet reading now, that's for sure.

paradite10mo ago

I am not pro or against AI-generated posts. I was just making an observation and testing my AI classifier.

fleebee10mo ago

The graphs don't line up. I'm inclined to believe they were hallucinated by an LLM and the author either didn't check them or didn't care.

Judging by the other comments this is clearly low-effort AI slop.

> LLMs have certainly damaged trust in general internet reading now, that's for sure.

I hate that this is what we have to deal with now.

kerkeslager10mo ago

Real question: what's the best way to short AI right now?

arealaccount10mo ago

Just short any of the publicly traded companies with AI-based valuations? Nvidia, Meta? Seems like an awful idea but I'm often wrong.

kerkeslager10mo ago

Nvidia and Meta are both involved in a lot more than AI. There are maybe other reasons to short Meta, but neither is a pure AI play.

stavros10mo ago

I mean, I wouldn't bet against AI, but I'm also not certain the current AI company valuations are realistic.

raincole10mo ago

> In a Nutshell

From another post on the same site. [0]

Yup, slop.

[0]: https://utkarshkanwat.com/writing/review-of-coding-tools/

d4rkn0d3z10mo ago

"Let's do the math. "

This phrase is usually followed by some, you know...Math?

Gigachad10mo ago

The article is slop. That’s just a phrase ChatGPT uses a lot.

cmsefton10mo ago

2015? The title should be 2025.

RustyRussell10mo ago

2015? Title is correct, this is a typo

tomhow10mo ago

Sorry about that, my fault, moderating from my phone.

rvz10mo ago

Let's get a timer to watch this fall off the front page of HN in minutes.

"We can't allow this post to create FUD about the current hype on AI agents and we need the scam to continue as long as possible".

saadatq10mo ago

we need a flag button for “written by AI”.

I’m at the stage where I’m fine with AI-generated content. Sure, the verbosity sucks, but there’s an interesting idea here. Just make it clear that you’ve used AI, and show your prompts.

vntok10mo ago

Generally speaking, low quality posts don't spend too much time on the front page, regardless of their topic.

rvz10mo ago

... and it's gone. Stopped the timer on 2 hours and 38 mins.

roschdal10mo ago

AI is for people without natural intelligence.

satyrun10mo ago

Yea just average IQ like Terence Tao.

All you are really saying with this comment is you have an incredibly narrow set of interests and absolutely no intellectual curiosity.

bboygravity10mo ago

So it's for 90+ percent of society?

Sounds like good business to me.

block_dagger10mo ago

Downvotes are for comments like yours

digitcatphd10mo ago

I’m sure most of the problems cited in this article will be easily solved within the next five years or so. Waiting for perfection and doing nothing won’t pay dividends.

atomon10mo ago

Is the main point “let me mathematically prove that it’s impossible to do what I’ve already done 12 times this year?”
