>We weren’t there to rehash prompt engineering tips.
>We talked about context engineering, inference stack design, and what it takes to scale agentic systems inside enterprise environments. If “prompting” is the tip of the iceberg, this panel dove into the cold, complex mass underneath: context selection, semantic layers, memory orchestration, governance, and multi-model routing.
I bet those four people love that the moderator took a couple notes and then asked ChatGPT to write a blog post.
As always, the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.
Why can’t anyone be bothered anymore to write actual content, especially when writing about AI, where your whole audience is probably already exposed to these patterns in content day in, day out?
It comes off as so cheap.
The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.
> One panelist shared a personal story that crystallized the challenge: his wife refuses to let him use Tesla’s autopilot. Why? Not because it doesn’t work, but because she doesn’t trust it.
> Trust isn’t about raw capability, it’s about consistent, explainable, auditable behavior.
> One panelist described asking ChatGPT for family movie recommendations, only to have it respond with suggestions tailored to his children by name, Claire and Brandon. His reaction? “I don’t like this answer. Why do you know my son and my girl so much? Don’t touch my privacy.”
The way I see it is that the majority of people never bothered to write actual content. Now there’s a tool the non-writers can use to write dubious content.
I would wager this tool is being used much differently by actual writers focused on producing quality. There’s just way less of them, same way there is less of any specialization.
The real question with AI to me is whether it will remain consistently better when wielded by a specialist who has invested their time into whatever the thing is they are producing. If that ever changes then we are doomed. When it’s no longer slop…
"There’s a missing primitive here: a secure, portable memory layer that works across apps, usable by the user, not locked inside the provider. No one’s nailed it yet. One panelist said if he weren’t building his current startup, this would be his next one."
This isn't true. I've been using Gemini 2.5 a lot recently and I can't get it to stop adding links!
I added custom instructions: Do not include links in your output. At the start of every reply say "I have not added any links as requested".
It works for the first couple of responses but then it's back to loads of links again.
End users (at my company) - Can your AI system look at numbers and find differences and generate a text description?
Pre-sales - (trying to clarify) For our systems to generate text it will be better if you give it some live examples so that it understands what text to generate.
End users - But there is supporting data (metadata) around the numbers. Can't your AI system just generate text?
Pre-Sales - It can, but you need to provide context and examples. Otherwise it is going to generate generic text like "there is x difference".
End user - You mean I need to write comments manually first? That is too much work.
Now these users have a call with another product - MS Copilot.
Anyone who's been involved in data science roles in corporate environments knows that "the data" is usually forced into an exec's pre-existing understanding of a phenomenon. With AI, execs are really excited at "cutting out the middlemen" when the middlemen in the equation are very often their own paid employees. That's all fine and dandy in an abstract economic view, but it's sure something they won't say publicly (at least most won't).
In terms of potential cost cutting, it probably is the most recent "new magic". You used to have to pay a consultant, now you can "ask AI".
This is really the truth of all things in life.
This is how it is being marketed and I guess people are silly enough to believe marketing so it's not too surprising
That's because it is marketed as magic. It's marketed as magic so people will adopt the thing before knowing its shortcomings.
Text-to-SQL is the funniest example. It seems to be the "hello world" of agentic use in enterprise environments. It looks so easy, so clear, so straightforward. But just because the concept is easy to grasp (LLMs are great at generating markup or code, so let's have them translate natural language to SQL) doesn't mean it is easy to get right.
I have spent the past 3 months building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries. And boy oh boy is that rabbit hole deep.
60% of the time I spend writing sql is probably validation. A single hallucinated assumption can blow the whole query. And there are questions that don’t have clear modelling approaches that you have to deal with.
Plus, a lot of the sql training data in LLMs is pretty bad, so I’ve not been impressed yet. Certainly not to let business users run an AI query agent unchecked.
I’m sure AI will get good at this, so I’m building up my warehouse knowledge base and putting together documentation as best I can. It’s just pretty awful today.
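One cheap layer in that rabbit hole: let the database engine itself reject hallucinated tables and columns before anything executes. A minimal sketch using Python's stdlib `sqlite3` (the schema and queries are made-up examples; this catches reference and syntax errors only, not semantic ones):

```python
import sqlite3

# The real warehouse schema would be mirrored here; this table is a toy example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, placed_at TEXT)")

def validate_sql(sql):
    """Ask the engine to *plan* the query without running it.

    Returns None if the query compiles against the schema,
    otherwise the engine's error message (e.g. a hallucinated column).
    """
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)
        return None
    except sqlite3.Error as exc:
        return str(exc)
```

So `validate_sql("SELECT SUM(amount) FROM orders")` passes, while a query referencing a hallucinated `revenue` column is rejected before it ever touches data. It's only the first of several validation layers you end up needing, but it's nearly free.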
A user having to come up with novel queries all the time to warrant text-to-SQL is a failure of product design.
People got good results on the test datasets, but the test datasets had errors so the high performance was actually just the models being overfitted.
I don't remember where this was identified; it's really recent, though before GPT-5.
Wait but this just sounds unhinged, why oh why
People don't know exactly what they want from the data warehouse, just a fuzzy approximation of it. You need stochastic software (AI) to map the imprecise instructions from your users to precise instructions the warehouse can handle.
For example, the number comes from perceived successes and failures, not actual measurements. The customer conclusions are along the same lines: "it doesn't improve" or "it doesn't remember". That's literally buying into the hype of recursive self-improvement, completely oblivious to the fact that API consumers don't control model weights and so can't do much self-improvement beyond writing more CRUD layers. The other complaints are about integrations, which are totally valid. But some industries still run Windows XYZ without any API platforms, so that isn't going away in those cases.

Point being, if the paper itself is not good discourse, just well-marketed punditry, why should we debate the 5% number? It makes no sense.
I use LLMs every day of my life to make myself highly productive. But I do not use LLM tools to replace my decision trees.
If we had tech support for a toaster, you might see:
    if toaster toasts the bread:
        if no: has turning it off and on again worked?
            if yes: great! you found a solution
            if no: hmm, try ...
        if yes:
            is the bread burnt after?
                if no: sounds like your toaster is fine!
                if yes: have you tried adjusting the darkness knob?
                    if no: ship it in for repair
                    if yes: try replacing the timer. does that help?
                        if no: ship it in for repair
                        if yes: yay, your toaster is fixed

Without context, even the brightest people will not be able to fill in the gaps in your requirements. Context is not just nice-to-have, it's a necessity when dealing with both humans and machines.
I suspect that people who are good engineering managers will also be good at 'vibe coding'.
I have observed that those who have both technical and management experience seem to be more adept (or perhaps willing?) to use LLMs in the daily life to good effect.
Of course what really helps, like in all things, is conscientiousness and an obsession for working through problems (if people don't like obsession then tenacity and diligence).
Here’s the reality check: One panelist mentioned that 95% of AI agent deployments fail in production. Not because the models aren’t smart enough, but because the scaffolding around them, context engineering, security, memory design, isn’t there yet.
Could be a reasonable definition of "understanding the problem to solve." In other words, everything identified as what "the scaffolding" needs is what qualified people provide when delivering solutions to problems people want solved.
If I implement myself a strict parser and an output post-processor to guard against hallucinations, I have done 100% of the business related logic. I can skip the LLM in the middle altogether.
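To make that concrete: if the guard around the LLM already maps user input to a closed set of validated intents, the model in the middle adds nothing. A toy sketch (the intent names and patterns are hypothetical):

```python
import re

# A "strict parser" guarding LLM output ends up being a closed whitelist
# of intents -- at which point it can parse the user's request directly.
INTENT_PATTERNS = {
    "quarter_revenue": re.compile(r"\brevenue\b.*\bQ([1-4])\b", re.IGNORECASE),
}

def parse_request(text):
    """Map free text to a (intent, argument) pair, or None if unrecognized."""
    match = INTENT_PATTERNS["quarter_revenue"].search(text)
    if match:
        return ("quarter_revenue", int(match.group(1)))
    return None
```

`parse_request("show revenue for Q3")` resolves to `("quarter_revenue", 3)` with no model in the loop, which is exactly the parent's point.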
Well said and I could not agree more.
You might even be able to put a UI on it that is a lot more effective than asking the user to type text into a box.
might as well just write the ai agent part of the software yourself as well.
So...
The bot, to its credit, returns some decent results. But my guess is that it will be quite a while before we see it in prod since a lot of these projects go from 0 - 80% in a week and 80% - deployable in several years.
And now we have an entire panel of bullshitters with an article-long theory about how to make LLMs program actually for real this time.
(Oh, and it would be great if journalists actually cited their public sources, instead of pretending they link to the article but actually linking to their review of related content.)
If you said it’s something you made for perusal and reading? Then it reads like AI.
I’ve had to read tons of papers and articles, the most testing being conference submissions. I won’t read something with that structure unless I have to.
So you scaffold this up in 30 seconds but want me to read through it carefully? Cool, thanks.
Okay, how would that work though? Verified by who and calculated by what?
I need deets.
So if you have a "CalculateQuarterRevenue(year, quarter)" function, you'll soon find your users asking for the data per-month. Or just for the last six weeks. Or just for a specific client. And they'll be confused when it doesn't work.
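A toy sketch of that drift (the data and function bodies are made up): the canned tool covers one fixed slice, and every new ask forces the signature to grow:

```python
from datetime import date

# Hypothetical sales ledger: (date, client, amount).
SALES = [
    (date(2024, 1, 15), "acme", 100.0),
    (date(2024, 2, 10), "globex", 250.0),
    (date(2024, 4, 3), "acme", 75.0),
]

def calculate_quarter_revenue(year, quarter):
    """The canned tool exposed to the agent: one fixed slice of the data."""
    return sum(amt for d, _, amt in SALES
               if d.year == year and (d.month - 1) // 3 + 1 == quarter)

def calculate_revenue(start, end, client=None):
    """What users actually ask for: arbitrary ranges and filters.
    Each new request ("last six weeks", "just this client") forces
    another parameter onto the function, until it's a query language."""
    return sum(amt for d, c, amt in SALES
               if start <= d <= end and (client is None or c == client))
```

The endpoint of that parameter creep is just SQL with extra steps.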
edit: I'm serious. I'm just answering the question, not making a value judgement.
Verbal queries is the solution for the world we have even if it's not optimal.
On the other side, you have an SQL that calculates the revenue
Compare the two. If the two disagree, get the AI to try again. If the AI is still wrong after 10 tries, just use the SQL output.
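That loop is basically the following (toy sketch; `generate_sql_with_llm` and `run_query` are hypothetical stand-ins, passed in rather than invented as a real API):

```python
def answer(question, trusted_sql, run_query, generate_sql_with_llm,
           max_tries=10):
    """Accept the AI's result only when it matches the deterministic baseline."""
    baseline = run_query(trusted_sql)            # the hand-written SQL
    for _ in range(max_tries):
        candidate = run_query(generate_sql_with_llm(question))
        if candidate == baseline:                # agreement => accept
            return candidate
    return baseline                              # give up, trust the SQL
```

Of course, if the trusted SQL is always the tiebreaker, the LLM is decorative, which is rather the joke.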
What I hear is a billion dollar AI startup in the making!
It's a big pet peeve of mine when an author states an opinion, with no evidence, as some kind of axiom. I think there is plenty of evidence that "the models aren't smart enough". Or to put it more accurately, it's an incredibly difficult problem to get a big productivity gain when an automated system is blatantly wrong ~1% of the time but when those wrong answers are inherently designed to look like right answers as much as possible.
Conversational UIs are controversial but I think there are a good number of websites where a better search could be more centric. Not generating text, but surfacing the most relevant text.
I’m thinking of a lot of library documentation, government info websites, etc. Basically an improvement over deep hierarchical navigation, where their way of organizing info is a leaky abstraction.
Maybe that will be one of the side effects of this AI boom. Who knows.
```
The teams that succeed don’t just throw SQL schemas at the model. They build:
- Business glossaries and term mappings
- Query templates with constraints
- Validation layers that catch semantic errors before execution
```
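For what it's worth, a business glossary of the kind they describe can start as something very small (toy sketch; the terms and warehouse definitions are made up):

```python
# Map the words users say to the expressions the warehouse actually has,
# so the definitions can be injected into the model's context up front.
GLOSSARY = {
    "revenue": "SUM(orders.amount)",
    "active customer": "customers.last_order_at >= DATE('now', '-90 days')",
}

def expand_terms(question):
    """Return the warehouse definition for every glossary term mentioned."""
    return {term: definition for term, definition in GLOSSARY.items()
            if term in question.lower()}
```

The hard part isn't the lookup, it's getting the business to agree on the definitions in the first place.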
Unfortunately, the mixing of fluffy tone and high level ideas is bound to be detested by hands on practitioners.
https://ai.meta.com/research/publications/cwm-an-open-weight...