Yes, there seems to be lots of potential. Yes, we can brainstorm things that should work. Yes, there are a lot of examples of incredible things in isolation. But it's a bit like those YouTube videos showing amazing basketball shots made in one try, when in reality lots of failed attempts happened beforehand. Except our users experience the failed attempts (LLM replies that are wrong, even when backed by RAG), and it's incredibly hard to hide those from them.
Show me the things you or your team have actually built that have decent retention and metrics concretely proving efficiency improvements.
LLMs are so hit and miss from query to query that if your users don't have a sixth sense for a miss vs a hit, there may not be any efficiency improvement. It's a really hard problem with LLM based tools.
There is so much hype right now and people showing cherry picked examples.
This has been my team's experience (and frustration) as well, and has led us to look at using LLMs for classifying / structuring, but not entrusting an LLM with making a decision based on things like a database schema or business logic.
I think the technology and tooling will get there, but the enormous amount of effort spent trying to get the system to "do the right thing" and the nondeterministic nature have really put us into a camp of "let's only allow the LLM to do things we know it is rock-solid at."
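That split, the LLM only classifying and structuring while deterministic code makes the actual decision, can be sketched roughly like this. Everything here is hypothetical: `llm_extract` is a stub standing in for a real model call, and the refund rules are made up for illustration.

```python
import json

def llm_extract(ticket_text: str) -> str:
    """Hypothetical LLM call: the model is only asked to classify and
    structure the ticket, returning JSON. Stubbed here for illustration."""
    return json.dumps({"category": "refund", "amount": 40.0})

def decide(ticket_text: str) -> str:
    """Deterministic business logic makes the decision; the LLM output
    is treated as untrusted input and validated before use."""
    try:
        fields = json.loads(llm_extract(ticket_text))
    except json.JSONDecodeError:
        return "escalate"  # malformed LLM output goes to a human
    if fields.get("category") == "refund" and isinstance(fields.get("amount"), (int, float)):
        return "auto_refund" if fields["amount"] <= 50 else "escalate"
    return "escalate"  # anything the rules don't cover goes to a human
```

The point of the shape is that a hallucinated category or amount can only ever route a ticket to a human, never trigger an action on its own.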
Even this is insanely hard, in my opinion. The one thing you would assume an LLM would excel at is spelling and grammar checking for the English language, but even the top model (GPT-4o) can be insanely stupid/unpredictable at times. Take the following example from my tool:
https://app.gitsense.com/?doc=6c9bada92&model=GPT-4o&samples...
Five models are asked whether the sentence is correct, and GPT-4o got it wrong all 5 times. It keeps complaining that GitHub is spelled "Github", when it isn't. Note: only two weeks ago, Claude 3.5 Sonnet did the same thing.
I do believe LLMs are a game changer, but I'm not convinced they are designed to be public-facing. I see the LLM as a power tool for domain experts: you have to assume whatever it spits out may be wrong, and your process should allow for that.
Edit:
I should add that I'm convinced no single model will rule them all. I believe there will be 4 or 5 models that everybody will use, and each will be used to challenge the others for accuracy and confidence.
While LLMs do plenty of awful things, people make incredibly stupid mistakes too, and that is what LLMs need to be benchmarked against. The problem is that most of the people evaluating LLMs are better educated than most and often smarter than most. When you see any quantity of prompts input by a representative sample of LLM users, you quickly lose all faith in humanity.
I'm not saying LLMs are good enough. They're not. But we will increasingly find that there are large niches where LLMs are horrible and error prone yet still outperform the people companies are prepared to pay to do the task.
In other words, on one hand you'll have domain experts becoming expert LLM-wranglers. On the other hand, you'll have public-facing LLMs eating away at tasks done by low-paid labor, where people can work around their stupid mistakes with process or just accept the risk, the same as they currently do with undertrained labor.
this gets to the heart of it for me. I think LLMs are an incredible tool, providing advanced augmentation on top of our already developed search capabilities. What advanced user doesn't want a colleague they can talk with about their specific domain?
The problem comes from the hyperscaling ambitions of the players who were first in this space. They quickly hyped up the technology beyond what it should have been.
- a different result is produced every time.
- no reasoning capabilities have been categorically demonstrated.
So that's it. If you want an LLM, brace for different results, and if that is okay for your application (say it's about speech or non-critical commands), then off you go.
Otherwise, simply forget this approach, particularly when you need reproducible, discrete results.
I don't think it gets any better than that, and nothing so far has indicated it will (with this particular approach to AGI, or whatever the wet dream is).
I have had good luck using an LLM as a "sanity checking" layer for transcription output, though. A simple prompt like "is this paragraph coherent" has proven to be a pretty decent way to check the accuracy of whisper transcriptions.
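A minimal sketch of that kind of sanity-check layer. The `ask_llm` callable and the exact prompt wording are assumptions; the stub below stands in for a real model API call so the flow is visible end to end.

```python
def is_coherent(paragraph: str, ask_llm) -> bool:
    """ask_llm is any callable that sends a prompt to a model and
    returns its text reply. The transcription is kept either way;
    this only flags paragraphs worth a second look."""
    prompt = f'Answer only "yes" or "no": is this paragraph coherent?\n\n{paragraph}'
    reply = ask_llm(prompt).strip().lower()
    return reply.startswith("yes")

# Stub model for illustration; a real deployment would call an API here.
stub = lambda prompt: "yes" if "noon" in prompt else "no"
transcripts = ["the meeting is at noon", "garble fnord xx qq"]
flagged = [p for p in transcripts if not is_coherent(p, stub)]
```

Because the check is binary and cheap, a wrong "yes" or "no" only affects which paragraphs get human review, not the transcription itself, which is what makes this a comfortable place to use an LLM.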
I think that, too, is a UX problem.
If you present the output as you do, as simple text on a screen, the average user will read it with the voice of an infallible Star Trek computer and be irritated by every mistake.
But if you present the same thing as a bunch of cartoon characters talking to each other, users might not only be fine with "egg in your face moments", as you put it, they will laugh about them.
The key is to move the user away from the idealistic mental model of what a computer is and does.
Leaving aside that "we're" and "we are" are the same, it is absolutely active voice.
I feel like this is unfair. That's the only thing it got wrong? But we want it to pass all of our evals, even ones that perhaps a dictionary would be better at solving? Or even an LLM augmented with a dictionary.
I see these statements here often, "I've never seen an effective commercial use of LLMs," which tells me you aren't working with very creative and competent people in areas that are amenable to LLMs. In my professional network, beyond where I work now, I know at least a dozen people who have successful commercial applications of LLMs. They tend to be highly capable people, able to build the end-to-end tool chains necessary (which is a huge gap) and to compose LLMs into hierarchical agents with effective guard rails.

Most ineffectual users of LLMs want them to be lazy buttons that obviate the need to think. They're not; like any sufficiently powerful tool, they require thought up front and are easy to use wrong. This will get better with time as patterns and tools emerge to get the most use out of them in a commercial setting.

However, the ability to process natural language and use an emergent (if not actual) abductive reasoning is absurdly powerful and was not practically possible four years ago. The assertion that such an amazing capability in an information or decisioning system is not commercially practical is on its face absurd.
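One common guard-rail pattern behind setups like that is wrapping every model call in a validator, retrying a bounded number of times, and surfacing failure explicitly instead of passing a bad answer downstream. A minimal sketch, where `ask_llm`, the validator, and the retry count are all assumptions:

```python
def guarded_call(ask_llm, prompt: str, validate, retries: int = 2):
    """Wrap a model call (ask_llm: prompt -> str) with a validator.
    Retries on invalid output; returns None rather than a bad answer,
    so the caller can escalate to a human or a stricter model."""
    for _ in range(retries + 1):
        reply = ask_llm(prompt)
        if validate(reply):
            return reply
    return None
```

In a hierarchy of agents, each layer's `validate` can be as strict as a JSON-schema check or as simple as `str.isdigit`; the key design choice is that invalid output can never leak past the guard.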
>which tells me you aren’t working with very creative and competent people
> In my professional network beyond where I work now I know at least a dozen people who have successful commercial applications of LLMs.
Apps that use LLMs or apps made with LLMs? In either case can you share them?
No one doubts that you can integrate LLMs into an application workflow and get some benefits in certain cases. That has been what the excitement and promise were about all along. They have a demonstrated ability to wrangle, extract, and transform data (mostly correctly) and to generate patterns from data and prompts (hit and miss, usually with a lot of human involvement). All of which can be powerful. But outside of textual or visual chatbots and CRUD apps, no one wants to "put up or shut up" with a solid example that the top management of an existing company would sign off on. Only stories about awesome examples they and their friends are working on, which often turn out to be CRUD apps or textual or visual chatbots. One notable standout: generative image apps can be quite good in certain circumstances.
So, since you seem to have a real interest and actual examples of this, I am curious to see some that real companies would gamble the company on. And I don't mean some quixotic startup; I mean a company making real money now, with customers, that is confident enough in that app to be willing to risk big on it. Because that last part is what companies do with their other (non-LLM) apps. I also know that people aren't perfect and wouldn't expect an LLM to be; I just want to make sure I am not missing something.
Could you elaborate? Is this related to the "teams of specialized LLMs" concept I saw last year when Auto-GPT was getting a lot of hype?
at the end of the day though, it's not exactly reliable or particularly transformative when you get past the party tricks
In education at least, we've improved efficiency by ~25% across a large swath of educators (direct time saved) with agentic evaluators, tutors, and doubt clarifiers. The wins in this industry are clear, and that is that much more time to spend with students.
I also know from 1:1 conversations with peers in the large-finance world that the efficiency improvements there, on multiple fronts, are similar.