If I read through the reports and summaries it generates, they seem at first glance correct - the jargon is used correctly, and physical phenomena are referred to mostly accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents while not actually being causally linked or otherwise fundamentally connected. On top of some strange leading sentences and arguments, this often creates entirely inappropriate topic headings/sections connecting things that really shouldn't be together.
One small example, of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with some very specific prompts and multiple turns. And it keeps happening for topics in the fields I work in, in the physical sciences and engineering. I'm not sure one could RL hard enough to correct this sort of thing (and it is likely not worth the time and money), but perhaps my imagination is limited.
They fail to understand that other engineering fields' documentation and processes are awful. Not that computer science is good - they are even less rigorous.
The difference is that other fields don't log every single change they make into source control and don't have millions of open source projects to pull from. There aren't billions of books on engineering to pull from the way there are with language. The information is siloed, and those with the keys now know what it's worth.
How do you find they compare?
- mixes up pronouns (who is "you" or "he")
- cannot keep track of what is where
- continuously plugs its guidance slant ("let's cook dinner, Bob! It is paramount to strive for safety and cooperation while doing it!")
- language style is all over the place, comically so
- when asked about the text it just generated, it can give valid critiques of itself (i.e. having that "insight" does not help the generation)
Journalists may have shallow understanding of topic, but they do not start referring to a person they write about as "me" halfway through.
The LLM is uniformly dumb.
If a calculator works great only 99% of the time, you could not use that calculator to build a bridge.
Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are also a lot of questions about whether this technology will ultimately deliver the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.
We know for certain that certified lawyers have committed malpractice by using ChatGPT, in part because the made-up citations are relatively easy to spot. Malpractice by engineers might take a little more time to discover.
If the calculator has a little gremlin in it that rolls a random 100-sided die and gives you the wrong answer every time it rolls a 1, then you certainly can use it to build a bridge. You just need to do each calculation, say, 10 or 20 times and take the majority answer :)
If the gremlin is clever, it might remember the wrong answers it gave you, and then it might give them to you again if you ask about the same numbers. In that case you might need to buy 10 or 20 calculators that all have different gremlins in them, but otherwise the process is the same.
Of course if all your gremlins consistently lie for certain inputs, you might need to do a lot of work to sample all over your input space and see exactly what sorts of numbers they don't like. Then you can breed a new generation of gremlins that...
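For what it's worth, the gremlin scheme is just majority voting over independent noisy runs. A minimal sketch (the gremlin calculator here is a made-up stand-in, obviously):

    // A calculator whose gremlin rolls a d100 and lies on a 1.
    function gremlinAdd(a: number, b: number): number {
      return Math.random() < 0.01 ? a + b + 1 : a + b;
    }

    // Ask several independent gremlins and take the majority answer.
    function majorityAdd(a: number, b: number, trials = 15): number {
      const votes = new Map<number, number>();
      for (let i = 0; i < trials; i++) {
        const answer = gremlinAdd(a, b);
        votes.set(answer, (votes.get(answer) ?? 0) + 1);
      }
      // With a 1% error rate per run, the chance of a wrong majority
      // across 15 independent runs is astronomically small.
      return [...votes.entries()].sort((x, y) => y[1] - x[1])[0][0];
    }

As noted above, this only works if the errors are independent; correlated gremlins break the scheme.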
My boss built an AI workflow that cost over $600 to do the same thing as one I'd already given him that cost less than $30. He just wanted to use tools he found and do it his way. Now, this had some value: it got more people in the company exposed to AI, and he learned from the experience. It's his prerogative, as he's the owner of the company. He also isn't concerned about the cost and will continue to pay much more. For now. I think as time goes on this will be scrutinized more.
The solution is to play to its strengths and reinforce it with other mediums. You don't build structures with pure concrete; you add rebar. You don't build ships out of sail alone, and you don't build rail with just iron. You compose materials in a way that makes sense.
LLMs are most useful when the output is immediately verifiable. So let's build frameworks that put that at the core. Build everything around verification, and use LLMs for their strengths.
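A minimal sketch of that shape, with the model call stubbed out (llmGenerate and the verifier are hypothetical placeholders, not any particular API):

    // Stand-in for whatever model client you actually use.
    declare function llmGenerate(prompt: string): Promise<string>;

    // The LLM proposes; a cheap deterministic check disposes.
    async function generateVerified(
      prompt: string,
      verify: (candidate: string) => boolean, // e.g. parses, compiles, passes tests
      maxAttempts = 3,
    ): Promise<string> {
      for (let i = 0; i < maxAttempts; i++) {
        const candidate = await llmGenerate(prompt);
        if (verify(candidate)) return candidate; // keep only what the verifier accepts
      }
      throw new Error("nothing passed verification; escalate to a human");
    }

The point is that the LLM output never reaches the user without passing the deterministic check.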
That's happened before with far higher correctness rate than 99%, and it cost Intel $500M. Reliability and accuracy matter. https://en.wikipedia.org/wiki/Pentium_FDIV_bug
You just need to build your products in a manner where the user has the ability to easily double check the results whenever they like. Then they can audit as they see fit, in order to get used to the accuracy level and to apply additional scrutiny to cases that are very important to their business.
If the user is able to so easily verify that the results are accurate, that means they are able to generate accurate results through other means, which means they don't need the LLM in the first place.
But if the alternative is doing calculations by hand (writing code manually) there is a higher chance of making mistakes.
Just as calculations are double-checked while building bridges, unit tests and code reviews should catch bugs introduced by LLM-written code.
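For instance, even a few plain assertions around an LLM-written helper (this clamp function is just a made-up example) are cheap to write and tend to catch the edge-case bugs these tools introduce:

    // Hypothetical LLM-written helper under review.
    function clampPercent(x: number): number {
      return Math.min(100, Math.max(0, x));
    }

    // The double-check: pin down the edge cases before merging.
    console.assert(clampPercent(-5) === 0, "negatives clamp to 0");
    console.assert(clampPercent(150) === 100, "overshoot clamps to 100");
    console.assert(clampPercent(42) === 42, "in-range values pass through");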
A classic example is the Travel Agent. This was already a job driven to near-extinction just by Google, but LLMs are a nail in the travel agent coffin.
The job was always fuzzy. It was always unreliable. A travel agent's recommendation was never a stamp of quality or a guarantee of satisfaction.
But now, I can ask an LLM to compare and contrast two weeks in the Seychelles with two weeks in the Caribbean, have it then come up with sample itineraries and sample budgets.
Is it going to be accurate? No, it'll be messy and inaccurate, but sometimes a vibe check is all you ever wanted to confirm that yeah, you should blow your money on the Seychelles, or to confirm that actually, you were right to pick the Caribbean.
Or that actually, both cost twice what you'd prefer to spend, so where, dear ChatGPT, would be more suitable?
etc.
When it comes down to the nitty-gritty, does it start hallucinating hotels and prices? Sure, at that point you break out TripAdvisor, etc.
But as a basic "I don't even know where I want to go on holiday (vacation), please help?" it's fantastic.
In the 80's and 90's, this is how most people booked their holidays. It was labour-intensive: people would spend some time talking with a travel agent in a store, who would have a good idea of the packages available and be able to make recommendations and match people with holidays.
The remnants of agencies still provide the same services, but (for most of us) it's all online, it's all tick-box based, and much of the protection is via ATOL/ABTA.
These services still exist, but they're no longer all over the high street. Names like Thomas Cook and Lunn Poly have either been absorbed (mostly by TUI) or collapsed, and have largely disappeared from the high street, with just a few left.
And those that are left have been reduced, much like retail banking, to entering your details into the same websites and services available to anyone, and talking you through the results the computer spits out - results you could have browsed yourself at home. The underpaid travel agent in the store isn't any better connected than you are. In fact, they're possibly even pushier about steering you toward the hotels with the best commission than the website is.
And anyway, there is no need to have two networks to iteratively refine output: one suffices (as we are naturally meant to do).
I can't fathom a future where OpenAI doesn't end up eating dirt, with Anthropic likely not far behind it. Nvidia will likely come out fine, since it still has gamers to disappoint, and the infrastructure build-out that did occur will crater the cost of GPUs at scale for smaller, smarter companies to take advantage of. So it will likely still kick around, but as another technology, not the second coming of Cyber Christ it's been hyped to be.
Or being able to explain the static physical forces in a picture that are keeping a structure from collapsing.
Or recommend me a python library which does X, Y and Z with constraints A, B and C.
But I guess you can file all the above under "data analysis".
https://www.plough.com/en/topics/life/technology/computers-c...
/s?
This isn't even an indictment, not really. I'm just reading between the lines here regarding when/how it's used. Nobody with intentionality uses these things. Nobody who CARES what they're making uses these things. And again, I want to emphasize, this is not an attack. There are tons of things I do in my work life that I utterly do not give a shit about, and LLMs have been a blessing for it. Not my code, fuck no. But all the ancillary crap, absolutely.
Not very hard to understand, except it seems to be.
I think and say this all the time. But people keep saying that AI will take all our jobs, and I'm so utterly confused by this.
Sometimes I wonder whether it's me who has gone mad, or everyone else.
Every type of automation ever invented has led to massive job cuts, and yes, some sectors never recovered.
But I never see them actually used this way. At the big institution end, companies and universities will continue to force AI tools on their employees in heavy handed and poorly thought out ways, and use it as an excuse to fire people whenever budgets get tight (or investors demand higher profits). At the opposite scale, with individual users, it’s really alarming how rapidly people seem to stop thinking with their own brain and offload all critical thinking to an LLM. That’s not “extending your capabilities,” that’s letting all your skills atrophy while you train a machine to be your shitty replacement.
Don't use LLMs to do 2 + 2. Don't use LLMs to ask how many r's are in strawberry.
For the love of God. It's not actual intelligence. This isn't hard. It just randomly spits out text. Use it for what it's good at instead: text.
Instead of hunting for how to do things in programming using an increasingly terrible search engine, I just ask ChatGPT. For example, this is something I've asked ChatGPT in the past:
in typescript, I have a type called IProperty<T>, how do I create a function argument that receives a tuple of IProperty<T> of various T types and returns a tuple of the T types of the IProperty in order received?
This question was such an edge case that I wasn't even sure how to word it properly, yet it actually yielded the answer I was looking for:

    function extractValues<T extends readonly IProperty<any>[]>(
      props: [...T]
    ): { [K in keyof T]: T[K] extends IProperty<infer U> ? U : never } {
      return props.map(p => p.get()) as any;
    }
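For what it's worth, the inference does check out in use. A quick sketch, with a guessed-at IProperty shape since the original interface isn't shown:

    interface IProperty<T> { get(): T; }

    const title: IProperty<string> = { get: () => "hello" };
    const count: IProperty<number> = { get: () => 42 };

    // T is inferred as [IProperty<string>, IProperty<number>],
    // so the result is typed as the tuple [string, number].
    const [t, c] = extractValues([title, count]);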
This doesn't look unreliable to me. It actually feels pretty useful. I just need [...T] there and infer there.

Articles like this are still very much needed, to push back against that narrative, regularly, until it DOES become as obvious to everyone as it is to you.
Even this isn't new. A few years ago we had people who sold knives telling everybody you could use knives to drink soup. And in some cases they weren't even kitchen knives, they were switchblades.
But use them to do more important things that require more precision and accuracy?
No thanks
Same thing.
I'm by no means saying that LLMs aren't useful. They're just not reliably useful.
There are Socratically minded people who are more addicted to that moment of belief change, and hence overall vastly more sceptical -- but I think this attitude is extremely marginal, and it probably requires a lot of self-training to be properly inculcated.
In any case, with LLMs, people really seem to hate the idea that their beliefs about AI and their reliance on LLM output could be systematically mistaken. All the while, when shown output in an area of their expertise, they realise immediately that it's full of mistakes.
This, of course, makes LLMs a uniquely dangerous force in the health of our social knowledge-conductive processes.
It's basically like a funnel, which can also be used the other way around if the user is okay with quirky side effects. It feels like a lot of people are using the funnel the wrong way around and complaining that it's not working.
The issue is that the vast majority of user-facing LLM use cases are where people don't have these high-quality starting points. They don't have 40k tokens to make 400.
This is the problem. The problem is how bullshit conscripts its dupes into this self-degradation and bad faith dialogue with others.
And of course, how there are mechanisms in society (LLMs now one of them) which correlate this self-degrading shallowness of reasoning -- so that all at once an expert is faced with millions of people with half-baked notions and a great desire to preserve them.
So, the LLM isn't just wrong, it also lies...
It is the person who reads this text as if written by a person who imparts these capacities to the machine, who treats the text as meaningful. But almost no text the LLM generates could be said to be meaningful, if any.
In the sense that if a two year old were taught to say, "the magnitude of the charge on the electron is the same as the charge on the proton", one would not suppose the two year old meant what was said.
Since the LLM has no interior representational model of the world, only a surface of text tokens laid out as if it did, its generation of text never comes into direct contact with a system of understanding that text. Therefore the LLM has none of the capacities implied by its use of language; it only appears to.
This appearance may be good enough for some use cases, but as an appearance, it's highly fragile.
I would argue that if the output of the LLM is to be interpreted as natural speech, and the output makes an authoritative statement which is factually incorrect but stated as if it were true, this is a lie.
The problem is that the tech is presented as if it did have the internal state that you accurately describe it not having.
The lie in this example is when it is prompted to describe the process by which it reached a result, and that description bears no resemblance to the actual process by which it reached the result.
This isn't a misrepresentation of some external facts but a complete fabrication that does not represent how it reached that result at all.
However, many users will accept this information, since it only involves internal aspects of the tool itself.
The fact that the LLM doesn't have this introspective information is part of exactly why LLMs are NOT intelligence, artificial or otherwise.
And yet they are being presented as such, also, a lie...
Since the LLM has no knowledge of how LLMs do addition, it will pick something that seems to make sense, and it picked the "carry the one" algorithm. New generations of LLMs will probably do better now that they have access to a better answer for that specific question, but it doesn't mean they have become more insightful.
Those who lie (possibly even to themselves) are those who pretend that mimicry, if stretched far enough, will surpass the actual thing, and who foster deceptive psychological analogies like "hallucinate".
It's just wrong, and then gives misleading explanations of how it got the wrong answer, following the same process that led to the wrong answer in the first place. Lying is a subset of being wrong.
The tech has great applications, so why hype the stuff it doesn't do well? Or apply terms that misrepresent the process the software uses?
One might say the use of the word "hallucinate" is an analogy, but it's a poor analogy, one that further misleads the lay public about what is actually happening inside the LLM and how its results are generated.
If you want to assert that "hallucinate" is an analogy, then "lying" is also an analogy.
If every prompt that ever went into an LLM was prefixed with: "Tell me a made up story about: ...", then the user expectation would be more in line with what the output represents.
I'm not averse to the tech in general, but I am against the rampant misrepresentation that's going on...
Although "isn't helpful" is rather dodgy wording. "Helpful" for who? "Helpful" in what way?
I think most users would find it helpful if the output was not presented as correct, when it's incorrect.
But, that's not the way the corps are describing it, is it?
What really concerns me is that the big companies on whose tools we all rely are starting to push a lot of LLM generated code without having increased their QA.
I mean, everybody cut QA teams in recent years. Are they about to make a comeback once big orgs realize that they are pushing out way more bugs?
Am I way off base here?
I believe AI/ML will eventually get there, but definitely not with LLMs or by hoarding the whole internet. Most human know-how isn't on the internet!
Oh, I guess I'm a fool.
Problem 1: Training
Using any method like RLHF, DPO, or such guarantees that we train our models to be deceptive.
This is because our metric is the Justice Potter Stewart metric: I know it when I see it. Well, you're assuming that this is accurate. The original case was about defining porn, and... I don't think it is hard to see how people disagree even on that. Go on Reddit and ask whether girls in bikinis are safe for work or not. But it gets worse. At times you'll be presented with a choice between two lies: one lie you know is a lie, and the other you don't. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. The same is true when the choice is between a truth and a lie we do not know is a lie: they both look like truths.
This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced, those nuances matter more and more.
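A toy version of that selection pressure (everything here is made up purely for illustration): if the labeler can only reward what looks true, a plausible fabrication collects the same reward as the truth, and more than an obvious falsehood.

    type Answer = { text: string; isTrue: boolean; looksTrue: boolean };

    // A labeler applying "I know it when I see it" can only score plausibility.
    const preferenceReward = (a: Answer): number => (a.looksTrue ? 1 : 0);

    const candidates: Answer[] = [
      { text: "correct answer",        isTrue: true,  looksTrue: true  },
      { text: "obvious nonsense",      isTrue: false, looksTrue: false },
      { text: "plausible fabrication", isTrue: false, looksTrue: true  },
    ];

    // The fabrication is rewarded exactly like the truth,
    // so training cannot distinguish them.
    for (const c of candidates) {
      console.log(c.text, "->", preferenceReward(c));
    }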
Problem 2: Metrics and Alignment
All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements which are perfectly aligned with what you intend to measure.
This can easily be observed with even simple forms of measurement, like measuring distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, there is a fuck ton more complexity to measuring things than you'd guess. I have a lot of rules, calipers, micrometers, and other stuff at my house. Guess what: none of them actually agree on measurements. They are all pretty close, but they differ by more than their marked precision levels. I'm not talking about my ruler with mm hatch marks being off by <1mm, but rather by >1mm. RobertElderSoftware illustrates some of this in a fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have actually not provided them measurements.
In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone who works in an optics lab. Boy, does a lot of stuff come up that throws off your measurements - and it seems so straightforward, you're just measuring distances.
This gets less straightforward once we talk about measuring things that aren't concrete. What's a high-fidelity image? What is a well-written sentence? What is artistic? What is a good scientific theory? None of these even have answers; they are highly subjective. The result is that your precision is incredibly low. In other words, you have no idea how well you've aligned things. It is fucking hard in well-defined practical areas, and the stuff we're talking about isn't even close to well defined. I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside them. The theory sits invisible in the background, but it is critical to advancement.
We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds on it, because even bounding the measurements to certain levels of error provides huge value. These are certainly achievable things, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.
I use LLM chat for a wide range of tasks including coding, writing, brainstorming, learning, etc.
It’s mostly right enough. And so my usage of it has only increased and expanded. I don’t know how less right it needs to be or how often to reduce my usage.
Honestly, I think it’s hard to change habits and LLM chat, at its most useful, is attempting to replace decades long habits.
Doesn’t mean quality evaluation is bad. It’s what got us where we are today and what will help us get further.
My experience is anecdotal. But I see this divide in nearly all discussions about LLM usage and adoption.
Honestly, this is why your experience is different: your expectations are different (and likely lower). I never find they are "mostly right enough"; I find they are "mostly wrong in ways that range from subtle mistakes to extremely incorrect". The more subtly they are wrong, the worse I rate their output, because that is what costs me more time when I try to use them.
I want tools that save me time. When I use LLMs I have to carefully write the prompts, read and understand, evaluate, and iterate on the output to get "close enough" then fix it up to be actually correct.
By the time I've done all of that, I probably could have just written it from scratch.
The fact is that typing speed has basically never been the bottleneck for developer productivity, and LLMs basically don't offer much except "generate the lines of code more quickly", imo.
To be clear this isn't a knock on anyone's work, but it does seem to be a source of why "pro-LLM" and "anti-LLM" groups tend to talk past each other.
Just as an example from today: I had a huge pile of YAML documents that needed some transformations done to them. They were pretty simple and obvious, but I just went into Cursor, gave it a before and after and a few notes, and it wrote a Python script in less than 10 seconds that converted everything exactly the way I needed. Did it save me a day of work? Probably not, but probably an hour or so of looking up Python docs and iterating until I worked out all the syntax errors myself. An hour here and an hour there adds up to a _lot_ of saved time.
I spent more time just writing this comment than I did asking Cursor to write and run that script for me.
Other things I had an LLM do for me just _today_: fix a GitHub Action that was failing, and knock out a developer README for a Helm chart documenting what all the values do. That's one of those tasks where it gets a lot of stuff wrong, but typing speed _is_ the bottleneck. It took me a minute or so to fix the things it misunderstood, but the formatting and the bulk of it were fine.
You're comfortable with the uncertainty, and accommodate it in your use and expectations. You're left feeling good about the experience, within that uncertainty. Others are repelled by uncertainty, so will have a negative experience, regardless of how well it may work for a subset of tasks they try, because that repulsive uncertainty is always present.
I think it would be interesting (and possibly very useful/profitable for the marketing/UI departments of companies that use AI) to find the relation between perceived AI usefulness and the results of some of the "standard" personality tests.
I don't want to have to waste time tidying up after an unreliable software tool which is being sold as saving me time. I don't want to be misled by hallucinated fantasies that have no relationship to reality. (See also - lawyers getting laughed out of courtrooms because of this.)
I don't want to have to cancel a travel booking because an AI agent booked me a holiday in Angkor Wat when I wanted a train ticket to Crystal Palace in South London.
Hypotheticals? Not even slightly. Ask anyone who's lost their KDP author account on Amazon or been locked out of Meta because of AI moderation errors.
This is common sense, not some kind of personality flaw.
I'm happy using LLMs for coding and research, but it's also clear the technology is in perpetual beta - at best - and is being wildly oversold.
Normal software operating with this level of reliability would be called "very buggy."
But apparently LLMs get a pass because one day they might not be as buggy as they are today.
Which - if you think about it - is ridiculous, even by the usual standards of the software industry.
As a grown-up I now use a dishwasher for everything that is permitted to go in it. I still have to rinse off plates first, and occasionally I do see rice between the tines of a fork that I then have to clean manually. But I'm now comfortable knowing that it won't clean as well as I could by hand, because it does a good enough job - and in some ways a much better job (it uses much hotter water than I do by hand). I don't know if my mom could ever really be comfortable with it, though.
It's great for reviews, where any given reviewer could be expected to misunderstand certain details or skip a section (RAG somewhat helps this), but it's frustrating for artifact generation, where missing details cascade through the project.
As great as the technology is (right now), it seems so far from reliable business-process automation.
It’s also possible - and you should not take this as an insult, it’s just the way it is - you may not know enough about the subjects of your interactions to really spot how wrong they are.
However, the cases you list - brainstorming - don’t really care about wrong answers.
Coding is in the eye of the beholder, but for anything that isn’t junk glue code, scripts or low-complexity web stuff, I find the output of LLMs just short of horrendous.
In terms of code output, I have gone from the productivity of a single Sr. Engineer to that of a team with 0.8 of a Sr. Engineer, 5 Jr. Engineers, and one dude solely dedicated to reading/creating documentation.
Unlike a lot of my fellow engineers, who are also from traditional CS backgrounds and haven't worked in revenue-restricted startup environments, I have also been VERY into interpreted languages like Ruby in the past.
Now compiled languages are even better. From a velocity perspective, compiled languages are now essentially on par for prototyping and have had their last weakness removed.
It's both exciting and scary. I can't believe how people are still sleepwalking in this environment and don't realize we are in a different world. Once again, the human inability to "gut reason" about exponentials is going to screw us all over.
One terribly overlooked thing I've noticed that I think explains the differing takes. Foundation of my position here: https://www.nature.com/articles/s41598-020-60661-8
Within the population that writes code there are a small number of successful people who approach the topic in a ~purely mathematical approach, and a small number of successful people that approach writing code in a ~purely linguistic approach. Most people fall somewhere in the middle.
Those who are on the MOST extreme end of the mathematical side and are linguistically bereft HATE LLMs and effectively cannot use them.
My guess is that the HN population will tend to show stronger reactions against LLMs because it was heavily seeded with functional programmers, which I think concentrates the successful, extremely math-focused type. I worked for several years in a purely functional shop, and that was my observation: Elixir, Haskell, Ramda.
Just my speculation.
Also, congratulations on becoming a team. I sure hope you have the mental bandwidth to check all that output carefully. If so, doubly congrats, because you might be the smartest human that ever lived.
This is an interesting observation. It at least aligns with my experience. I wouldn't say I'm "linguistically bereft" lol, but I do lean more toward the "functional programming is beautiful" side. I even have a degree in math. I'm not totally down on LLM coding, but I do fall more on the unfavorable-feelings side. I mostly just hate the idea of having a bunch of code I don't fully understand but am still responsible for.
I do use them, and find them helpful. But the idea of fully giving control of my codebase to LLM agents, like some people are suggesting, repels me.
What do you use it for?
In my space, "mostly right enough" isn't useful. Particularly when that means that the errors are subtle and I might miss them. I can't write whitepapers that tell people to do things that would result in major losses.