https://aiindex.stanford.edu/report/
"As internet pioneer and Google researcher Vint Cerf said Monday, AI is "like a salad shooter," scattering facts all over the kitchen but not truly knowing what it's producing. "We are a long way away from the self-awareness we want," he said in a talk at the TechSurge Summit."
https://www.cnet.com/tech/computing/bing-ai-bungles-search-r...
Since I'm not a student anymore, I can just give ChatGPT a few bullet points and ask it to write a paragraph for me. As an engineer who doesn't like writing "fluff", it's great I can now outsource the BS part of writing.
I'm interested in what parts of your job require the fluff. Is it communication with non-engineering teams?
It's also great for writing a professional sounding complaint letter to your utility company.
The future is people typing bullet points, expanding into polished prose for transmission, and compressing down to bullet points on the other end.
Today, ChatGPT helped me write a driver.
The driver either compiles, or it doesn't; it compiled. The driver either reads a value from a register, or it doesn't; it read. The driver either causes the chip to physically move electrons in the real world in the way that I want it to, or it doesn't.
The real world does not distinguish between bullshit or not. Things either work or they do not. They either are one way, or they are another way. ChatGPT produces things that work in reality. We humans live in reality. Reality is what matters.
I notice a thread through all of the breathless panicking about LLMs: it does not correspond to REALITY. It's a panic about a fiction. The fiction that the content of text is reality itself. The fiction that the LLM can somehow recursively improve itself. The fiction that the map is the territory.
The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? https://arxiv.org/abs/2110.14168
It's not only in America, and not only in government or large corporations. It's everywhere.
Do you need 7B/13B/33B/77B parameters to do this? That is a question up for debate and something I'm exploring with the concept of micro/nano models (https://neuml.hashnode.dev/train-a-language-model-from-scrat...). There is the sense that today's LLMs could be overkill for a problem such as RAG.
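The retrieval half of RAG needs no large model at all, which is part of why today's LLMs can feel like overkill there. A toy sketch of keyword-overlap retrieval in pure Python (the corpus and scoring are illustrative, not from the linked post):

```python
import re

def tokens(text):
    """Lowercased word tokens with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query (toy retriever)."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

corpus = [
    "The driver reads a value from a hardware register.",
    "ChatGPT can draft a complaint letter to a utility company.",
    "GSM8K is a dataset of grade school math problems.",
]
print(retrieve("grade school math problems", corpus))
```

A real system would use embeddings rather than word overlap, but the point stands: the generation step is the only place a large model is strictly needed, and a micro model might suffice even there.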
I've been using GPT-4 to write code almost daily for months now, and I'd estimate that it is maybe 80-90% accurate in general, with the caveat that the quality of the prompt can have a major impact on this. If the prompt is vague, you're unlikely to get good results on the first try. If the prompt is very thorough and precise, and relevant context is included, it can often nail even fairly complex tasks in one shot.
Regardless of what the accuracy number is, it strikes me as pretty silly to call them "BS Machines". It's like calling human programmers "bug machines". Yeah, we do produce a lot of bugs, but we somehow seem to get quite a bit of working software out the door.
GPT-4 isn't perfect and people should certainly be aware that it makes mistakes and makes things up, but it also produces quite a lot of extremely useful output across many domains. I know it's made me more productive. Honestly, I can't think of any programming language, framework, technique, or product that has increased my productivity so quickly or dramatically in the 17 years I've been programming. Nothing else even comes close. Pretty good for a BS machine.
Sure, the first-order output of today's generalist LLMs, emitting one token at a time, does seem to meet diminishing returns on factuality at approximately the level of a college freshman pulling an all-nighter. Not a great standard, that. But if you took an entire class of those tired freshmen, gave their outputs to an independent group of tired freshmen unfamiliar with the material, and told the second group to identify, in a structured manner, commonalities and discrepancies, topics they'd look up in an encyclopedia, things they'd escalate to a human expert, and so on... all of a sudden, you can start to build structured knowledge about the topic, and an understanding of what is and isn't likely to be a hallucination.
One might argue that the right kind of model architecture and RLHF could bake this into the LLM itself - but you don't need to wait for that research to be brought into production to create a self-correcting system-of-systems today.
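That cross-checking loop can be sketched today without any new model research: sample several independent answers, treat agreement as a confidence signal, and escalate low-agreement questions. A minimal sketch (the candidate answers below are hard-coded stand-ins for independent LLM samples):

```python
from collections import Counter

def consensus(candidates, threshold=0.5):
    """Group candidate answers and flag low-agreement ones for escalation.

    Returns (best_answer, agreement_fraction, needs_human_review).
    """
    normalized = [c.strip().lower() for c in candidates]
    best, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return best, agreement, agreement < threshold

# Stand-ins for five independent samples of the same question.
samples = ["Paris", "paris", "Paris", "Lyon", "Paris"]
answer, agreement, escalate = consensus(samples)
print(answer, agreement, escalate)  # paris 0.8 False
```

Exact-string matching is the crudest possible "commonality" check; a fuller system-of-systems would have a second model judge semantic agreement, but the control flow is the same.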
That appears more or less correct.
So to give chatgpt an opportunity to hallucinate similarly to the article, I followed up with, “Did he write for the nyt?” and it replied, “I do not have any information indicating that Ryan McGreal has written for The New York Times (NYT). His work primarily focuses on urban issues and transportation, as mentioned earlier, and he is associated with Raise the Hammer, a local publication in Hamilton, Ontario, Canada. It’s possible that he may have contributed to other publications, but I do not have specific information regarding his contributions to The New York Times.”
While I have seen ChatGPT make stuff up I do think it’s useful to compare specific results across LLMs before using particular examples to make holistic statements.
Ask in this order:
1) What is the NYT (New York Times)?
2) Who is Ryan McGreal?
3) Did he write for the NYT?
This builds up more context for hallucination.
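The three-step prompt sequence above amounts to accumulating a multi-turn conversation history; with a chat-style API that is just a growing message list. A sketch, with a hypothetical `ask` function standing in for a real API call:

```python
def ask(history, question):
    """Append a user turn, get a (stubbed) assistant reply, append it too.

    `reply` is a placeholder; a real client would call a chat API here.
    """
    history.append({"role": "user", "content": question})
    reply = f"[model reply to: {question}]"  # hypothetical stand-in
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
ask(history, "What is the NYT (New York Times)?")
ask(history, "Who is Ryan McGreal?")
ask(history, "Did he write for the NYT?")
# Each question is answered with the full prior exchange in context,
# which is exactly what gives the model more material to hallucinate from.
print(len(history))  # 6 messages: three user/assistant pairs
```

The third question only makes sense given the second, so the ambiguous "he" rides along in the history and the model has to resolve it from its own earlier answer.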
So I'm curious why my personal experience doesn't match all the complaints about hallucinations.
These are premised on regurgitating inputs: they can imitate more than one observer's interpretation of truth at a time. The more, the better.
Humans have been incentivized to essentially be BS machines.
From low-quality blog posts to the highest-grossing marketing and everything in between (including many published books and scientific papers): BS makes enough money that its low effort yields a decent ROI.
Of course an AI trained on a large human corpus is going to produce BS. It’s just doing what it learned.
Unless the work is purely mechanical, it requires some form of BS, and that's why we've traditionally been so much better at it than machines. We've never been able to create "BS machines" before, so this completely shifts the paradigm.