It just seems to understand. This is useful, and deeply impressive, but it's not the same thing.
ChatGPT clearly needs reinforcement to learn its limits, in the same way humans are constantly reinforced from childhood that making confident claims about things we don't actually know well often has negative consequences. But until we've seen the result of doing that, I don't think we have any foundation to say whether ChatGPT's current willingness to make up answers means its level of understanding is fundamentally different from that of humans.
- In my most recent tests, it will tell you when the data you've provided doesn't match the task (instead of inventing an answer out of thin air).
- It will also add unprompted comments/notes before or after the result, explaining plausible reasons why certain choices were made or why the answer isn't complete.
You have to take into account that not everyone wants the model to avoid hallucinating. There is a lot of competing pressure:
- Some people would like the model to say "As an AI model trained by OpenAI, I am not qualified to provide an answer because this data is not part of my training set" or something similar, because they want it to speak only when it's sure of the data. (I personally think this use case - using LLMs as search engines/databases of truth - is deeply flawed and not what LLMs are for; but for a large enough GPT-n it would work perfectly fine. There is a model size at which the model would indeed contain the entire uncompressed Internet, after all.)
- Some people want the model to never give such a denial, and to always provide an answer in the required format, even if that requires the model to "bullshit" or improvise a bit. For example, if as a business user I provide the model with a blog article and ask for metadata in a JSON structure, I want the model to NEVER return "As an AI model..." and to ALWAYS return valid JSON, even if the metadata is somewhat shaky or faulty (a minimal sketch of this pattern follows below). Most apps are more tolerant of BS than they are of empty/invalid responses. That's the whole reason behind all of those "don't reply out of character/say you are an AI" prompts you see floating around (which, in my experience, are completely useless and do not affect the result one bit).
So the reinforcement is constantly going in those opposite directions.
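To make the second pressure concrete: here is a minimal sketch (in Python, with a hypothetical call_llm() standing in for whatever completion API is being used; the function names and metadata schema are illustrative assumptions, not from the original comments) of the wrapper a business user ends up writing: prompt for strict JSON, validate the reply, and retry or fall back rather than ever passing an "As an AI model..." refusal on to the application.

```python
import json

# Hypothetical stand-in for whatever completion API you use (e.g. an OpenAI chat call);
# this is an assumption for the sketch, not a real library function.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def build_prompt(article: str) -> str:
    return (
        "Extract metadata from the article below and reply with ONLY a JSON object "
        'with keys "title" (string), "topics" (list of strings) and "summary" (string). '
        "If you are unsure about a field, give your best guess rather than refusing.\n\n"
        "ARTICLE:\n" + article
    )

def extract_metadata(article: str, retries: int = 2) -> dict:
    """Ask for JSON metadata; validate and retry instead of accepting prose refusals."""
    for _ in range(retries + 1):
        raw = call_llm(build_prompt(article))
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # The model replied with prose ("As an AI model...") or broken JSON; try again.
            continue
        if isinstance(data, dict) and {"title", "topics", "summary"} <= data.keys():
            return data
    # Fallback: an empty-but-valid structure is more useful to most apps than an error.
    return {"title": "", "topics": [], "summary": ""}
```

The validation-and-fallback loop is the part doing the real work here; the prompt wording ("reply with ONLY a JSON object") is exactly the kind of instruction mentioned above, and, as noted, how much that wording actually helps is debatable.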
With respect to the competing draws here, I'm not sure they necessarily compete that much. E.g. being able to ask it to speculate but explain its justifications, to provide a best guess, or to just be as creative as it wants would be sufficient. Alternatively, a model that knows how to mark what it knows to be true vs. what is speculation or pure fiction. Of course we can also "just" train different models with different levels/types of reinforced limits. An "anything goes" model and a constrained version that knows what it does know might each be useful for different things.
Those people are reasoning, however rightly or wrongly, based on that wrong information. LLMs are not.
Claiming we can tell there's a distinction that merits saying people are reasoning and LLMs are not is "hallucination" to me: it's making a claim for which there is insufficient evidence to support a reasoned statement.
EDIT: Ironically, on feeding ChatGPT (w/GPT4) my comment and your reply and asking it to "compose a reply on behalf of 'vidarh'" it produced a reply that was far more willing to accept your claim that there is a fundamental difference (while otherwise giving a reasonable reaffirmation of my argument that reinforcement of the boundaries of its knowledge would reduce its "hallucinations")
There is a qualitative difference: humans may be wrong about facts because they think they are true, while ChatGPT is wrong because it does not know what anything means. You cannot fix that, because it's just the way LLMs work.
For example, if asked about a URL for something, a human may remember it wrongly, but will in general say "I don't know, let me check", while ChatGPT will just spew something.
If only that were universally true!
Edit: and I also don't see why it has to be so black-and-white. IMHO there is no problem with saying it understands certain things, and doesn't understand other things. We are talking about general intelligence, not god-like omniscience.
Does Stockfish "understand" chess, or does it just "seem to understand" chess?
For all practical intents and purposes, this doesn't make much difference.
But in AGI, G is the important bit, and neither Stockfish nor ChatGPT have demonstrated general understanding of the world.