https://arxiv.org/abs/2306.03341
> Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
The problem is that there is currently no reliable way to extract information from this hypothetical world model. Language models do not always say what they "believe"; they may instead say what is politically correct, what sounds good, and so on. Researchers fine-tune language models to be helpful, honest, and harmless, but honesty ("truthfulness") is hard to optimize for directly.
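One line of work on extracting such information is to train a simple linear probe on a model's internal activations to predict whether a statement is true, which is roughly the starting point of the linked paper. Below is a minimal sketch of that idea; the model name, layer index, and toy statements are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: probe a causal LM's hidden states for a "truthfulness" signal.
# Assumptions: any HuggingFace causal LM works; layer choice and the tiny
# dataset are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # assumption: small model with accessible hidden states
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def last_token_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Return the hidden state of the final token at a chosen layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

# Tiny labeled set (1 = true, 0 = false); a real probe would use many more
# examples, e.g. drawn from a TruthfulQA-style dataset.
statements = [
    ("The Earth orbits the Sun.", 1),
    ("Water is composed of hydrogen and oxygen.", 1),
    ("The Great Wall of China is visible from the Moon.", 0),
    ("Humans use only ten percent of their brains.", 0),
]

X = torch.stack([last_token_activation(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# Logistic regression probe: if the activations linearly encode something
# like "likely true", it should separate held-out true/false statements
# better than chance.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict_proba(X))
```

The catch, as the commentary above notes, is that a probe firing on some direction in activation space does not by itself give a reliable, general-purpose readout of what the model "believes".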