The training process ingests a large fraction of the text on the public internet, including a huge volume of SEO garbage, and tries to build a self-consistent compressed model of it. This is far from perfect, of course, but the result is also likely more truthful than the median Google result, because both the pretraining objective and the later RL stage create an incentive for self-consistency and coherence.
Imagine that you had 1,000 years and infinite patience to read every Google result on a particular topic. You would read a lot of rubbish, but you are a smart person: ultimately you would figure out the underlying truth and likely produce something more valuable than the average of the parts, or even their sum.