They say "LLMs are trained on the web". Are the web pages converted from HTML into Markdown before being fed into training?