At this point, we've all gotten quite used to the "style" of LLM outputs, and personally I doubt this is what happened here.
Still, some contamination of the data is possible: there was no way to measure how well LLMs predict the next word
before LLMs existed, so any post-LLM text in the corpus may include LLM-generated content.
I'd propose doing the same thing, but only including HN content from before LLMs existed. That should ensure there is no bias toward any of the models.
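As a rough sketch of that filter (the cutoff date here is my arbitrary choice of the GPT-2 release as a "pre-LLM" boundary, and the toy items just mimic the shape of HN API responses, which carry a Unix `time` field):

```python
from datetime import datetime, timezone

# Hypothetical cutoff: before the GPT-2 release (Feb 14, 2019) --
# pick whatever date you consider safely "pre-LLM".
CUTOFF = datetime(2019, 2, 14, tzinfo=timezone.utc).timestamp()

def pre_llm_items(items, cutoff=CUTOFF):
    """Keep only items created before the cutoff.

    `items` are dicts shaped like HN Firebase API responses,
    which include a `time` field (Unix timestamp).
    Items missing `time` are dropped to be safe.
    """
    return [it for it in items if it.get("time", float("inf")) < cutoff]

# Toy data standing in for real API responses.
items = [
    {"id": 1, "time": datetime(2015, 6, 1, tzinfo=timezone.utc).timestamp()},
    {"id": 2, "time": datetime(2023, 6, 1, tzinfo=timezone.utc).timestamp()},
]
print([it["id"] for it in pre_llm_items(items)])  # [1]
```

The one subtlety is edited or late-added comments: the `time` field reflects creation, so a pre-cutoff item could in principle still have been edited after the cutoff.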