SEO text carefully tuned to tf-idf metrics and keyword stuffed to them empirically determined threshold Google just allows should have unnatural word frequencies.
LLM content should just enhance and cement the status quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).
Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.
Some day we may view this as the beginnings of machine culture.
Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.
2. Given how LLMs work, a prompt is a bias — they're one-and-the-same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors, to write the "character card" describing a character, in the style that you want the character speaking in, for exactly this reason.) Whatever prompt the developer uses, is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.
3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)
But, insofar as your use-case is against the TOUs of the big proprietary inference platforms, this second cost quickly swamps the first cost. They keep banning you, and you keep having to buy new dark-web credentials to come back.
Given this, it’s a lot cheaper and more reliable — you might summarize these as “more predictable costs” — to design a system around a substrate whose “immune system” won’t constantly be trying to kill the system. Which means either your own hardware, or a “being your own model” inference platform like RunPod/Vast/etc.
(Now consider that there are a bunch of fly-by-night BYO-model hosted inference platforms, that are charging unsustainable flat-rate subscription prices for use of their hardware. Why do these exist? Should be obvious now, given the facts already laid out: these are people doing TOU-violating things who decided to build their own cluster for doing them… and then realized that they had spare capacity on that cluster that they could sell.)
TFA mentions this hasn't been the case.
Too deep we delved, and awoke the ancient delves.