It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
SEO text carefully tuned to tf-idf metrics and keyword-stuffed to the empirically determined threshold Google just allows should have unnatural word frequencies.
LLM content should just enhance and cement the status quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
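A check for such sentinels could be sketched as a frequency comparison against a pre-LLM baseline. The baseline numbers and word list below are invented for illustration (real baselines would come from something like wordfreq's pre-2021 data):

```python
from collections import Counter

# Toy baseline frequencies (occurrences per million words) for a few
# "sentinel" words. These numbers are made up, not real wordfreq values.
BASELINE_PER_MILLION = {"delve": 2.0, "tapestry": 1.5, "moreover": 40.0}

def sentinel_ratios(text):
    """Compare observed per-million frequencies of sentinel words against
    a human-written baseline; a large ratio is suspicious."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    ratios = {}
    for word, baseline in BASELINE_PER_MILLION.items():
        observed = counts[word] / total * 1_000_000 if total else 0.0
        ratios[word] = observed / baseline
    return ratios

sample = "we delve into the data and delve into the details " * 10
print(sentinel_ratios(sample)["delve"] > 10)  # heavily over-represented
```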
For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).
At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.
Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.
Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.
How do we know what content LLMs were fed? Isn't that a highly guarded secret?
Won't the quality of the content be paramount to the quality of the generated output or does it not work that way?
2017: Invention of transformer architecture
June 2018: GPT-1
February 2019: GPT-2
June 2020: GPT-3
March 2022: GPT-3.5
November 2022: ChatGPT
You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis. The new stuff generated does (and this is honestly already captured).
This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.
Their research and projects are great.
Making resources like wordfreq more visible won't exacerbate any of these concerns.
I see a lot of people upset about AI still using AI image generation, because it's not in their field so they feel less strongly about it, and they can't create art themselves anyway. That's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.
Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.
Edit: just the first one
The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").
Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.
When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.
Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.
It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.
You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.
IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.
1. build a userbase, free product
2. once userbase get big enough, any new account requires a monthly fee, maybe $1
3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.
no ads, simple.
The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.
Even if so, this is a dangerous thought: it discourages the decisive action that is likely to be necessary for that to happen.
Sci-fi author:
I created the Torment Nexus to serve as a cautionary tale...
Tech Company:
Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm
As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.
Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...
Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/
This seems to bypass all of the LLM stuff for now.
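Scripted, the trick looks something like this. It assumes the product identifier (the ASIN) is the last path segment of the product URL, which is common but not guaranteed, and the example URL is a hypothetical reconstruction:

```python
def review_url(product_url):
    """Build an Amazon review-page URL from a product URL by moving the
    trailing product identifier (ASIN) onto the review path. Assumes the
    ASIN is the final path segment, which may not hold for every Amazon
    URL format."""
    asin = product_url.rstrip("/").rsplit("/", 1)[-1]
    return f"https://www.amazon.com/product-reviews/{asin}/"

# Hypothetical product URL ending in the ASIN from the example above.
print(review_url("https://www.amazon.com/Some-Product-Name/dp/B001OF5F1E"))
```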
I used to be able to search for "Trek bike derailleur hanger" and the first result would be what I wanted. Now I have to scroll past 5 ads to buy a new bike, one that's a broken link to a third party, and, if I'm really lucky, at the bottom of page 1 will be the link to that part's page.
The shitification of the web is real.
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Name rolls off the tongue doesn’t it
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.
To get to the milk you'll have to walk by 3 rows of chips and soda.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1800 number with no real-person alternative to the decision tree
We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.
Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.
The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homo sapiens are around 300,000 years old. Cities, agriculture, and civilization are around 10,000. Metal is around 4,000. The industrial revolution is 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
I'm not quite sure where this is all headed.
It really isn't. Have a look at daily median income statistics for the rest of the planet:
https://ourworldindata.org/grapher/daily-median-income?tab=t...
$2.48 Eastern and Southern Africa (PIP)
$2.78 Sub-Saharan Africa (PIP)
$3.22 Western and Central Africa (PIP)
$3.72 India (rural)
$4.22 South Asia (PIP)
$4.60 India (urban)
$5.40 Indonesia (rural)
$6.54 Indonesia (urban)
$7.50 Middle East and North Africa (PIP)
$8.05 China (rural)
$10.00 East Asia and Pacific (PIP)
$11.60 Latin America and the Caribbean (PIP)
$12.52 China (urban)
And more generally: $7.75 World
I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income of at least half the population of our planet.

Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, does it have a battery, that sort of stuff). Then an automated factory makes you a bespoke device (software picks an appropriate shell, regulators, etc). I bet we'll see it in our lifetimes at least.
There is the very real possibility that everything just stalls and plateaus where we are at. You know, like our population growth: it should have grown exponentially, but it did not. Actually, quite the reverse.
Progress isn't inevitable. It's possible for knowledge to be lost and for civilization to regress.
The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies holding "This is real"/"This is not real" signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seem a little "off" but are consistent with the small number of "influencer" videos I have exposure to.
I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?
I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.
And we are going to feed them with it, endlessly.
Even what's free & open source in the special effects community is astonishing lately.
And it already happened, and no one pushed back while it was happening.
It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher. Just cause, there's a lot less crap on gopher (and no, gopher is not the answer).
But...
A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.
The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).
Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.
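For reference, the kind of conversion described can be sketched with ffmpeg's DVD target preset, which pins the output to DVD-compliant MPEG-2. The exact flags the commenter landed on aren't known, and the HDR-to-SDR tone mapping step is glossed over here; this is just one plausible starting point:

```python
def dvd_convert_cmd(src, dst, standard="pal"):
    """Assemble an ffmpeg command that converts a modern video file into
    DVD-compliant MPEG-2. The -target preset fixes resolution, frame
    rate, codecs, and mux settings to the DVD spec for the given
    standard (pal-dvd or ntsc-dvd)."""
    return [
        "ffmpeg", "-i", src,
        "-target", f"{standard}-dvd",
        dst,
    ]

# Hypothetical filenames, just to show the shape of the command.
print(" ".join(dvd_convert_cmd("dance_4k_hdr.mp4", "dance_dvd.mpg")))
```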
Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.
Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97-year-old eyes).
In my opinion, the internet can be considered the equivalent of a natural environment like the earth. It's a space where people share, meet, talk, etc.
I find it astonishing that after polluting our natural environment, we have now polluted the internet.
If we haven't already, we will very soon. I'm sure there are people working on this problem, but I think we're starting to hit a feedback-loop moment. Most of humanity's recorded information is digitized, and non-human content is now being generated at an incredible pace. We've injected a whole lot of noise into our usable data.
I don't know if the answer is more human content (I'm doing my part!) or novel generative content but this interim period is going to cause some medium-term challenges.
I like to think the LLM more-tokens-equals-better era is fading and we're getting into better use of existing data, but there's a very real inflection point we're facing.
Corporations did that, not humans.
"few people recognize that we already share our world with artificial creatures that participate as intelligent agents in our society: corporations" - https://arxiv.org/abs/1204.4116
Nice try
If it’s not clear, I’m joking.
(I initially wanted to say 'paid for by the government' but that'd be socialising losses and we've had quite enough of that in the past.)
Next token-seeking is a solved problem. Novel thinking can be solved by humans and possibly by AI soon, but adding more garbage to the data won't improve things.
>> Generative AI has polluted the data
Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.
Future models will begin to continue to amplify certain statistical properties from their training, that amplified data will continue to pollute the public space from which future training data is drawn. Meanwhile certain low-frequency data will be selected by these models less and less and will become suppressed and possibly eliminated. We know from classic NLP techniques that low frequency words are often among the highest in information content and descriptive power.
Bitrot will continue to act as the agent of Entropy further reducing pre-AI datasets.
These feedback loops will persist, language will be ground down, neologisms will be prevented and...society, no longer with the mental tools to describe changing circumstances; new thoughts unable to be realized, will cease to advance and then regress.
Soon there will be no new low frequency ideas being removed from the data, only old low frequency ideas. Language's descriptive power is further eliminated and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles, language is lost, civilization is lost when we no longer know what to call its downfall.
The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
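The amplification loop described above can be simulated in a few lines. This is a toy model with invented numbers, not a claim about real training dynamics; it just shows how repeatedly sharpening a distribution starves its low-frequency tail:

```python
# Toy simulation of the feedback loop: each "generation" of models
# slightly sharpens the word distribution it was trained on, so
# low-frequency words get sampled less and less.
def next_generation(freqs, sharpening=1.2):
    """Raise each probability to a power > 1 and renormalize,
    mimicking a model that over-samples already-common words."""
    sharpened = {w: p ** sharpening for w, p in freqs.items()}
    total = sum(sharpened.values())
    return {w: p / total for w, p in sharpened.items()}

# Invented starting distribution over three words.
freqs = {"the": 0.70, "delve": 0.05, "palimpsest": 0.25}
for _ in range(20):
    freqs = next_generation(freqs)

print(round(freqs["the"], 3))  # the common word has taken over
```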
That's why on FB I mark my own writing as AI generated, and the AI generated slop as genuine. Because what is disguised as "transparency disclaimer" is just flagging content of what's a potential dataset to train from and what isn't.
Apropos of nothing in particular, see LinkedIn now admitting [1] it is training its AI models on "all users by default"
What would it take for Open AI overlords to inject words they want to force into usage in their models and will new words into use? Few have had the power to do such things. Open AI through its popular GPT platform now has the potential of dictating the evolution of human language.
This is novel and scary.
Or we'll be fine, because inbreeding isn't actually sustainable either economically or technologically, and to most of the world the Silicon Valley "AI" crowd is more an obnoxious gang of socially stunted and predatory weirdos than some unstoppable omnipotent force.
But we do know that now it's a lot more, with a big LOT.
Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if the Internet humanity is already on the retreat into their last comely homes, which are more closed than open to the outside.
On the positive side:
1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.
2. As a result of (1), I have started to use an RSS reader again. Who would have thought!
I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.
Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.
But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.
As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human as well.
To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.
So, it becomes an arms race that constantly evolves.
If we add linguistics to NLP, I can see an argument; but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs/generative AI is the only research an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm we think can enable a computer to understand language, other than training a large deep learning model on a lot of data?
compare the frequency of words to those used in human natural writings and you spot the computer from the human.
Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.
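In its simplest form, that comparison is a divergence measure between two word distributions. A toy sketch (real use would need large samples and proper smoothing; the corpora here are invented):

```python
import math
from collections import Counter

def kl_divergence(sample, reference, eps=1e-9):
    """KL divergence (in bits) of a sample's word distribution from a
    reference distribution; higher means the sample's word usage looks
    less like the reference. eps crudely smooths unseen words."""
    s, r = Counter(sample.split()), Counter(reference.split())
    vocab = set(s) | set(r)
    st, rt = sum(s.values()), sum(r.values())
    return sum(
        (s[w] / st) * math.log2((s[w] / st + eps) / (r[w] / rt + eps))
        for w in vocab if s[w]
    )

human = "the cat sat on the mat and the dog slept"
clone = "the cat sat on the mat and the dog slept"
weird = "delve delve tapestry delve moreover delve delve"

print(kl_divergence(clone, human) < kl_divergence(weird, human))
```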
Hmm I don’t disagree but I think it will be valuable skill going forward to write text that doesn’t read like it was written by an LLM
This is an arms race that I’m not sure we can win though. It’s almost like a GAN.
But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.
I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.
I like this.
Maybe even take it a step further - have a badge on the source that is both human and machine visible to indicate that the content is not AI generated.
Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.
Might even change the tool name.
I guess, the same way scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non-human content.
Saying magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.
In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content are a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such as those based on simple usage counts, largely irrelevant.
What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting edge language model of some sort these days? Maybe even a really small one for some things.
There is the case of what is "truth". As soon as you start to ensure some quality of truth to what is generated, that is political.
As soon as generative AI has the capability to take someone's job, that is political.
The instant AI can make someone money, it is political.
When AI is trained on something that someone has created, and now they can generate something similar, it is political.
It could also be useful for guessing whether someone might have been trying to do some kind of steganographic or additional encoding in their work, by telling you how abnormal compared to how many people write it is that someone happened to choose a very unusual construction in their work, or whether it's unlikely that two people chose the same unusual construction by coincidence or plagiarism.
You might also find statistical models interesting for things like noticing patterns in people for whom English or others are not their first language, and when they choose different constructions more often than speakers for whom it was their first language.
I'm not saying you can't use an LLM to do some or all of these, but they also have something of a scalar attached to them of how unusual the conclusion is - e.g. "I have never seen this construction of words in 50 million lines of text" versus "Yes, that's natural.", which can be useful for trying to inform how close to the noise floor the answer is, even ignoring the prospect of hallucinations.
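That "never seen in 50 million lines of text" signal is essentially n-gram surprisal against a big count table. A toy version with bigrams (the tiny corpus here is a stand-in for a real reference collection):

```python
import math
from collections import Counter

def bigram_surprisal(phrase, corpus, alpha=1.0):
    """Average surprisal (bits) of a phrase's word bigrams under
    add-alpha-smoothed counts from a reference corpus. Rare or unseen
    bigrams score high -- 'I have never seen this construction'."""
    words = corpus.lower().split()
    bigrams = Counter(zip(words, words[1:]))
    total = sum(bigrams.values())
    vocab = len(set(words)) ** 2  # crude smoothing denominator
    pw = phrase.lower().split()
    scores = [
        -math.log2((bigrams[(a, b)] + alpha) / (total + alpha * vocab))
        for a, b in zip(pw, pw[1:])
    ]
    return sum(scores) / len(scores)

corpus = "the cat sat on the mat . the dog sat on the mat ."
common = bigram_surprisal("the cat sat", corpus)
odd = bigram_surprisal("mat the purple", corpus)
print(common < odd)  # the unseen construction is more surprising
```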
I certainly have not encountered enough straight drivel where I would think it would have a significant effect on overall word statistics.
I suspect there may be some over-identification of AI content happening, a sort of Baader–Meinhof effect cognitive bias. People have their eye out for it and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.
Maybe I am biased: about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains of the most inane keyword-packed text, designed for literally no one but Google to read. It would rot your brain if you tried to read it, and it was written by hand by a team of human beings. This existed WELL before generative AI.
How confident are you in this assessment?
> straight drivel
We're past the point where what AI generates is "straight drivel"; every minute, it's harder to distinguish AI output from human output unless you're approaching expertise in the subject being written about.
> a team of copywriters who pumped out mountains the most inane keyword packed text designed for literally no one but Google to read.
And now a machine can generate the same amount of output in 30 seconds. Scale matters.
There are charts / graphs in the link, both since 2021, and since earlier.
The final graph suggests the phenomenon started earlier, possibly correlated in some way to Malaysian / Indian usages of English.
It does seem OpenAI's family of GPTs, as implemented in ChatGPT, unspools concepts in a blend of India-based-consultancy English with American freshman essay structure, frosted with superficially approachable or upbeat blogger prose ingratiatingly selling you something.
Anthropic has clearly made efforts to steer this differently, Mistral and Meta as well but to a lesser degree.
I've wondered if this reflects the training material (the SEO-is-ruining-the-Internet theory), or is more simply explained by the selection of pools of humans hired for RLHF.
> As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.
This will surely affect how we speak. It's possible that human language evolution could come to a halt, stuck in time as AI datasets stop being updated.
In the worst case, we will see a global "model collapse" with human languages devolving along with AI's, if future AIs are trained on their own outputs...
https://news.ycombinator.com/item?id=34966335
We will all get used to it.
It seems, however, that it started increasing in usage most in just these last few months; maybe people are talking more about "delve" specifically because of the increase in usage? A usage recursion of sorts.
But also, words and phrases do become popular among humans, right? It would be a shame if AI caused the language to get more stagnant, as keeping up with which phrases are popular get you labeled as an AI.
The funny fact: it doesn't result in an increase in search results for "delve".
Someone should start scanning all those microfiche archives in local libraries and sell the data.
This is a self-inflicted problem, IMO.
Do you just have shitty friends that share all that crap? Or are you following shitty pages?
I use Facebook a decent amount, and I don't suffer from what you're complaining about. Your feed is made of what you make it. Unfollow the pages that make that crap. If you have friends that share it, consider unfriending or at the very least, unfollowing. Or just block the specific pages they're sharing posts from.
'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'
really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.
I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).
Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? Seems a lot like facts to me considering what has been going on with the platform moderation since Elon Musk bought it. It’s so bad that the EU consider it a platform for misinformation these days.
I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content then repackaging it for sale under a diff title, byline, cover, and possibly at a much lower or even much higher price.)
And then the rise of LLMs and their ravenous training ingest bots -- plagiarism at scale and potentially even easier to disguise.
"Not gonna happen." - Bush Sr., via Dana Carvey
Now I keep the bulk of my book material non-public during dev. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall.
"Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.
Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."
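For the curious, that Cyrillic-to-Latin step is conceptually simple. Here is a hedged sketch of what such a transliteration can look like in Python, using my own illustrative letter mapping, not the project's actual code:

```python
# Illustrative Serbian Cyrillic -> Latin transliteration table
# (lowercase letters; uppercase forms are derived below).
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# str.translate accepts a mapping from code points to replacement
# strings, so the two-letter outputs (lj, nj, dž) work directly.
_TABLE = {ord(c): lat for c, lat in SR_CYR_TO_LAT.items()}
_TABLE.update({ord(c.upper()): lat.capitalize()
               for c, lat in SR_CYR_TO_LAT.items()})

def transliterate_sr(text: str) -> str:
    """Convert Serbian Cyrillic text to its Latin-script equivalent."""
    return text.translate(_TABLE)
```

With this table, `transliterate_sr("ћирилица")` yields `"ćirilica"`, so the Cyrillic and Latin spellings of the same word collapse to one frequency entry.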
I'd support keeping both scripts (српска ћирилица and Latin script), similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.
> The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.
Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.
> I don't think anyone has reliable information about post-2021 language usage by humans.
It's information about language usage by humans. We know the rate at which generated text has increased after 2021. How do we filter this to only have data from humans?
The bottom is just lamenting what's happening in the field (which is pretty much what everyone that's been doing anything with NLP research is also complaining about behind closed doors).
How sure can we be about that?
The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.
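The "very small fraction" point is what makes a personal crawler tractable: if you only follow an allowlist of domains, you ignore nearly everything. A minimal sketch of that filter (the domain names are hypothetical placeholders, not anyone's actual list):

```python
# Toy URL filter for a narrowly focused crawler: accept only an
# allowlist of domains (and their subdomains), skip the rest.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.org", "blog.example.net"}  # hypothetical

def should_crawl(url: str) -> bool:
    """Return True only for URLs on allowlisted domains."""
    host = urlparse(url).hostname or ""
    # Accept the domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d)
               for d in ALLOWED_DOMAINS)
```

Everything else a real crawler needs (politeness delays, robots.txt, deduplication) layers on top, but the allowlist is what keeps the scope small.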
For some of us, it was 1994, the Eternal September.
For some of us, it was when Aaron Swartz left us.
For some of us, it was when Google killed Google Reader (in hindsight, the turning point of Google becoming evil).
For some others, like the author of this post, it's when twitter and reddit closed their previously open APIs.
I think it was a decade or two ago, when most of the new tech being introduced (at least by our industry) started being unmistakably abusive and dehumanizing. When the recent past shows a strong trend, it's not unreasonable to expect that the near future will continue that trend. Particularly when it makes companies money.
That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate something.
> two companies that I already despise
Life’s too short to go through it hating others.
> it's very likely because they are creating a plagiarism machine that will claim your words as its own
That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.
Credits: every document I have ever read. *grin*
This kind of AI slop is quite literally written by no one (an algorithm pushed it out), and it doesn't communicate anything, since communication first requires some level of understanding of the source material, and LLMs are just predicting the likely next token without understanding. I would also extend this to AI slop prompted by someone with limited domain understanding: they have nothing new to offer, nor the expertise or experience to ensure the AI is producing valuable content.
I would go even further and say it's "read by no one" - people are sick and tired of reading the next AI slop article on google and add stuff like "reddit" to the end of their queries to limit the amount of garbage they get.
Sure there are people using LLMs to enhance their research, but a vast, vast majority are using it to create slop that hits a word limit.
Given that LLMs and human creativity work on fundamentally different principles, there is every reason to believe there is a difference.
The issue with generative 'AI' isn't that they generate text, it's that they can (and are) used to generate high-volume low-cost nonsense at a scale no human could ever achieve without them.
> Life’s too short to go through it hating others
Only when they don't deserve it. I have my doubts about Google, but I've no love for OpenAI.
> Plagiarism has a particular definition ... no-one is required to append a statement crediting every text he has ever read
Of course they aren't, because we rightly treat humans learning to communicate differently from training computer code to predict words in a sentence and pass it off as natural language with intent behind it. Musicians usually pay royalties to those whose songs they sample, but authors don't pay royalties to other authors whose work inspired them to construct their own stories maybe using similar concepts. There's a line there somewhere; falsely equating plagiarism and inspiration (or natural language learning in humans) misses the point.
And yet we're filled to the gills with Luddite sentiments and AI content fearmongering.
Imagine the hysteria and the skull-vibrating noise of the non-HN rabble when they come to understand where all of this is going. They're going to do their darndest to stop us from achieving post-economy.
The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.
It's already happening. A growing number of groups are forming their own "private internets," separated from the internet at large, precisely because the internet at large is becoming increasingly useless for a whole lot of valuable things.
>Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.
>And given what's happening to the field, I don't blame them.
What beautiful doublethink.
Given just how many AI bots scrape up everything they can, oftentimes ignoring robots.txt or any rate limits (there have been a few complaint threads on HN about that), I can hardly blame the operators of large online services just cutting off data feeds.
Twitter however didn't stop their data feeds due to AI or because they wanted money, they stopped providing them because its new owner does everything he can to hinder researchers specializing in propaganda campaigns or public scrutiny.
God I hate this dystopic timeline we live in.
And now a hopefully new comment: having a word-frequency measure of the internet as AI use ramps up would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being invaluable to researchers looking for the impacts of AI on language, and for empirically testing a lot of the claims the author makes in this very post! What a shame that they stopped measuring.
Also: as to the claims that AI will cause stagnation and a reduction in the variance of English vocabulary, this is a trend that's been happening in English for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AI output tends to be more professionally written than a lot of the internet. It's like being able to chat with someone who has an infinite vocabulary. It also makes it possible for people to read complicated documents well outside their domain, since they can ask not just for definitions but for more in-depth explanations of what words and sections mean.
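For what it's worth, the measurement this comment wishes had continued is, at its core, just normalized counting. A toy sketch (standard library only, with a made-up one-line corpus standing in for a real crawl):

```python
# Toy word-frequency measure: tokenize a corpus and express each
# word's rate per million tokens, the usual unit for comparing
# frequencies across corpora of different sizes.
from collections import Counter
import re

def per_million(corpus: str) -> dict[str, float]:
    """Map each word to its frequency per million tokens."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c * 1_000_000 / total for w, c in counts.items()}

rates = per_million("the cat sat on the mat")
# "the" is 2 of 6 tokens, i.e. roughly 333,333 per million
```

Running this same computation over yearly snapshots of web text is exactly how one would test, empirically, whether vocabulary variance is shrinking or whether sentinel words like "delve" are spiking.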
Here's to a comment that will never be read because of all the noise in this thread :/
Sound familiar to anyone?
On the level of meta-discourse you seem to want to also speak to: dang, even when people have the Official Corporate Approved Perspective (in particular, the claim that it's "like being able to chat with someone that has an infinite vocabulary" is probably the silliest delusional AI hype I've heard all week) and the most upvotes in the thread, they still think they're an embattled ideological minority. I'm starting to think that literally zero people in the modern world don't have, or at least affect, a victim complex of some kind.
The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.
The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.
1. Prove the human-ness of an author...
2. ...without grossly encroaching on their privacy.
3. Ensure that the author isn't passing off AI-generated material as their own.
We'll leave out the "don't let AI models train on my data" part for now.
Whatever solution we come up with, if any, will necessarily be mired in the politics of privacy, anonymity, and/or DRM. In any case, it's hard to conceive of a world where the human web returns as we once knew it.