Why would you crawl the web interface when the data is so readily available in an even better format?
If you don't live and breathe Wikipedia, it is going to soak up a lot of time figuring out Wikipedia's XML format and markup language, not to mention re-learning how to parse XML. HTTP requests and bashing through the HTML are all everyday web skills and familiar scripting; it's more reflexive and well understood. The right way would probably be much easier, but figuring it out would take too long.
Although that is all pre-ChatGPT logic. Now I'd start by asking it to solve my problem.
Being a truly good web crawler takes a lot of work, and being a polite web crawler takes yet more work of a different kind.
And then, of course, you add the bad coding practices on top of it, ignoring robots.txt or using robots.txt as a list of URLs to scrape (which can be either deliberate or accidental), hammering the same pages over and over, preferentially "retrying" the very pages that are timing out because you found the page that locks the DB for 30 seconds in a hard query that even the website owners themselves didn't know was possible until you showed them by taking down the rest of their site in the process... it just goes downhill from there. Being "not bad" is already not good enough and there's plenty of "bad" out there.
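For contrast, the "not bad" baseline is genuinely small: Python's standard library can already parse robots.txt and expose crawl delays. A minimal sketch (the robots.txt content below is made up for illustration):

```python
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks permission and honors the requested delay.
print(rp.can_fetch("MyBot", "https://example.com/article"))    # True
print(rp.can_fetch("MyBot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot"))                                 # 5
```

In a real crawler you would fetch the robots.txt with `rp.set_url(...)` and `rp.read()`, and sleep for the crawl delay between requests to the same host.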
Have you seen the lack of experience that is getting through the hiring process lately? It feels like 80% of the people onboarding are only able to code to pre-existing patterns without an ability to think outside the box.
I'm just bitter because I have 25 years of experience and can't even get a damn interview no matter how low I go on salary expectations. I obviously have difficulty in the soft skills department, but companies who need real work to get done reliably used to value technical skills over social skills.
The generic web crawler works (more or less) everywhere. The Wikipedia dump solution works on Wikipedia dumps.
Also bear in mind: this is tied in with search engines and other places where the AI bot follows links from search results, etc. So they'd need extra logic to detect a Wikipedia link, then find the matching article in the dump, and then add the original link back as a reference for the source.
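In fairness, that "detect a Wikipedia link" step is tiny; a rough sketch of what the mapping from URL to dump article title might look like (the function name is mine):

```python
from urllib.parse import unquote, urlparse

def wikipedia_title(url):
    """If url is a Wikipedia article link, return its article title
    (the lookup key into a dump), else None."""
    p = urlparse(url)
    if p.netloc.endswith("wikipedia.org") and p.path.startswith("/wiki/"):
        return unquote(p.path[len("/wiki/"):]).replace("_", " ")
    return None

print(wikipedia_title("https://en.wikipedia.org/wiki/Web_crawler"))  # Web crawler
print(wikipedia_title("https://example.com/wiki/Foo"))               # None
```

The harder parts (redirects, language editions, dump freshness) are real, but the link detection itself is not the obstacle.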
Also, in one article on this topic I read about traffic spikes around people's deaths and the like; in that scenario they want the latest version of the article, not a day-old dump.
So yeah, I guess they used the simple, straightforward way and didn't care much about the consequences.
> Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.
So it doesn't seem to be driven by "Search the web for keywords, follow links, slurp content" but trying to read a bulk of pages all together, then move on to another set of bulk pages, suggesting mass-ingestion, not just acting as a user-agent for an actual user.
But maybe I'm reading too much into the specifics of the article, I don't have any particular internal insights to the problem they're facing I'll confess.
It's entirely possible they don't know about this. I certainly didn't until just now.
Because grifters have no respect or care for other people, nor are they interested in learning how to be efficient. They only care about the least amount of effort for the largest amount of personal profit. Why special-case Wikipedia, when they can just scratch their balls and turn their code loose? It’s not their own money they’re burning anyway; there are more chumps throwing money at them than they know what to do with, so it’s imperative they look competitive and hard at work.
---
I've noticed one crawling my copy of Gitea for the last few months - fetching every combination of https://server/commithash/filepath. My server isn't overloaded by this. It filled up the disk space by generating every possible snapshot, but I count that as a bug in Gitea, not an attack by the crawler. Still, the situation is very dumb, so I set my reverse proxy to feed it a variant of the Wikipedia home page on every AI crawler request for the last few days. The variation has several sections replaced with nonsense, both AI-generated and not. You can see it here: https://git.immibis.com/gptblock.html
I just checked, and they're still crawling, and they've gone 3 layers deep into the image tags of the page. Every URL returns that page if you have the wrong user-agent, so the images do too; but they happen to be at a relative path, so I know how many layers deep they're looking.
Interestingly, if you ask ChatGPT to evaluate this page (GPT interactive page fetches are not blocked) it says it's a fake Wikipedia. You'd think they could use their own technology to evaluate pages.
---
nginx rules for your convenience - be prepared to adjust the filters according to the actual traffic you see in your logs
location = /gptblock.html {root /var/www/html;}
if ($http_user_agent ~* "https://openai.com/gptbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "claudebot@anthropic.com") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "https://developer.amazon.com/support/amazonbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "GoogleOther") {rewrite ^.*$ /gptblock.html last; break;}
To cause deliberate harm, as a DDoS attack. Perhaps a better question is: why would companies who hope to replace human-curated static online information with their own generative service not use the cloak of "scraping" to take down their competition?
In the worst case, Wikipedia will have to require user login, which partially achieves that goal by making the information inaccessible to the general public.
The bigger problem is that the LLMs are so good that their users no longer feel the need to visit these sites directly. It looks like the business model of most of our clients is becoming obsolete. My paycheck is downstream of that, and I don't see a fix for it.
There are CAPTCHAs to block bots, or at least make them pay money to solve them, and some people in the Linux community have also built tools to combat this: proof-of-work challenges, I think, that burn a little CPU energy.
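Those "little CPU energy" tools generally work hashcash-style: the server hands out a challenge that is expensive to grind through but cheap to verify. A rough sketch (difficulty and naming are illustrative assumptions, not any particular tool's scheme):

```python
import hashlib

DIFFICULTY = 3  # required leading zero hex digits; illustrative value

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash meets the difficulty target."""
    nonce = 0
    while True:
        h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if h.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: one hash to check the client did the work."""
    h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return h.startswith("0" * DIFFICULTY)

nonce = solve("hello")
print(verify("hello", nonce))  # True
```

At difficulty 3 a client does roughly 16^3 hashes on average, which is negligible for one human page view but adds up fast for a crawler requesting millions of pages.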
And at the same time, you offer an API that is less expensive than the cost of crawling, and everyone wins.
Multi-billion-dollar companies get their sweet, sweet data; Wikipedia gets money to improve its infrastructure or whatever; users benefit from Wikipedia's quality and engagement.
I've been dealing with this over at golfcourse.wiki for the last couple years. It fucking sucks. The good news is that all the idiot scrapers who don't follow robots.txt seem to fall for the honeypots pretty easily.
Make the honeypot disappear with a big CSS file, make another one disappear with a JS file. Humans aren't aware they are there, bots won't avoid them. Programming a bot to look for visible links instead of invisible links is challenging. The problem is these programmers are ubiquitous, and since they are ubiquitous they're not going to be geniuses.
Honeypot -> autoban
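Server-side, "honeypot -> autoban" can be very small. A minimal sketch of the idea (the path and names are made up; a real setup would persist bans and feed them to the firewall rather than keep a set in memory):

```python
# Hypothetical honeypot path: linked from every page but hidden via CSS,
# so humans never click it and robots.txt-ignoring bots eventually do.
HONEYPOT_PATHS = {"/trap/do-not-follow"}
banned_ips = set()

def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status code; one honeypot hit bans the IP for good."""
    if ip in banned_ips:
        return 403
    if path in HONEYPOT_PATHS:
        banned_ips.add(ip)  # autoban on first honeypot hit
        return 403
    return 200

print(handle_request("203.0.113.9", "/index"))               # 200
print(handle_request("203.0.113.9", "/trap/do-not-follow"))  # 403
print(handle_request("203.0.113.9", "/index"))               # 403 (banned)
```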
It suggests to me that the people running AI crawlers are throwing resources at the problem with little thought.
I'm an entrepreneur who is going to get rich selling printed copies of Wikipedia. I'll pay you to fetch the content for me to print. You get 1000 compromised machines to use. Crawl Wikipedia and give me the data. Go.
Some candidates would (rightfully) point out that the entirety is available as an archive, so for "interviewing purposes" we'd have to ignore that fact.
If it went well, you would pivot back and forth: OK, you wrote a distributed crawler. Wikipedia hires you to block it. What do you do? This cat-and-mouse game goes on indefinitely.
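For the interview scenario above, the single-machine core most candidates would start from is a BFS frontier; `fetch_links` here is a stand-in for real HTTP fetching plus link extraction, and the distribution question is where the conversation would go next:

```python
from collections import deque

def crawl(seed, fetch_links, limit=100):
    """Visit pages breadth-first from seed, never revisiting a URL."""
    seen, frontier, visited = {seed}, deque([seed]), []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the live site.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
print(crawl("a", graph.__getitem__))  # ['a', 'b', 'c']
```

The follow-ups write themselves: shard the frontier across the 1000 machines, dedupe URLs globally, and then defend against exactly this design from the other side of the table.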
I also like Anna's (Creative Commons) framing of the problem being money + attribution + reciprocity.
https://stats.wikimedia.org/#/all-projects/reading/total-pag...
The resource consuming traffic is clearly explained in the linked post:
> This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
I.e. difference between cached content at cdn edge vs hits to core services.
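A toy model of that difference, with a tiny LRU list standing in for the CDN edge: repeat visits to popular pages hit the cache, while a bulk crawl of distinct obscure pages falls through to "core" every single time.

```python
CACHE_SIZE = 3
cache = []  # most-recent-first list of page ids (toy LRU)

def fetch(page: str) -> str:
    """Return which tier served the page: 'edge' (cached) or 'core'."""
    if page in cache:
        cache.remove(page)
        cache.insert(0, page)
        return "edge"
    cache.insert(0, page)
    del cache[CACHE_SIZE:]  # evict least recently used
    return "core"

# A user re-reading popular pages mostly hits the edge...
user = [fetch(p) for p in ["Main", "Cats", "Main", "Cats", "Main"]]
# ...while a bulk crawl of distinct obscure pages always misses.
crawl = [fetch(f"obscure-{i}") for i in range(5)]
print(user.count("edge"), crawl.count("edge"))  # 3 0
```

The real setup is far more elaborate, but the shape of the problem is the same: cache hit rate collapses when the request stream has no repetition.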
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download
[1] https://en.wikipedia.org/wiki/Wikipedia:Database_download
These multi-billion-dollar corps continue to leech off of everyone's labors, and no one seems able to stop them. At what level can entities take action? The courts? Legislation?
we've basically handed over the Internet to a cabal of Big Tech
Obviously, OpenAI won't share their dataset. It's part of their competitive stance.
I don't have a point or solution. However, it seems wasteful for non-experts to be gathering the same data and reinventing the wheel.
The worst part is that every single sociopathic company in the world seems to have simultaneously unleashed their own fleet of crawlers.
Most of the bots downright ignore robots.txt, and some of the crawlers hit the site simultaneously from several IPs. I've been trying to lure the bots into a nepenthes tarpit, which somewhat helps, but ultimately find myself having to firewall entire IP ranges.
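For the firewalling step, checking requests against banned CIDR ranges is only a few lines with the standard library (the ranges below are documentation examples, not real offenders):

```python
import ipaddress

# Hypothetical ban list of CIDR ranges seen hammering the site.
BANNED_RANGES = [ipaddress.ip_network(c)
                 for c in ("203.0.113.0/24", "198.51.100.0/24")]

def is_banned(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BANNED_RANGES)

print(is_banned("203.0.113.42"))  # True
print(is_banned("192.0.2.1"))     # False
```

In practice you'd push the ranges into nftables/iptables rather than check in the application, but the lookup logic is the same.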
I think the most interesting thing here is that it shows that the companies doing these crawls simply don't care who they hurt, as they actively take measures to prevent their victims from stopping them by using multiple IP addresses, snowshoe crawling, evading fingerprinting, and so on.
For Wikipedia, there's a solution served up to them on a plate. But they simply can't be bothered to take it.
And this in turn shows the overall moral standards of those companies - it's the wild west out there, where the weak go to the wall, and those inflicting the damage know what they're doing, and just don't care. Sociopaths.
If someone makes a SETI@Home style daemon I'd contribute my home internet to this worthy goal
It's an interesting project. I wish there were better ways to do this, but I guess we've been at war with crawlers for a while already.
And this free information is not free of rights to respect, either: it's under CC BY-SA, which requires attribution and sharing under the same conditions, the kind of "subtleties" and "details" that AI companies have been wiping their big arses with.
From their 2024 financial statement you can learn that they probably spend about $6,825,794 on site operations (excl. salaries etc.). This includes $3,116,445 for Internet hosting and an estimated $3,709,349 on server infrastructure (estimated as 85% of equipment spend).
Now as of June 30, 2024, the Wikimedia Foundation's net assets totaled approximately $271.6 million, and the Wikimedia Endowment, established to support the long-term sustainability of Wikimedia projects, reported net assets of approximately $144.3 million as of the same date, so combined approximately $415.9 million.
So yes, annual site operations come to about 1.64% of their total assets, and they could operate all the Wikimedia sites till the end of time without raising a single dime in donations ever again.
Sure, they're not going to advertise this fact when running another donation drive, as that would likely make donors start asking pertinent questions about the exact purpose of their donations, but that is just marketing, not "corruption".
I think it’s reasonable to say any shady stuff is a form of corruption
So perhaps price comes with the greatness?
the "corruption" accusations are mostly BS and the usual ideological differences
I'll take Wikipedia, with all its warts, over $BigTech and $VC-driven (==Ad-driven) companies/orgs any day, and it's not even close.
I would like to read about Wikipedia corruption with quality sources (in a separate HN post, which would probably do well), but that's not quite on-topic here. Not only is it off-topic and borderline whataboutism, it's also unsourced, so the comment doesn't actually help someone who isn't in the know. As is, it's not very interesting and kind of useless.
These reasons are probably why it has been downvoted: off topic, not helping, not well researched.
Previous discussion: (2022) https://news.ycombinator.com/item?id=32840097