Why would you crawl the web interface when the data is so readily available in an even better format?
If you don't live and breathe Wikipedia, it is going to soak up a lot of time figuring out Wikipedia's XML format and markup language, not to mention re-learning how to parse XML. HTTP requests and bashing through the HTML are all everyday web skills and familiar scripting; it's more reflexive and well understood. The right way would probably be much easier, but figuring it out would take too long.
Although that is all pre-ChatGPT logic. Now I'd start by asking it to solve my problem.
Being a truly good web crawler takes a lot of work, and being a polite web crawler takes yet more work of a different kind.
And then, of course, you add the bad coding practices on top of it, ignoring robots.txt or using robots.txt as a list of URLs to scrape (which can be either deliberate or accidental), hammering the same pages over and over, preferentially "retrying" the very pages that are timing out because you found the page that locks the DB for 30 seconds in a hard query that even the website owners themselves didn't know was possible until you showed them by taking down the rest of their site in the process... it just goes downhill from there. Being "not bad" is already not good enough and there's plenty of "bad" out there.
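For contrast, the "not bad" baseline is genuinely small: Python's standard library can already parse robots.txt and expose crawl delays. A minimal sketch (the robots.txt content below is made up for illustration):

```python
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks permission and honors the requested delay.
print(rp.can_fetch("MyBot", "https://example.com/article"))    # True
print(rp.can_fetch("MyBot", "https://example.com/private/x"))  # False
print(rp.crawl_delay("MyBot"))                                 # 5
```

In a real crawler you would fetch the robots.txt with `rp.set_url(...)` and `rp.read()`, and sleep for the crawl delay between requests to the same host.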
Have you seen the lack of experience that is getting through the hiring process lately? It feels like 80% of the people onboarding are only able to code to pre-existing patterns without an ability to think outside the box.
I'm just bitter because I have 25 years of experience and can't even get a damn interview no matter how low I go on salary expectations. I obviously have difficulty in the soft skills department, but companies who need real work to get done reliably used to value technical skills over social skills.
The generic web crawler works (more or less) everywhere. The Wikipedia dump solution works on Wikipedia dumps.
Also bear in mind: this is tied in with search engines and other places where the AI bot follows links from search results, etc. So they'd need extra logic to detect a Wikipedia link, then find the matching article in the dump, and then add the original link back as a reference for the source.
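In fairness, that "detect a Wikipedia link" step is tiny; a rough sketch of what the mapping from URL to dump article title might look like (the function name is mine):

```python
from urllib.parse import unquote, urlparse

def wikipedia_title(url):
    """If url is a Wikipedia article link, return its article title
    (the lookup key into a dump), else None."""
    p = urlparse(url)
    if p.netloc.endswith("wikipedia.org") and p.path.startswith("/wiki/"):
        return unquote(p.path[len("/wiki/"):]).replace("_", " ")
    return None

print(wikipedia_title("https://en.wikipedia.org/wiki/Web_crawler"))  # Web crawler
print(wikipedia_title("https://example.com/wiki/Foo"))               # None
```

The harder parts (redirects, language editions, dump freshness) are real, but the link detection itself is not the obstacle.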
Also, in one article on this topic I read about traffic spikes around people's deaths and the like; in that scenario they want the latest version of the article, not a day-old dump.
So yeah, I guess they used the simple, straightforward way and didn't care much about the consequences.
> Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.
So it doesn't seem to be driven by "Search the web for keywords, follow links, slurp content" but trying to read a bulk of pages all together, then move on to another set of bulk pages, suggesting mass-ingestion, not just acting as a user-agent for an actual user.
But maybe I'm reading too much into the specifics of the article, I don't have any particular internal insights to the problem they're facing I'll confess.
It's entirely possible they don't know about this. I certainly didn't until just now.
Because grifters have no respect or care for other people, nor are they interested in learning how to be efficient. They only care about the least amount of effort for the largest amount of personal profit. Why special-case Wikipedia, when they can just scratch their balls and turn their code loose? It’s not their own money they’re burning anyway; there are more chumps throwing money at them than they know what to do with, so it’s imperative they look competitive and hard at work.
---
I've noticed one crawling my copy of Gitea for the last few months - fetching every combination of https://server/commithash/filepath. My server isn't overloaded by this. It filled up the disk space by generating every possible snapshot, but I count that as a bug in Gitea, not an attack by the crawler. Still, the situation is very dumb, so I set my reverse proxy to feed it a variant of the Wikipedia home page on every AI crawler request for the last few days. The variation has several sections replaced with nonsense, both AI-generated and not. You can see it here: https://git.immibis.com/gptblock.html
I just checked, and they're still crawling, and they've gone 3 layers deep into the image tags of the page. Every URL returns that page if you have the wrong user-agent, so the images do too; but they happen to be at a relative path, so I know how many layers deep they're looking.
Interestingly, if you ask ChatGPT to evaluate this page (GPT interactive page fetches are not blocked) it says it's a fake Wikipedia. You'd think they could use their own technology to evaluate pages.
---
nginx rules for your convenience - be prepared to adjust the filters according to the actual traffic you see in your logs
location = /gptblock.html {root /var/www/html;}
if ($http_user_agent ~* "https://openai.com/gptbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "claudebot@anthropic.com") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "https://developer.amazon.com/support/amazonbot") {rewrite ^.*$ /gptblock.html last; break;}
if ($http_user_agent ~* "GoogleOther") {rewrite ^.*$ /gptblock.html last; break;}
To cause deliberate harm, as a DDoS attack. Perhaps a better question is: why would companies who hope to replace human-curated static online information with their own generative service not use the cloak of "scraping" to take down their competition?
In the worst case, Wikipedia will have to require user login, which partially achieves that goal by making the information inaccessible to the general public.
The bigger problem is that the LLMs are so good that their users no longer feel the need to visit these sites directly. It looks like the business model of most of our clients is becoming obsolete. My paycheck is downstream of that, and I don't see a fix for it.
There are CAPTCHAs to block bots, or at least make them pay money to solve them, and some people in the Linux community have also built tools to combat this: proof-of-work challenges, I think, that burn a little CPU energy.
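Those "little CPU energy" tools generally work hashcash-style: the server hands out a challenge that is expensive to grind through but cheap to verify. A rough sketch (difficulty and naming are illustrative assumptions, not any particular tool's scheme):

```python
import hashlib

DIFFICULTY = 3  # required leading zero hex digits; illustrative value

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash meets the difficulty target."""
    nonce = 0
    while True:
        h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if h.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: one hash to check the client did the work."""
    h = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return h.startswith("0" * DIFFICULTY)

nonce = solve("hello")
print(verify("hello", nonce))  # True
```

At difficulty 3 a client does roughly 16^3 hashes on average, which is negligible for one human page view but adds up fast for a crawler requesting millions of pages.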
And at the same time, you offer an API that is less expensive than the cost of crawling, and everyone wins.
Multi-billion-dollar companies get their sweet, sweet data; Wikipedia gets money to improve its infrastructure or whatever; users benefit from Wikipedia's quality and engagement.
I've been dealing with this over at golfcourse.wiki for the last couple years. It fucking sucks. The good news is that all the idiot scrapers who don't follow robots.txt seem to fall for the honeypots pretty easily.
Make the honeypot disappear with a big CSS file, make another one disappear with a JS file. Humans aren't aware they are there, bots won't avoid them. Programming a bot to look for visible links instead of invisible links is challenging. The problem is these programmers are ubiquitous, and since they are ubiquitous they're not going to be geniuses.
Honeypot -> autoban
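Server-side, "honeypot -> autoban" can be very small. A minimal sketch of the idea (the path and names are made up; a real setup would persist bans and feed them to the firewall rather than keep a set in memory):

```python
# Hypothetical honeypot path: linked from every page but hidden via CSS,
# so humans never click it and robots.txt-ignoring bots eventually do.
HONEYPOT_PATHS = {"/trap/do-not-follow"}
banned_ips = set()

def handle_request(ip: str, path: str) -> int:
    """Return an HTTP status code; one honeypot hit bans the IP for good."""
    if ip in banned_ips:
        return 403
    if path in HONEYPOT_PATHS:
        banned_ips.add(ip)  # autoban on first honeypot hit
        return 403
    return 200

print(handle_request("203.0.113.9", "/index"))               # 200
print(handle_request("203.0.113.9", "/trap/do-not-follow"))  # 403
print(handle_request("203.0.113.9", "/index"))               # 403 (banned)
```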
It suggests to me that the people running AI crawlers are throwing resources at the problem with little thought.
I'm an entrepreneur who is going to get rich selling printed copies of Wikipedia. I'll pay you to fetch the content for me to print. You get 1000 compromised machines to use. Crawl Wikipedia and give me the data. Go.
Some candidates would (rightfully) point out that the entirety is available as an archive, so for "interviewing purposes" we'd have to ignore that fact.
If it went well, you would pivot back and forth: OK, you wrote a distributed crawler. Wikipedia hires you to block it. What do you do? This cat-and-mouse game goes on indefinitely.
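For the interview scenario above, the single-machine core most candidates would start from is a BFS frontier; `fetch_links` here is a stand-in for real HTTP fetching plus link extraction, and the distribution question is where the conversation would go next:

```python
from collections import deque

def crawl(seed, fetch_links, limit=100):
    """Visit pages breadth-first from seed, never revisiting a URL."""
    seen, frontier, visited = {seed}, deque([seed]), []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy link graph standing in for the live site.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
print(crawl("a", graph.__getitem__))  # ['a', 'b', 'c']
```

The follow-ups write themselves: shard the frontier across the 1000 machines, dedupe URLs globally, and then defend against exactly this design from the other side of the table.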
I also like Anna's (Creative Commons) framing of the problem being money + attribution + reciprocity.
https://stats.wikimedia.org/#/all-projects/reading/total-pag...
The resource consuming traffic is clearly explained in the linked post:
> This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources.
I.e. difference between cached content at cdn edge vs hits to core services.
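A toy model of that difference, with a tiny LRU list standing in for the CDN edge: repeat visits to popular pages hit the cache, while a bulk crawl of distinct obscure pages falls through to "core" every single time.

```python
CACHE_SIZE = 3
cache = []  # most-recent-first list of page ids (toy LRU)

def fetch(page: str) -> str:
    """Return which tier served the page: 'edge' (cached) or 'core'."""
    if page in cache:
        cache.remove(page)
        cache.insert(0, page)
        return "edge"
    cache.insert(0, page)
    del cache[CACHE_SIZE:]  # evict least recently used
    return "core"

# A user re-reading popular pages mostly hits the edge...
user = [fetch(p) for p in ["Main", "Cats", "Main", "Cats", "Main"]]
# ...while a bulk crawl of distinct obscure pages always misses.
crawl = [fetch(f"obscure-{i}") for i in range(5)]
print(user.count("edge"), crawl.count("edge"))  # 3 0
```

The real setup is far more elaborate, but the shape of the problem is the same: cache hit rate collapses when the request stream has no repetition.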
[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download
[1] https://en.wikipedia.org/wiki/Wikipedia:Database_download
These multi-billion-dollar corps continue to leech off of everyone's labors, and no one seems able to stop them. At what level can entities take action? The courts? Legislation?
we've basically handed over the Internet to a cabal of Big Tech
Obviously, OpenAI won't share their dataset. It's part of their competitive stance.
I don't have a point or solution. However, it seems wasteful for non-experts to be gathering the same data and reinventing the wheel.
The worst part is that every single sociopathic company in the world seems to have simultaneously unleashed their own fleet of crawlers.
Most of the bots downright ignore robots.txt, and some of the crawlers hit the site simultaneously from several IPs. I've been trying to lure the bots into a nepenthes tarpit, which somewhat helps, but ultimately find myself having to firewall entire IP ranges.
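For the firewalling step, checking requests against banned CIDR ranges is only a few lines with the standard library (the ranges below are documentation examples, not real offenders):

```python
import ipaddress

# Hypothetical ban list of CIDR ranges seen hammering the site.
BANNED_RANGES = [ipaddress.ip_network(c)
                 for c in ("203.0.113.0/24", "198.51.100.0/24")]

def is_banned(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BANNED_RANGES)

print(is_banned("203.0.113.42"))  # True
print(is_banned("192.0.2.1"))     # False
```

In practice you'd push the ranges into nftables/iptables rather than check in the application, but the lookup logic is the same.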
I think the most interesting thing here is that it shows that the companies doing these crawls simply don't care who they hurt, as they actively take measures to prevent their victims from stopping them by using multiple IP addresses, snowshoe crawling, evading fingerprinting, and so on.
For Wikipedia, there's a solution served up to them on a plate. But they simply can't be bothered to take it.
And this in turn shows the overall moral standards of those companies - it's the wild west out there, where the weak go to the wall, and those inflicting the damage know what they're doing, and just don't care. Sociopaths.
If someone makes a SETI@Home style daemon I'd contribute my home internet to this worthy goal
It's an interesting project. I wish there were better ways to do this, but I guess we've been at war with crawlers for a while already.
And this free information is not free of rights to respect, either: it's under CC BY-SA, which requires attribution and sharing under the same conditions, the kind of "subtleties" and "details" that AI companies have been wiping their big arses with.
From their 2024 financial statement you can learn that they probably spend about $6,825,794 on site operations (excl. salaries etc.). This includes $3,116,445 for Internet hosting and an estimated $3,709,349 on server infrastructure (estimated as 85% of equipment spend).
Now as of June 30, 2024, the Wikimedia Foundation's net assets totaled approximately $271.6 million, and the Wikimedia Endowment, established to support the long-term sustainability of Wikimedia projects, reported net assets of approximately $144.3 million as of the same date, so combined approximately $415.9 million.
So yes, annual site operations come to about 1.64% of their total assets, and they could operate all the Wikimedia sites till the end of time without raising a single dime in donations ever again.
Sure, they're not going to advertise this fact when running another donation drive, as that would likely make donors start asking pertinent questions about the exact purpose of their donations, but that is just marketing, not "corruption".
I think it’s reasonable to say any shady stuff is a form of corruption
So perhaps price comes with the greatness?
the "corruption" accusations are mostly BS and the usual ideological differences
I'll take Wikipedia, with all its warts, over $BigTech and $VC-driven (==Ad-driven) companies/orgs any day, and it's not even close.
I would like to read about Wikipedia corruption with quality sources (in a separate HN post, which would probably do well), but that's not quite on-topic here. Not only is it off-topic and borderline whataboutism, it's also unsourced, so the comment doesn't actually help someone who isn't in the know. As is, it's not very interesting and kind of useless.
These reasons are probably why it has been downvoted: off topic, not helping, not well researched.
Previous discussion: (2022) https://news.ycombinator.com/item?id=32840097