Why are anime catgirls blocking my access to the Linux kernel?

908 comments

TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.

Any access will fall into either of the following categories:

- client with JS and cookies. In this case the server now has an identity to apply rate limiting to, from the cookie. Humans should never hit it, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.

- amnesiac (no cookies) clients with JS. Each access is now expensive.

(- no JS - no access.)

The point is to prevent parallel crawling and overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis is making prohibitively expensive.
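
A minimal sketch of the "identity to apply rate limiting to" idea, assuming a token bucket keyed by the cookie value. The class names, refill rate and burst size are illustrative assumptions, not Anubis's actual implementation or configuration.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TokenBucket:
    """One bucket per cookie identity; the numbers are assumed for illustration."""
    rate: float = 2.0     # tokens refilled per second
    burst: float = 10.0   # maximum number of stored tokens
    tokens: float = 10.0  # a new identity starts with a full bucket
    last: float = field(default_factory=time.monotonic)

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class IdentityRateLimiter:
    """Maps each cookie identity to its own bucket."""

    def __init__(self) -> None:
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, cookie_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        bucket = self.buckets.get(cookie_id)
        if bucket is None:
            # A fresh identity gets a fresh bucket; obtaining that identity
            # is what costs one solved puzzle.
            bucket = self.buckets[cookie_id] = TokenBucket(last=now)
        return bucket.allow(now)


limiter = IdentityRateLimiter()
burst = [limiter.allow("cookie-a", now=0.0) for _ in range(12)]
print(burst.count(True), "of 12 requests allowed in one instant")
```

Rotating to a new cookie resets the bucket, which is why the cost of solving the puzzle again matters in this model.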

qwery 9mo ago

Yes, I think you're right. The commentary about its (presumed, imagined) effectiveness is very much making the assumption that it's designed to be an impenetrable wall[0] -- i.e. prevent bots from accessing the content entirely.

I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding the math shake out that way should, perhaps, lead one to question their starting point / assumptions[1].

In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?

[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.

[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.

thayne 9mo ago

You don't necessarily need JS, you just need something that can detect if Anubis is used and complete the challenge.

eqvinox 9mo ago

Sure, doesn't change anything though; you still need to spend energy on a bunch of hash calculations.

rocqua 9mo ago

But then you rate limit that challenge.

You could set up a system for parallelizing the creation of these Anubis PoW cookies independent of the crawling logic. That would probably work, but it's a pretty heavy lift compared to 'just run a browser with JavaScript'.

rocqua 9mo ago

This is a good point, presuming the rate limiting is actually applied.

dlenski 9mo ago

> Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit.

This is a nice explanation. It's much clearer than anything I've seen offered by Anubis’s authors, in terms of why or how it could be effective at preventing a site from being ravaged by hordes of ill-behaved bots.

IshKebab 9mo ago

Well maybe, but even then, how many parallel crawls are you going to do per site? 100 maybe? You can still get enough keys to do that for all sites in just a few hours per week.

wraptile 9mo ago

I'm a scraper developer and Anubis would have worked 10 - 20 years ago, but now all broad scrapers run on a real headless browser with full cookie support, which costs relatively nothing in compute. I'd be surprised if LLM bots used anything else, given that they have all of this compute and engineers already available.

That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.

You can actually see this in real life if you google web scraping services and which targets they claim to bypass - all of them bypass generic anti-bots like Cloudflare, Akamai etc. but struggle with custom and rare stuff like Chinese websites or small forums, because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.

jandrese 9mo ago

> That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times" because handling something custom is very difficult in broad scale.

Isn't this what Microsoft is trying to do with their sliding puzzle piece and choose the closest match type systems?

Also, if you come in on a mobile browser it could ask you to lay your phone flat and then shake it up and down for a second or something similar that would be a challenge for a datacenter bot pretending to be a phone.

DanielHB 9mo ago

How do you bypass Cloudflare? I do some light scraping for some personal stuff, but I can't figure out how to bypass it. Like do you randomize IPs using several VPNs at the same time?

I usually just sit there on my phone pressing the "I am not a robot" box when it triggers.

miki123211 9mo ago

This only works if you're a low-value site (which admittedly most sites are).

hahn-kev 9mo ago

Bot blocking through obscurity

sam0x17 9mo ago

> It took the author just few minutes to solve this but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation which is likely just not worth it.

These are trivial for an AI agent to solve though, even with very dumb watered down models.

andai 9mo ago

You can also generate custom solutions at scale with LLMs. So each user could get a different CAPTCHA.

Arnavion 9mo ago

>This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.

>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.

No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if it has that. For myself I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.

(Why do I do it? For most of them I don't enable JS or cookies, so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
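
A sketch of the rule as described in this comment, assuming the decision is a plain substring check on the User-Agent header. The function name is hypothetical, and real deployments may be configured with different or additional rules.

```python
def would_be_challenged(user_agent: str) -> bool:
    """Simplified rule from the comment: only UAs containing 'Mozilla' get the challenge."""
    return "Mozilla" in user_agent


# A typical browser UA contains "Mozilla"; a UA overridden by an extension may not.
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
custom_ua = "my-reader/1.0"  # hypothetical overridden value

print(would_be_challenged(browser_ua))
print(would_be_challenged(custom_ua))
```

Under this assumed rule, the first call reports that the challenge would be served and the second that it would be skipped.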

johnecheck 9mo ago

Sadly, touching the user-agent header more or less instantly makes you uniquely identifiable.

Browser fingerprinting works best against people with unique headers. There are probably millions of people using an untouched Safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.

sillywabbit 9mo ago

If someone's out to uniquely identify your activity on the internet, your User-Agent string is going to be the least of your problems.

Arnavion 9mo ago

UA fingerprinting isn't a problem for me. As I said, I only modify the UA for the handful of sites I visit that use Anubis. I trust those sites enough that I think fingerprinting by them is unlikely, and it wouldn't be a problem even if they did it.

NoMoreNicksLeft 9mo ago

I'll set mine to "null" if the rest of you will set yours...

codedokode 9mo ago

If your headers are new every time then it is very difficult to figure out who is who.

andrewmcwatters 9mo ago

Yes, but you can take the bet, and win more often than not, that your adversary is most likely not tracking visitor probabilities if you can detect that they aren't using a major fingerprinting provider.

jagged-chisel 9mo ago

I wouldn’t think the intention is to s/Mozilla// but to select another well-known UA string.

Animats 9mo ago

> (Why do I do it? For most of them I don't enable JS so the challenge wouldn't pass anyway. For the ones that I do enable JS for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)

Hm. If your site is "sticky", can it mine Monero or something in the background?

We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"

mikestew 9mo ago

> We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"

Doesn't Safari sort of already do that? "This tab is using significant power", or summat? I know I've seen that message, I just don't have a good repro.

zahlman 9mo ago

> Just change your user agent to not have "Mozilla" in it. Anubis only serves you the challenge if you have that.

Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.

Arnavion 9mo ago

It doesn't break the two kernel.org domains that the article is about, nor any of the others I use. At least not in a way that I noticed.

throwawayffffas 9mo ago

In 2025 I think most of the web has moved on from checking user-agent strings. Your bank might still do it but they won't be running Anubis.

msephton 9mo ago

I'm interested in your extension. I'm wondering if I could do something similar to force text encoding of pages into Japanese.

Arnavion 9mo ago

If your Firefox supports sideloading extensions then making extensions that modify request or response headers is easy.

All the API is documented in https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web... . My Anubis extension modifies request headers using `browser.webRequest.onBeforeSendHeaders.addListener()` . Your case sounds like modifying response headers which is `browser.webRequest.onHeadersReceived.addListener()` . Either way the API is all documented there, as is the `manifest.json` that you'll need to write to register this JS code as a background script and whatever permissions you need.

Then zip the manifest and the script together, rename the zip file to "<id_in_manifest>.xpi", place it in the sideloaded extensions directory (depends on distro, e.g. /usr/lib/firefox/browser/extensions), restart Firefox and it should show up. If you need to debug it, you can use the about:debugging#/runtime/this-firefox page to launch a devtools window connected to the background script.

semiquaver 9mo ago

Doesn’t that just mean the AI bots can do the same? So what’s the point?

danieltanfh95 9mo ago

wtf? how is this then better than a captcha or something similar?!

ksymph 9mo ago

This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.

esperent 9mo ago

Every representation I've ever seen of Anubis - including remarkably well preserved statues from antiquity - is either a male human body with a canine head, or fully canine.

This anime girl is not Anubis. It's a modern cartoon character that simply borrows the name because it sounds cool, without caring anything about the history or meaning behind it.

Anime culture does this all the time, drawing on inspiration from all cultures but nearly always only paying the barest lip service to the original meaning.

I don't have an issue with that, personally. All cultures and religions should be fair game as inspiration for any kind of art. But I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source just because they share a name and some other very superficial characteristics.

account42 9mo ago

It's also that the anime style already makes all heads shaped vaguely like felines. Add upwards pointing furry ears and it's not wrong to call it a cat girl.

ksymph 9mo ago

> they share a name and some other very superficial characteristics.

I wasn't implying anything more than that, although now I see the confusing wording in my original comment. All I meant to say was that between the name and appearance it's clear the mascot is canid rather than feline. Not that the anime girl with dog ears is an accurate representation of the Egyptian deity haha.

SnuffBox 9mo ago

It's refreshing to see a reply as thought out as this in today's day and age of "move fast and post garbage".

qwery 9mo ago

I think you're taking it a bit too seriously. In turn, I am, of course, also taking it too seriously.

> I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source

Nobody is claiming that the drawing is Anubis or even a depiction of Anubis, like the statues etc. you are interested in. It's a mascot. "Mascot design by CELPHASE" -- it says, in the screenshot.

Generally speaking -- I can't say that this is what happened with this project -- you would commission someone to draw or otherwise create a mascot character for something after the primary ideation phase of the something. This Anubis-inspired mascot is, presumably, Anubis-inspired because the project is called Anubis, which is a name with fairly obvious connections to and an understanding of "the original source".

> Anime culture does this all the time, ...

I don't know what bone you're picking here. This seems like a weird thing to say. I mean, what anime culture? It's a drawing on a website. Yes, I can see the manga/anime influence -- it's a very popular, mainstream artform around the world.

Sincere6066 9mo ago

irrelevant. still a doggirl.

ChrisRR 9mo ago

I'm assuming the aversion is more about why young anime girls are popping up, not about what animal it is

pak9rabid 9mo ago

Well, thank you for that. That's a great weight off me mind.

JdeBP 9mo ago

... but entirely lacking the primary visual feature that Anubis had.

johnklos 9mo ago

This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.

Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.

As the botmakers circumvent, new methods of proof-of-notbot will be made available.

It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.

This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.

palata 9mo ago

> they are just feigning a lack of understanding to be dismissive of Anubis.

I actually find the featured article very interesting. It doesn't feel dismissive of Anubis, but rather it questions whether this particular solution makes sense or not in a constructive way.

johnklos 9mo ago

I agree - the article is interesting and not dismissive.

I was talking more about some of the people here ;)

agwa 9mo ago

It sounds like you're saying that it's not the proof-of-work that's stopping AI scrapers, but the fact that Anubis imposes an unusual flow to load the site.

If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.

chrismorgan 9mo ago

> If that's true Anubis should just remove the proof-of-work part

This is my very strong belief. To make it even clearer how absurd the present situation is, every single one of the proof-of-work systems I’ve looked at has been using SHA-256, which is basically the worst choice possible.

Proof-of-work is bad rate limiting which depends on a level playing field between real users and attackers. This is already a doomed endeavour. Using SHA-256 just makes it more obvious: there’s an asymmetry factor in the order of tens of thousands between common real-user hardware and software, and pretty easy attacker hardware and software. You cannot bridge such a divide. If you allow the attacker to augment it with a Bitcoin mining rig, the efficiency disparity factor can go up to tens of millions.

These proof-of-work systems are only working because attackers haven’t tried yet. And as long as attackers aren’t trying, you can settle for something much simpler and more transparent.

If they were serious about the proof-of-work being the defence, they’d at least have started with something like Argon2d.

kaszanka 9mo ago

This is basically what most of the challenge types in go-away (https://git.gammaspectra.live/git/go-away/wiki/Challenges) do.

amarant 9mo ago

I feel like the future will have this, plus ads displayed while the work is done, so websites can profit while they profit.

tptacek 9mo ago

Exactly this.

empath75 9mo ago

I don't think anything will stop AI companies for long. They can do spot AI agentic checks of workflows that stop working for some reason and the AI can usually figure out what the problem is and then update the workflow to get around it.

tptacek 9mo ago

Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.

Consider:

An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.

A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.

A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.

And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.

But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.

There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.

This is also how the Blu-Ray BD+ system worked.

The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).

The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.

sugarpimpdorsey 9mo ago

A lot of these passive types of anti-abuse systems rely on the rather bold assumption that making a bot perform a computation is expensive, but isn't for me as an ordinary user.

According to whom or what data exactly?

AI operators are clearly well-funded operations and the amount of electricity and CPU power is negligible. Software like Anubis and nearly all its identical predecessors grant you access after a single "proof". So you then have free rein to scrape the whole site.

The best physical analogy is those shopping cart things where you have to insert a quarter to unlock the cart, and you presumably get it back when you return the cart.

The group of people this doesn't affect is the well-funded: a quarter is a small price to pay for leaving your cart in the middle of the parking lot.

Those that suffer the most are the ones that can't find a quarter in the cupholder so you're stuck filling your arms with groceries.

Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.) Maybe. But if you're Scrooge McDuck, who is counting?

xena 9mo ago

For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that being idempotent to avoid having to store state. This didn't scale and was prone to issues like in the OP.

The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.

The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.
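
A rough sketch of the stateful flow described above: the server mints a challenge backed by 64 random bytes, the client searches for a nonce, and the server validates against its stored state. The leading-zeros rule over SHA-256, the difficulty value, the in-memory store and all function names are assumptions for illustration, not the code from the PR.

```python
import hashlib
import secrets
from typing import Dict, Tuple

DIFFICULTY = 3  # required leading zero hex digits (assumed; small so the demo is fast)
_pending: Dict[str, bytes] = {}  # challenge id -> random data, kept server-side


def mint_challenge() -> Tuple[str, str]:
    """Server: create state holding 64 random bytes and send it to the client."""
    challenge_id = secrets.token_hex(8)
    random_data = secrets.token_bytes(64)
    _pending[challenge_id] = random_data
    return challenge_id, random_data.hex()


def solve(random_hex: str, difficulty: int = DIFFICULTY) -> int:
    """Client: brute-force a nonce whose digest starts with enough zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{random_hex}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def validate(challenge_id: str, nonce: int, difficulty: int = DIFFICULTY) -> bool:
    """Server: a single hash to verify; the state is consumed so it cannot be replayed."""
    random_data = _pending.pop(challenge_id, None)
    if random_data is None:
        return False
    digest = hashlib.sha256(f"{random_data.hex()}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


challenge_id, random_hex = mint_challenge()
nonce = solve(random_hex)
print("accepted:", validate(challenge_id, nonce))
```

Because the random data lives only in server state, a solution cannot be precomputed from request metadata, which is the property the stateless scheme lacked.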

landhar 9mo ago

> But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.

Based on my own experience fighting these AI scrapers, I feel that the way they are actually implemented means that in practice there is asymmetry in the work scrapers have to do vs humans.

The pattern these scrapers follow is that they are highly distributed. I’ll see a given {ip, UA} pair make a request to /foo immediately followed by _hundreds_ of requests from completely different {ip, UA} pairs to all the links from that page (i.e. /foo/a, /foo/b, /foo/c, etc.).

This is a big part of what makes these AI crawlers such a challenge for us admins. There isn’t a whole lot we can do to apply regular rate limiting techniques: the IPs are always changing and are no longer limited to corporate ASNs (I’m now seeing IPs belonging to consumer ISPs and even cell phone companies), and the User Agents all look genuine. But when looking through the logs you can see the pattern that all these unrelated requests are actually working together to perform a BFS traversal of your site.

Given this pattern, I believe that’s what makes the Anubis approach actually work well in practice. A given user will encounter the challenge once when accessing the site for the first time, then they’ll be able to navigate through it without incurring any further cost, while the AI scrapers would need to solve the challenge for every single one of their “nodes” (or whatever it is they would call their {ip, UA} pairs). From a site reliability perspective, I don’t even care if the crawlers manage to solve the challenge or not. That it manages to slow them down enough to rate limit them as a network is enough.

To be clear: I don’t disagree with you that the cost incurred by regular human users is still high. But I don’t think it’s fair to say that the cost to the adversary is not asymmetrical here. It wouldn’t be asymmetrical if the AI crawlers hadn’t converged towards an implementation that behaves as a DDoS botnet.
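
The asymmetry argued above can be put as a small count, assuming one solved challenge per identity ({ip, UA} pair or cookie) and a crawler that uses a fresh identity for every page. The function name and the page counts are hypothetical.

```python
import math


def challenges_solved(pages_fetched: int, pages_per_identity: int) -> int:
    """Challenges needed when each identity must solve one before fetching its pages."""
    return math.ceil(pages_fetched / pages_per_identity)


# A human keeps one cookie for the whole visit.
human = challenges_solved(pages_fetched=200, pages_per_identity=200)

# A distributed crawler that sends every request from a different {ip, UA} pair.
crawler = challenges_solved(pages_fetched=200, pages_per_identity=1)

print(human, crawler)
```

For the same 200 pages, the human pays once while this style of crawler pays 200 times; a crawler that reuses an identity loses the asymmetry but becomes rate-limitable.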

akoboldfrying 9mo ago

The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:

botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond

humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)

The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:

1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)

2. Parallel computation (this improves the human experience with no effect for bots)

ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.
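
The rough model above can be written out directly. The constants c_1 and c_2, the price per CPU-second and the two schedules compared below are assumed values chosen only to illustrate suggestion (1).

```python
from typing import Sequence


def bot_pain(n_bot_requests: int, cpu_work_per_request_s: float,
             dollars_per_cpu_second: float) -> float:
    """botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond"""
    return n_bot_requests * cpu_work_per_request_s * dollars_per_cpu_second


def human_pain(elapsed_s: Sequence[float], c_1: float, c_2: float) -> float:
    """humanPain = c_1 * max(elapsed) + c_2 * avg(elapsed)"""
    return c_1 * max(elapsed_s) + c_2 * (sum(elapsed_s) / len(elapsed_s))


C_1, C_2 = 10.0, 1.0  # assumed weights with c_1 >> c_2

# Same total work (5 s over 20 page loads), scheduled two ways.
one_big_challenge = [5.0] + [0.0] * 19   # one 5 s wait, then free
many_small_challenges = [0.25] * 20      # 250 ms on every page load

print(human_pain(one_big_challenge, C_1, C_2))
print(human_pain(many_small_challenges, C_1, C_2))

# Cost to a cookie-less crawler paying 0.25 s of CPU on each of a million requests,
# at an assumed price of $0.00001 per CPU-second.
print(bot_pain(1_000_000, 0.25, 0.00001))
```

With these assumed weights the spread-out schedule scores far lower on humanPain for the same total CPU time, which is the effect suggestion (1) relies on.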

seba_dos1 9mo ago

> The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).

No, that's missing the point. Anubis is effectively a DDoS protection system, all the talking about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.

If these bots would clone git repos instead of unleashing the hordes of dumbest bots on Earth pretending to be thousands and thousands of users browsing through git blame web UI, there would be no need for Anubis.

account42 9mo ago

> There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.

That depends on what you count as normal users though. Users that want to use alternative players also have to deal with this, and since yt-dlp (and youtube-dl before it) has been able to provide a solution for those users, and bots can just do the same, I'm not sure I'd call the scheme successful in any way.

hedora 9mo ago

This was obviously dumb when it launched:

1) scrapers just run a full browser and wait for the page to stabilize. They did this before this thing launched, so it probably never worked.

2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.

Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.

pilif 9mo ago

> This was obviously dumb when it launched:

Yes. Obviously dumb but also nearly 100% successful at the current point in time.

And likely going to stay successful as the non-protected internet still provides enough information to dumb crawlers that it’s not financially worth it to even vibe-code a workaround.

Or in other words: Anubis may be dumb, but the average crawler that is completely exhausting some sites' resources is even dumber.

And so it all works out.

And so the question remains: how dumb was it exactly, when it works so well and continues to work so well?

pama 9mo ago

I agree. Your estimate for (2), about 0.0022 kWh, corresponds to about a sixth of the charge of an iPhone 15 Pro and would take longer than ten minutes on the phone, even at max power draw. It feels about right for the amount of energy/compute of a large modern MoE loading large pages of several 10k tokens. For example this tech (a couple of months old) could input 52.3k tokens per second to a 672B parameter model, per H100 node instance, which probably burns about 6–8 kW while doing it. The new B200s should be about 2x to 3x more energy efficient, but your point still holds within an order of magnitude.

https://lmsys.org/blog/2025-05-05-large-scale-ep/
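
The arithmetic behind those figures can be checked in a few lines. The 5 s and 1600 W come from the parent comment; the battery capacity (about 12.7 Wh) and the sustained phone power draw (10 W) are assumptions made here for the check, not numbers from the thread.

```python
JOULES_PER_KWH = 3.6e6


def energy_kwh(seconds: float, watts: float) -> float:
    """Energy in kWh for a constant power draw over a duration."""
    return seconds * watts / JOULES_PER_KWH


# From the thread: about 5 seconds at 1600 W to process one page.
page_kwh = energy_kwh(5, 1600)

# Assumptions for the comparison (not from the thread).
PHONE_BATTERY_WH = 12.7   # assumed iPhone 15 Pro battery capacity
PHONE_MAX_DRAW_W = 10.0   # assumed sustained maximum power draw

battery_fraction = page_kwh * 1000 / PHONE_BATTERY_WH
minutes_on_phone = page_kwh * 1000 / PHONE_MAX_DRAW_W * 60

print(f"{page_kwh:.4f} kWh per page")
print(f"{battery_fraction:.2f} of the assumed battery")
print(f"{minutes_on_phone:.1f} minutes at the assumed draw")
```

Under these assumptions the result is roughly 0.0022 kWh, a bit over a sixth of the battery, and about 13 minutes, consistent with the estimate above.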

rob_c 9mo ago

The argument doesn't quite hold. The mass scraping (for training) is almost never done by a GPU system; it's almost always done by a dedicated system running a full Chrome fork in some automated way (not just the signatures but some bugs give that away).

And frankly processing a single page of text is run within a single token window so likely is run for a blink (ms) before moving onto the next data entry. The kicker is it's run over potentially thousands of times depending on your training strategy.

At inference there's now a dedicated tool that may perform a "live" request to scrape the site contents. But then this is just pushed into a massive context window to give the next token anyway.

technion 9mo ago

It really should be recognised just how many people are watching Cloudflare interstitials on nearly every site these days (and I totally get why this happens) yet making a huge amount of noise about Anubis on a very small number of sites.

mlyle 9mo ago

I don't trip over Cloudflare except when in a weird VPN, and then it always gets out of my way after the challenge.

Anubis screws with me a lot, and often doesn't work.

elric 9mo ago

I hit Cloudflare's garbage about as much as I hit Anubis. With the difference that far more sites use Cloudflare than Anubis, thus Anubis is far worse at triggering false positives.

tgv 9mo ago

That says something about the chosen picture, doesn't it? Probably that it's not well liked. It certainly isn't neutral, while the Cloudflare page is.

jcelerier 9mo ago

Both are equally terrible - one doesn't require explanations to my boss though

petralithic 9mo ago

We can make noise about both things, and how they're ruining the internet.

account42 9mo ago

Cloudflare's solution works without javascript enabled unless the website turns up the scare level to max or you are on an IP with already bad reputation. Anubis does not.

But at the end of the day both are shit and we should not accept either. That includes not using one as an excuse for the other.

lupusreal 9mo ago

Over the past few years I've read far more comments complaining about Cloudflare doing it than Anubis. In fact, this discussion section is the first time I've seen people talking about Anubis.

ronsor9mo ago

TO BE FAIR

I dislike those even more.

psionides9mo ago

The problem is that 7 + 2 on a submission form only affects people who want to submit something; Anubis affects every user who wants to read something on your site.

account429mo ago

The question then is why read-only users are consuming so many resources that serving them big chunks of JS instead reduces the load on the server. Maybe improve your rendering and/or caching before employing DRM solutions that are doomed to fail anyway.

monooso9mo ago

The author makes it very clear that he understands the problem Anubis is attempting to solve. His issue is that the chosen approach doesn't solve that problem; it just inhibits access for humans, particularly those with limited access to compute resources.

That's the opposite of being dismissive. The author has taken the time to deeply understand both the problem and the proposed solution, and has taken the time to construct a well-researched and well-considered argument.

Aurornis9mo ago

> This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.

This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.

> This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.

The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.

You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.

It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
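To make the "designed for the computer to solve" point concrete, here is a minimal sketch of a hashcash-style proof of work. It assumes a SHA-256 puzzle where the digest must start with a number of zero hex digits; the function names, the challenge string, and the difficulty encoding are illustrative assumptions, not Anubis's actual code.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce whose SHA-256 hex digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Checking a claimed solution costs the server exactly one hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # Hypothetical challenge value; a real server would issue a random one.
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

Each extra zero hex digit multiplies the expected number of attempts by 16, and nothing in the loop needs a human or a browser, which is the asymmetry the comment describes.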

necovek9mo ago

But for actual live users who don't see anything but a transient screen, Anubis is a better experience than all those pesky CAPTCHAs (I am bored of trying to recognize bikes, pedestrian crossings, buses, hydrants).

The question is if this is the sweet spot, and I can't find anyone doing the comparative study (how many annoyed human visitors, how many humans stopped and, obviously, how many bots stopped).

cakealert9mo ago

This arms race will have a terminus. The bots will eventually be indistinguishable from humans. Some already are.

overfeed9mo ago

> The bots will eventually be indistinguishable from humans

Not until they get issued government IDs they won't!

Extrapolating from current trends, some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade, and naturally, this will be included in the anti-bot arsenal. It will be up to the site operator to trust identities signed by the Russian government.

1. Despite what Sam Altman's eyeball company will try to sell you, government registers will always be the anchor of trust for proof-of-identity, they've been doing it for centuries and have become good at it and have earned the goodwill.

kjkjadksj9mo ago

Maybe there will be a way to certify humanness. Human testing facility could be a local office you walk over to get your “I am a human” hardware key. Maybe it expires after a week or so to ensure that you are still alive.

neumann9mo ago

It will be hard to tune them to be just the right level of ignorant and slow as us though!

TylerE9mo ago

No, it’s exactly because I understand that it bothers me. I understand it will be effective against bots for a few months at best, and legitimate human users will be stuck dealing with the damn thing for years to come. Just like captchas.

ehnto9mo ago

It's been going on for decades now too. It's a cat and mouse game that will be with us for as long as people try to exploit online resources with bots. Which will be until the internet is divided into nation nets, suffocated by commercial interests, and we all decide to go play outside instead.

interstice9mo ago

The cost benefit calculus for workarounds changes based on popularity. Your custom lock might be easy to break by a professional, but the handful of people who might ever care to pick it are unlikely to be trying that hard. A lock which lets you into 5% of houses however might be worth learning to break.

necovek9mo ago

> The point is that they hadn't, and this worked for quite a while.

That's what I was hoping to get from the "Numbers" section.

I generally don't look up the logs or numbers on my tiny, personal web spaces hosted on my server, and I imagine I could, at some point, become the victim of aggressive crawling (or maybe I have without noticing because I've got an oversized server on a dual link connection).

But the numbers actually only show the performance of doing the PoW, not the effect it has had on any site — I am just curious, and I'd love it if someone has done the analysis, ideally grouped by the bot type ("OpenAI bot was responsible for 17% of all requests, this got reduced from 900k requests a day to 0 a day"...). Search, unfortunately, only gives me all the "Anubis is helping fight aggressive crawling" blog articles, nothing with substance (I haven't tried hard, I admit).

Edit: from further down the thread there's https://dukespace.lib.duke.edu/server/api/core/bitstreams/81... but no analysis of how many real customers were denied — more data would be even better

topranks9mo ago

Sure.

It might be a tool in the box. But it’s still cat and mouse.

In my place we quickly concluded the scrapers have tons of compute and the “proof-of-work” aspect was meaningless to them. It’s simply the “response from site changed, need to change our scraping code” aspect that helps.

rozab9mo ago

>But it takes a human some time and work to tell the crawler HOW.

Yes, for these human-based challenges. But this challenge is defined in code. It's not like crawlers don't run JavaScript. It's 2025, they all use headless browsers, not curl.

account429mo ago

If you are going to rely on security through obscurity there are plenty of ways to do that that won't block actual humans because they dare use a non-mainstream browser. You can also do it without displaying cringeworthy art that is only there to get people to pay for the DRM solution you are peddling - that shit has no place in the open source ecosystem.

dcow9mo ago

I deployed a proof of work based auth system once where every single request required hashing a new nonce. Compare with Anubis where only one request a week requires it. The math said doing it that frequently, and with variable argon params the server could tune if it suspected bots, would be impactful enough to deter bots.

Would I do that again? Probably not. These days I’d require a weekly mDL or equivalent credential presentation.

I have to disagree that an anti-bot measure that only works globally for a few weeks until bots trivially bypass it is effective. In an arms race against bots the bots win. You have to outsmart them by challenging them to do something that only a human can do or is actually prohibitively expensive for bots to do at scale. Anubis doesn't pass that test. And now it’s littered everywhere defunct and useless.

Kwpolska9mo ago

With all the SPAs out there, if you want to crawl the entire Web, you need a headless browser running JavaScript. Which will pass Anubis for free.

dwaite9mo ago

> As the botmakers circumvent, new methods of proof-of-notbot will be made available.

Yes, but the fundamental problem is that the AI crawler does the same amount of work as a legitimate user, not more.

So if you design the work such that it takes five seconds on a five year old smartphone, it could inconvenience a large portion of your user base. But once that scheme is understood by the crawler, it will delay the start of their aggressive crawling by... well-under five seconds.

An open source javascript challenge as a crawler blocker may work until it gets large enough for crawlers to care, but then they just have an engineer subscribe to changes on GitHub and have new challenge algorithms implemented before the majority of the deployment base migrates.

numpad09mo ago

Weren't there also weird behaviors reported by webadmins across the world, like crawlers used by LLM companies fetching evergreen data ad nauseam or something along those lines? I thought the point of adding PoW rather than just blocking them was to convince them to at least do it right.

raxxorraxor9mo ago

Every time we need to deploy such mechanisms, we reward those that already crawled the data and penalize newcomers and other honest crawlers.

For some sites Anubis might be fitting, but it should be mindfully deployed.

casey29mo ago

You don't even need to go there. If the damn thing didn't work the site admin wouldn't have added it and kept it.

Sure the program itself is jank in multiple ways but it solves the problem well enough.

windward9mo ago

Many sufficiently technical people take to heart:

- Everything is pwned

- Security through obscurity is bad

Without taking to heart:

- What a threat model is

And settle on a kind of permanent contrarian nihilist doomerism.

Why eat greens? You'll die one day anyway.

colordrops9mo ago

On a side note, is the anime girl image customizable? I did a quick Google search and it seems that only the commercial version offers rebranding.

boomboomsubban9mo ago

It's free software. The paid version includes an option to change it, and they ask politely that you don't change it otherwise.

TZubiri9mo ago

As I understand it, this is Proof of Work, which is strictly not a mouse and cat situation.

account429mo ago

It is because you are dealing with crawlers that already have a nontrivial cost per page, adding something relatively trivial that is still within the bounds regular users accept won't change the motivations of bad actors at all.

wat100009mo ago

Technical people are prone to black-and-white thinking, which makes it hard to understand that making something more difficult will cause people to do it less even though it’s still possible.

mattnewton9mo ago

I think the argument on offer is more, this juice isn't worth the squeeze. Each user is being slowed down and annoyed for something that bots will trivially bypass if they become aware of it.

ramblerman9mo ago

Did you read the article? OP doesn't care about bots figuring it out. It's about the compute needed to do the work.

It's quite an interesting piece, I feel like you projected something completely different onto it.

Your point is valid, but completely adjacent.

odo12429mo ago

Also, it forces the crawler to gain code execution capabilities, which for many companies will just make them give up and scrape someone else.

wredcoll9mo ago

I don't know if you've noticed, but there's a few websites these days that use javascript as part of their display logic.

sneak9mo ago

The fundamental failure of this is that you can’t publish data to the web and not publish data to the web. If you make things public, the public will use it.

It’s ineffective. (And furry sex-subculture propaganda pushed by its author, which is out of place in such software.)

rootsudo9mo ago

When I read it, I instantly knew it was Anubis. I hope the anime catgirls never disappear from that project :)

hdndiebf9mo ago

This anime thing is the one thing about computer culture that I just don't seem to get. I did not get it as a child, when suddenly half of children's cartoons became anime and I just disliked the aesthetic. I didn't get it in school, when people started reading manga. I'll probably never get it. Therefore I sincerely hope they do go away from Anubis, so I can further dwell in my ignorance.

timcambrant9mo ago

I feel the same. It's a distinct part of nerd culture.

In the '70s, if you were into computers you were most likely also a fan of Star Trek. I remember an anecdote from the 1990s when an entire dial-up ISP was troubleshooting its modem pools because there were zero people connected and they assumed there was an outage. The outage happened to occur exactly while that week's episode of X-Files was airing in their time zone. Just as the credits rolled, all modems suddenly lit up as people connected to IRC and Usenet to chat about the episode. In ~1994 close to 100% of residential internet users also happened to follow X-Files on linear television. There was essentially a 1:1 overlap between computer nerds and sci-fi nerds.

Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.

armada6519mo ago

But what if they choose a different image that you don't get? What if they used an abstract modern art piece that no one gets? Oh the horror!

Aachen9mo ago

You don't have to get it to be able to accept that others like it. Why not let them have their fun?

This sounds more as though you actively dislike anime than merely not seeing the appeal or being "ignorant". If you were to ignore it, there wouldn't be an issue...

balamatom9mo ago

Might've caught on because the animes had plots, instead of considering viewers to have the attention spans of idiots like Western kids' shows (and, in the 21st century, software) tend to do.

bawolff9mo ago

It's nice to see there is still some whimsy on the internet.

Everything got so corporate and sterile.

account429mo ago

Everyone copying the same Japanese cartoon style isn't any better than everyone copying corporate memphis.

ghssds9mo ago

As Anubis the Egyptian god is represented as a dog-headed human, I thought the drawing was of a dog-girl.

nemomarx9mo ago

Perhaps a jackal girl? I guess "cat girl" gets used very broadly to mean kemomimi (pardon the spelling) though

Der_Einzige9mo ago

It's not the only project with an anime girl as its mascot.

ComfyUI has what I think is a foxgirl as its official mascot, and that's the de-facto primary UI for generating Stable Diffusion or related content.

SnuffBox9mo ago

I've noticed the word "comfy" used more than usual recently and often by the anime-obsessed, is there cultural relevance I'm not understanding?

bakugo9mo ago

It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.

debugnik9mo ago

> as AI scrapers bother implementing the PoW

That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:

> which is trivial for them, as the post explains

Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.

> figure out that they can simply remove "Mozilla" from their user-agent

And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.

skydhash9mo ago

It's more about the (intentional?) DDoS from AI scrapers than about preventing them from accessing the content. Bandwidth is not cheap.

unclad59689mo ago

I'm not on Firefox or any Firefox derivative and I still get anime cat girls making sure I'm not a bot.

guappa9mo ago

We all know it's doomed

NelsonMinar9mo ago

¡Nyah!

bawolff9mo ago

> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.

Counterpoint - it seems to work. People use Anubis because it's the best of bad options.

If theory and reality disagree, it means either you are missing something or your theory is wrong.

semiquaver9mo ago

Counter-counter point: it only stopped them for a few weeks and now it doesn’t work: https://news.ycombinator.com/item?id=44914773

jeroenhd9mo ago

Geoblocking China and Singapore solves that problem, it seems, at least the non-residential IPs (though I also see a lot of aggressive bots coming from residential IP space from China).

I wish the old trick of sending CCP-unfriendly content to get the great firewall to kill the connection for you still worked, but in the days of TLS everywhere that doesn't seem to work anymore.

Aachen9mo ago

Only Huawei so far, no? That could be easy to block on a network level for the time being

Of course we knew from the beginning that this first stage of "bots don't even try to solve it, no matter the difficulty" isn't a forever solution

sidewndr469mo ago

> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans

I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present"

Lammy9mo ago

I enjoyed the furor around the 2008 RapidShare captcha lol

- https://www.htmlcenter.com/blog/now-thats-an-annoying-captch...

- https://depressedprogrammer.wordpress.com/2008/04/20/worst-c...

- https://medium.com/xato-security/a-captcha-nightmare-f6176fa...

classichasclass9mo ago

The problem with that CAPTCHA is you're not allowed to solve it on Saturdays.

windward9mo ago

I wonder if it's an intentional quirk that you can only pass some CAPTCHAs if you're a human who knows what an American fire hydrant or school bus looks like?

wingworks9mo ago

There are also services out that will solve any CAPTCHA for you at a very small cost to you. And an AI company will get steep discounts with the volumes of traffic they do.

There are some browser extensions for it too, like NopeCHA, it works 99% of the time and saves me the hassle of doing them.

Any site using CAPTCHAs today is really only hurting their real customers and the low-hanging fruit.

Of course this assumes they can't solve the captcha themselves, with AI, which often they can.

petesergeant9mo ago

Yes, but not at a rate that enables them to be a risk to your hosting bill. My understanding is that the goal here isn't to prevent crawlers, it's to prevent overly aggressive ones.

bawolff9mo ago

Well the problem is that computers got good at basically everything.

Early 2000s captchas really were like that.

ok1234569mo ago

The original reCAPTCHA was doing distributed book OCR. It was sold as an altruistic project to help transcribe old books.

pkal9mo ago

Superficial comment regarding the catgirl, I don't get why some people are so adamant and enthusiastic for others to see it, but if you like me find it distasteful and annoying, consider copying these uBlock rules: https://sdf.org/~pkal/src+etc/anubis-ublock.txt. Brings me joy to know what I am not seeing whenever I get stopped by this page :)

squigz9mo ago

I don't get why so many people find it "distasteful and annoying"

sugarpimpdorsey9mo ago

Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.

On that note, is kernel.org really using this for free and not the paid version without the anime? Linux Foundation really that desperate for cash after they gas up all the BMW's?

qualeed9mo ago

It's crazy (especially considering anime is more popular now than ever; Netflix alone is making billions a year on anime) that people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard".

magicalhippo9mo ago

> people see a completely innocent little anime picture and immediately think "pervent-dwelling imageboard"

Think you can thank the furries for that.

Every furry I've happened to come across was very pervy in some way, so that's what immediately comes to mind when I see furry-like pictures like the one shown in the article.

YMMV

Seattle35039mo ago

To be fair, that's the sort of place where I spend most of my free time.

gruez9mo ago

"Anime pfp" stereotype is alive and well.

ants_everywhere9mo ago

they've seized the moment to move the anime cat girls off the Arch Linux desktop wallpapers and onto lore.kernel.org.

account429mo ago

It's not crazy at all that anyone who has been online for more than a day has that association.

turtletontine9mo ago

Even if the images aren’t the kind of sexualized (or downright pornographic) content this implies… having cutesy anime girls pop up when a user loads your site is, at best, wildly unprofessional. (Dare I say “cringe”?) For something as serious and legit as kernel.org to have this, I do think it’s frankly shocking and unacceptable.

Dilettante_9mo ago

For me it's the flipside: It makes me think "Ahh, my people!"

creatonez9mo ago

Huh, why would they need the unbranded version? The branded version works just fine. It's usually easier to deploy ordinary open source software than it is for software that needs to be licensed, because you don't need special download pages or license keys.

If it makes sense for an organization to donate to a project they rely on, then they should just donate. No need to debrand if it's not strictly required, all that would do is give the upstream project less exposure. For design reasons maybe? But LKML isn't "designed" at all, it has always exposed the raw ugly interface of mailing list software.

Also, this brand does have trust. Sure, I'm annoyed by these PoW captcha pages, but I'm a lot more likely to enable Javascript if it's the Anubis character, than if it is debranded. If it is debranded, it could be any of the privacy-invasive captcha vendors, but if it's Anubis, I know exactly what code is going to run.

rustystump9mo ago

If I saw an anime pic show up, that'd be a flag. I only know of Anubis’ existence and use of anime from HN.

It is only trusted by a small subset of people who are in the know. It is not about “anime bad” but that a large chunk of the population isn't into it for whatever reason.

I love anime but it can also be cringe. I find this cringe as it seems many others do too.

bogwog9mo ago

I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/

It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?

ronsor9mo ago

That won't work, because garbage data is filtered after the full dataset is collected anyway. Every LLM trainer these days knows that curation is key.

bogwog9mo ago

If the "garbage data" is AI generated, it'll be hard or impossible to filter.

creatonez9mo ago

Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They've dealt with this problem long before LLM-related crawling.

WesolyKubeczek9mo ago

I disagree with the post author in their premise that things like Anubis are easy to bypass if you craft your bot well enough and throw the compute at it.

Thing is, the actual lived experience of webmasters tells that the bots that scrape the internets for LLMs are nothing like crafted software. They are more like your neighborhood shit-for-brain meth junkies competing with one another over who commits more robberies in a day, no matter the profit.

Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.

Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?

int_19h9mo ago

It is the way it is because there are easy pickings to be made even with this low effort, but the more sites adopt such measures, the less stupid your average bot will be.

busterarm9mo ago

Those are just the ones that you've managed to ID as bots.

Ask me how I know.

ok1234569mo ago

Why is kernel.org doing this for essentially static content? Cache control headers and ETAGS should solve this. Also, the Linux kernel has solved the C10K problem.
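For reference, the ETag mechanism mentioned here works roughly as in the sketch below. It is a simplified illustration, not kernel.org's setup: the helper names, the truncated SHA-256 tag, and the `max-age` value are assumptions, and weak validators and the `*` wildcard are ignored. It only saves work when the client actually sends `If-None-Match`.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a strong ETag from the content; the quotes are part of the value."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET on a static resource."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=3600"}
    if if_none_match is not None:
        # If-None-Match may carry a comma-separated list of tags.
        client_tags = [tag.strip() for tag in if_none_match.split(",")]
        if etag in client_tags:
            return 304, headers, b""  # client copy is current; send no body
    return 200, headers, body
```

A client that revalidates gets a bodyless 304; a client that never sends the header gets the full 200 response every time.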

mixologic9mo ago

Because it's static content that is almost never cached because it's infrequently accessed. Thus, almost every hit goes to the origin.

ok1234569mo ago

The contents in question are statically generated, 1-3 KB HTML files. Hosting a single image would be the equivalent of cold serving 100s of requests.

Putting up a scraper shield seems like it's more of a political statement than a solution to a real technical problem. It's also antithetical to open collaboration and an open internet of which Linux is a product.

whatevaa9mo ago

Bots don't respect that.

1gn159mo ago

Use a CDN.

jimmaswell9mo ago

What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?

themafia9mo ago

If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.

Search engines, at least, are designed to index the content, for the purpose of helping humans find it.

Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.

This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."

jimmaswell9mo ago

> copyright attribution

You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "man" from "woman" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
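The vector arithmetic in the parenthetical is usually written as king - man + woman ≈ queen. A toy sketch of it follows; the 2-D vectors are hand-made for illustration (axes loosely meaning "royalty" and "gender") and are an assumption, not values from any trained model.

```python
import math

# Hand-made 2-D "embeddings": (royalty, gender). Purely illustrative values.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.5, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = [word for word in VECTORS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


if __name__ == "__main__":
    print(analogy("man", "woman", "king"))
```

Real embeddings have hundreds of dimensions and the match is only approximate, but the nearest-neighbour lookup has the same shape.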

marvinborner9mo ago

As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.

[1] https://types.pl/@marvin/114394404090478296
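A per-bot tally like the one described can be produced with a few lines over an access log. This is a rough sketch that assumes the common "combined" log format with the user agent as the last quoted field; the regex and the `summarize` name are illustrative, and requests containing escaped quotes would not parse.

```python
import re
from collections import Counter

# Combined Log Format tail: "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LINE_RE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Return (request_count, byte_count) Counters keyed by user-agent string."""
    requests, traffic = Counter(), Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        requests[agent] += 1
        if match.group("bytes") != "-":
            traffic[agent] += int(match.group("bytes"))
    return requests, traffic
```

Calling `requests.most_common(10)` on the result gives the heaviest user agents by request count, which is the comparison the comment makes between crawlers.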

squaresmile9mo ago

Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.

dilDDoS9mo ago

As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.

benou9mo ago

Yep, AI scrapers have been breaking our open-source project gerrit instance hosted at Linux Network Foundation.

Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.

Philpax9mo ago

Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163

zahlman9mo ago

Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?
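For context, plain rate limiting is typically a token bucket per client key, as in the sketch below. The class name and parameters are illustrative assumptions. The catch raised at the top of the thread is the key: a limiter needs a stable identity per client, which crawlers spread across many IPs do not offer, and which the Anubis cookie is said to supply.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so a new client is not penalised
        self._now = now             # injectable clock, handy for testing
        self.updated = now()

    def allow(self) -> bool:
        """Consume one token if available; return whether the request may pass."""
        current = self._now()
        elapsed = current - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per identity (cookie, API key, IP, ...); the key choice is the hard part.
buckets = {}


def allow_request(identity: str, rate: float = 1.0, capacity: float = 5.0) -> bool:
    bucket = buckets.setdefault(identity, TokenBucket(rate, capacity))
    return bucket.allow()
```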

immibis9mo ago

Why haven't they been sued and jailed for DDoS, which is a felony?

ezrast9mo ago

High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...

blibble9mo ago

they seem to be written by either idiots and/or people that don't give a shit about being good internet citizens

either way the result is the same: they induce massive load

well written crawlers will:

  - not hit a specific ip/host more frequently than say 1 req/5s
  - put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
  - limit crawling depth based on crawled page quality and/or response time
  - respect robots.txt
  - make it easy to block them
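A rough sketch of the first, second, and fourth points (per-host delay, FIFO queue, robots.txt) could look like the following. The `PoliteFrontier` class and its method names are hypothetical; it uses the standard library's `urllib.robotparser`, takes robots.txt text as input rather than fetching it, and leaves out depth limiting and fetching entirely.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteFrontier:
    """FIFO URL queue with a minimum delay per host and a robots.txt check."""

    def __init__(self, user_agent: str, min_delay: float = 5.0):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self.queue = deque()
        self.next_allowed = {}  # host -> earliest time of the next request
        self.robots = {}        # host -> parsed robots.txt rules

    def set_robots(self, host: str, robots_txt: str) -> None:
        """Register robots.txt content that the caller has already fetched."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def add(self, url: str) -> None:
        """Queue a newly discovered URL at the back unless robots.txt forbids it."""
        host = urlsplit(url).netloc
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return
        self.queue.append(url)

    def pop_ready(self, now: float):
        """Return the oldest queued URL whose host is off cooldown, else None."""
        for index, url in enumerate(self.queue):
            host = urlsplit(url).netloc
            if now >= self.next_allowed.get(host, 0.0):
                del self.queue[index]
                self.next_allowed[host] = now + self.min_delay
                return url
        return None
```

With `min_delay=5.0` a single host is hit at most once every five seconds no matter how many of its URLs are queued, while URLs for other hosts can still be served in the meantime.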

Aachen9mo ago

- wait 2 seconds for a page to load before aborting the connection

- wait for the previous request to finish before requesting the next page, since that would only induce more load, get even slower, and eventually take everything down

I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites

ChocolateGod9mo ago

I have an S24 (flagship of 2024) and Anubis often takes 10-20 seconds to complete. That time is going to add up if more and more sites adopt it, leading to a worse browsing experience and wasted battery life.

Meanwhile AI farms will just run their own nuclear reactors eventually and be unaffected.

I really don't understand why someone thought this was a good idea, even if well intentioned.

prmoustache9mo ago

Something must be wrong on your flagship smartphone because I have an entry level one that doesn't take that long.

It seems a large number of the operations crawling the web to build models aren't directly using infrastructure hosted on AI farms, BUT botnets running on commodity hardware and residential networks, to keep their IP ranges from being blacklisted. Anubis's point is to block those.

Aachen9mo ago

Which browser and which difficulty setting is that?

Because I've got the same model line but about 3 or 4 years older and it usually just flashes by in the browser Lightning from F-droid which is an OS webview wrapper. On occasion a second or maybe two, I assume that's either bad luck in finding a solution or a site with a higher difficulty setting. Not sure if I've seen it in Fennec (firefox mobile) yet but, if so, it's the same there

I've been surprised that this low threshold stops bots but I'm reading in this thread that it's rather that bot operators mostly just haven't bothered implementing the necessary features yet. It's going to get worse... We've not even won the battle let alone the war. Idk if this is going to be sustainable, we'll see where the web ends up...

jeroenhd9mo ago

Either your phone is on some extreme power saving mode, your ad blocker is breaking Javascript, or something is wrong with your phone.

I've certainly seen Anubis take a few seconds (three or four maybe) but that was on a very old phone that barely loaded any website more complex than HN.

vova_hn9mo ago

I have a Pixel 7 (released in 2022) and it usually takes less than a second...

TZubiri9mo ago

I remember that LiteCoin briefly had this idea, to be easy on consumer hardware but hard on GPUs. The ASICs didn't take long to obliterate the idea though.

Maybe there's going to be some form of pay per browse system? Even if it's some negligible cost on the order of $1 per month (and packaged with other costs), I think economies of scale would allow servers to perform a lifetime of S24 captchas in a couple of seconds.

whatevaa9mo ago

Something is wrong with your flagship if it takes that long.

ChocolateGod9mo ago

Samsung's UI has this feature where it turns on power saving mode when it detects light use.

prmoustache9mo ago

I guess his flagship IS compromised and part of an AI crawling botnet ;-)

Lammy9mo ago

You're looking at it wrong.

leumon9mo ago

Seems like AI bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787

debugnik9mo ago

That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.

This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.

nialv79mo ago

Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978

NoGravitas9mo ago

The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.

hiccuphippo9mo ago

Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.

danieltanfh959mo ago

this only holds true if the data to be accessed is less valuable than the computational cost. in this case, that is false and spending a few dollars to scrape data is more than worth it.

reducing the problem to a cost issue is bound to be short sighted.

johnea9mo ago

My biggest bitch is that it requires JS and cookies...

Although the long term problem is the business model of servers paying for all network bandwidth.

Actual human users have consumed a minority of total net bandwidth for decades:

https://www.atom.com/blog/internet-statistics/

Part 4 shows bots out using humans in 1996 8-/

What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.

The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.

This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.

So far, it's in the same category as the environmental disaster in progress, ownership is refusing to acknowledge the problem, and insisting on business as usual.

Rational predictions are that it's not going to end well...

jerf9mo ago

"Although the long term problem is the business model of servers paying for all network bandwidth."

Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.

Imustaskforhelp9mo ago

The AI scrapers are most likely VC-funded, and all they care about is getting as much data as possible without worrying about the costs.

They are hiring machines at scale too so definitely bandwidth etc. are cheaper for them too. Maybe use a provider that doesn't have too much bandwidth issues (hetzner?)

But still, the point is that you might be hosting a website on your small server, and that scraper with its beast of a machine fleet can come and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economic scales finally tip back in our favour.

johnea9mo ago

Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.

When this access is beneficial to them, that's OK, when it's detrimental to them, they're paying for their own decline.

The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.

Hizonner9mo ago

> The difference between that and the LLM training data scraping

Is the traffic that people are complaining about really training traffic?

My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.

That doesn't seem like enough traffic to be a really big problem.

On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.

That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.

Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.

So what's really going on here? Anybody actually know?

zerocrates9mo ago

The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.

There's some user-directed traffic, but it's a small fraction, in my experience.

ncruces9mo ago

It's not random internet people saying it's training. It's Cloudflare, among others.

Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/

Dylan168079mo ago

The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.

userbinator9mo ago

As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.

There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)

ack_complete9mo ago

For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.

never_inline9mo ago

> counting the number of letters in a word seems to be a good way to filter out LLMs

As long as this challenge remains obscure enough to be not worth implementing special handlers in the crawler, this sounds like a neat idea.

But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.

I wonder if anyone has tried building a crawler-firewall or even an nginx script which would let the site admin plug in their own challenge generator in Lua or something, which would then create a minimal HTML form. Maybe even vibe code it :)
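
A minimal sketch of what such a pluggable challenge generator could look like, written in Python rather than Lua purely for illustration. Everything here is an assumption and not taken from any existing tool: the word list, the `make_challenge` / `check_answer` names, and the idea of binding the expected answer into an HMAC token so the server keeps no per-challenge state. It does not stop a bot from simply guessing 1, 2, 3 against the form, so it would still need an attempt limit on top.

```python
import hashlib
import hmac
import secrets

# Hypothetical word list and server secret; a real site would pick its own.
WORDS = ["strawberry", "kernel", "catgirl", "anubis", "crawler"]
SECRET = secrets.token_bytes(32)


def _tag(nonce: str, answer: str) -> str:
    """HMAC over nonce and answer, so only the server can mint valid tokens."""
    return hmac.new(SECRET, f"{nonce}:{answer}".encode(), hashlib.sha256).hexdigest()


def make_challenge(rng=secrets.SystemRandom()):
    """Return (question, token) for a letter-counting challenge.

    The token would travel in a hidden form field next to the answer input.
    """
    word = rng.choice(WORDS)
    letter = rng.choice(word)
    answer = str(word.count(letter))
    question = f'How many times does "{letter}" appear in "{word}"?'
    nonce = secrets.token_hex(8)
    return question, f"{nonce}:{_tag(nonce, answer)}"


def check_answer(token: str, submitted: str) -> bool:
    """Recompute the tag for the submitted answer and compare in constant time."""
    nonce, tag = token.split(":", 1)
    return hmac.compare_digest(tag, _tag(nonce, submitted.strip()))
```

A site admin would swap `make_challenge` for their own domain-specific question; the verification side stays the same as long as the answer is a short string.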

soared9mo ago

Tried and true method! An old video game forum named moparscape used to ask what mopar was and I always had to google it

Aachen9mo ago

Good thing modern bots can't do a web search!

cm20129mo ago

There is a decent segment of the population that will have a hard time with that.

wavemode9mo ago

So it's no different from real CAPTCHAs, then.

hansjorg9mo ago

If you want a tip my friend, just block all of Huawei Cloud by ASN.

wging9mo ago

... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.

scratchyone9mo ago

lmfao so that kinda defeats the entire point of this project if they have to resort to a manual IP blocklist anyways

iefbr149mo ago

I wouldn't be surprised if just delaying the server response by some 3 seconds will have the same effect on those scrapers as Anubis claims.

kingstnap9mo ago

There is literally no point wasting 3 seconds of a computer's time and it's expensive wasting 3 seconds of a person's time.

That is literally an anti-human filter.

Imustaskforhelp9mo ago

From tjhorner on this same thread

"Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."

So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3 second deterrent seems good in that regard. Maybe the 3 second deterrent can come as rate limiting an IP? But they might use swaths of IPs :/

loeg9mo ago

Anubis easily wastes 3 seconds of a human's time already.

psionides9mo ago

You've just described Anubis, yeah

ranger_danger9mo ago

Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.

jmclnx9mo ago

>The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans

Not for me, I have nothing but a hard time solving CAPTCHAs, about 50% of the time I give up after 2 tries.

serf9mo ago

it's still certainly trivial for you compared to mentally computing a SHA256 op.

anarki89mo ago

Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?

1. Anubis makes you calculate a challenge.

2. You get a "token" that you can use for a week to access the website.

3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.

jeroenhd9mo ago

That's the basic principle. It's a tool to fight crawlers that spam requests without cookies to evade rate limiting.

The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.

Aachen9mo ago

That, but apparently also restrictions on what tech you can use to access the website:

- https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`

- https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)

listic9mo ago

So... Is Anubis actually blocking bots because they didn't bother to circumvent it?

loloquwowndueo9mo ago

Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.

The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.

Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.

elcritch9mo ago

Essentially the PoW aspect is pointless then? They could require almost any arbitrary thing.

Aachen9mo ago

s/circumvent/implement/

xphos9mo ago

Yeah the PoW is minor for botters but annoying for people. I think the only positive is that if enough people see anime girls on their screens there might actually be political pressure to make laws against rampant bot crawling

Havoc9mo ago

> PoW is minor for botters

But still enough to prevent a billion request DDoS

These sites have been scraped by search engines forever. It’s not about blocking bots entirely, just about this new wave of fuck-you-I-don’t-care-if-your-host-goes-down quasi-malicious scrapers

st3fan9mo ago

"But still enough to prevent a billion request DDoS" - don't you just do the PoW once to get a cookie and then you can browse freely?

elcritch9mo ago

Reading TFA, those billion requests would cost web crawlers what, about $100 in compute?

serf9mo ago

I don't care that they use anime catgirls.

What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.

I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.

It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.

heeton9mo ago

> What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net

This is probably intentional. They offer a paid unbranded version. If they had a corporate friendly brand on the free offering, then there would be fewer people paying for the unbranded one.

SnuffBox9mo ago

This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.

Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.

xandrius9mo ago

The original versions were a way to make even a boring event such as a 404 fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.

So, I don't see an error code + something fun to be that bad.

People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today, so I don't see how having fun error pages is such an issue.

Hizonner9mo ago

This assumes it's fun.

Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".

Also, it was more fun the first time or two. There's not a lot of original fun on the error pages you get nowadays.

> People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today

It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.

JdeBP9mo ago

Guru Meditations and Sad Macs are not your thing?

Hizonner9mo ago

That also got old when you got it again and again while you were trying to actually do something. But there wasn't the space to fit quite as much twee on the screen...

krige9mo ago

FWIW second and third iteration of AmigaOS didn't have "Guru Meditation"; instead it bluntly labeled the numbers as error and task.

pak9rabid9mo ago

I hear this

walthamstow9mo ago

> I host this blog on a single core 128MB VPS

Where does one even find a VPS with such small memory today?

tambourine_man9mo ago

Or software to run on it. I'm intrigued about this claim as well.

thayne9mo ago

I can't find any documentation that says Anubis does this (although it seems odd to me that it wouldn't, and I'd love a reference), but it could do the following:

1. Store the nonce (or some other identifier) of each JWT it passes out in the data store

2. Track the number or rate of requests from each token in the data store

3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)

Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.

It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
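
To make the three steps concrete, here is a rough in-memory sketch. It is not how Anubis is documented to work (as said above, I couldn't find that); the class name, the sliding-window approach, and the default limits are all assumptions, and a real deployment would keep this state in a shared data store instead of a Python dict.

```python
import time
from collections import defaultdict, deque


class TokenRateLimiter:
    """Sliding-window request counter keyed by token ID, with revocation.

    Hypothetical sketch: token IDs stand in for the JWT nonce from step 1.
    """

    def __init__(self, max_requests=60, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # token_id -> timestamps of recent requests
        self.revoked = set()            # tokens that exceeded the limit

    def allow(self, token_id: str) -> bool:
        """Record one request; return False if the token is or becomes revoked."""
        if token_id in self.revoked:
            return False
        now = self.clock()
        window = self.hits[token_id]
        # Step 2: drop timestamps that fell out of the window, then count this hit.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        window.append(now)
        # Step 3: over the threshold means the token is revoked for good.
        if len(window) > self.max_requests:
            self.revoked.add(token_id)
            del self.hits[token_id]
            return False
        return True
```

Revocation is the harshest of the options in step 3; swapping the `revoked.add` line for a delay or a tarpit response would give the softer variants, and a revoked client would have to solve the challenge again to get a fresh token.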

SnuffBox9mo ago

Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.

I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.

Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times but I think there are far too many people who act obnoxiously (as part of what can only be described as a new subculture) in open source software today and I wish there were better terms to describe it.

Philpax9mo ago

The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.

I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.

davidclark9mo ago

The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?

Philpax9mo ago

The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.

That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.

hooverd9mo ago

The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that's a big tell.

ajsnigrutin9mo ago

I always wondered about these anti bot precautions... as a firefox user, with ad blocking and 3rd party cookies disabled, i get the goddamn captcha or other random check (like this) on a bunch of pages now, every time i visit them...

Is it worth it? Millions of users wasting cpu and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.

Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?

The internet became a pain to use... back in the time, you opened the website and saw the content. Now you open it, get an antibot check, click, forward to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, i didn't even read the first sentence of the article yet... scroll down.. chat with AI bot popup... a bit further down login here to see full article...

Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.

extraduder_ire9mo ago

With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)

Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.

Aachen9mo ago

Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe version of Apple is this!

Seriously though, does anything of Apple's work without JS, like iCloud or Find My? Or does Safari somehow support it in a way that other browsers don't?

tortillasauce9mo ago

Anubis works because AI crawlers make very few requests from any single IP address to bypass rate limiting. Last year they could still be blocked by IP range, but now the requests come from so many different networks that this doesn't work anymore.

Doing the proof-of-work for every request is apparently too much work for them.

Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.

Borg39mo ago

Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet just as transport. Then, layer large VPN networks and put services there. People will just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer him from the network (or his un-cooperating upstream!).

jonathanyc9mo ago

> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!

> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.

Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?

IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.

account429mo ago

Good on you that you found a solution for yourself, but personally I will just not use websites that pull this and not contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.

anarki89mo ago

This sounds a bit overdramatic for a less than a second waiting time per week for each device. Unless you employ an army of crawlers of course.

russelg9mo ago

Interesting take to say the Linux Kernel is not worth your time.

galaxyLogic9mo ago

I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web-page so who's gonna pay for that?

If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.

If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.

8cvor6j844qw_d69mo ago

On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.

Anubis usually clears with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.

pembrook9mo ago

Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.

Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.

I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?

As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.

SnuffBox9mo ago

It is rather funny. "We must prevent AI accessing the Arch Linux help files or it will start the singularity and kill us all!"

GreenWatermelon9mo ago

In case you're genuinely confused, the reason for Anubis and similar tools is that AI-training-data-scraping crawlers are assholes, and strangle the living shit out of any webserver they touch, like a cloud of starving locusts descending upon a wheat field.

i.e. it's DDoS protection.

buyucu9mo ago

We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.

The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.

It's not perfect, but much much better than putting everything behind Cloudflare.

Aachen9mo ago

> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources

Sure, if you ignore that humans click on one page and the problematic scrapers (not the normal search engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many many times the compute to continue hammering your site whereas a normal user can muster to load that one page from the search results that they were interested in

lxgr9mo ago

> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.

It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.

oblio9mo ago

Why?

ksymph9mo ago

Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.

That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.

[0] https://xeiaso.net/blog/2025/anubis/

jhanschoo9mo ago

Your link explicitly says:

> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.

It's meant to rate-limit accesses by requiring client-side compute light enough for legitimate human users and responsible crawlers in order to access but taxing enough to cost indiscriminate crawlers that request host resources excessively.

It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.

ksymph9mo ago

Here's a more relevant quote from the link:

> Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.

As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.

ranger_danger9mo ago

The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.

heap_perms9mo ago

> I host this blog on a single core 128MB VPS

No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.

bawolff9mo ago

It doesn't take much to host a static website. It's all the dynamic stuff/frameworks/db/etc. that bogs everything down.

tambourine_man9mo ago

Still, 128MB is not enough to even run Debian let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.

Aachen9mo ago

Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, but can push a few gigabits of content around between all 24 ports at once without even going expensive.

Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that by the page size (images etc.) and you probably get a few megabits of bandwidth, no problem even for a Raspberry Pi 1 if the SD card can read fast enough or the files are mapped to RAM by the kernel.

zoobab9mo ago

Time to switch to stagit. Unfortunately it only generates static pages for a git repo's "master" branch. I am sure someone will modify it to support other branches.

zb39mo ago

Anubis doesn't use enough resources to deter AI bots. If you really want to go this way, use React, preferably with more than one UI framework.

johnisgood9mo ago

I like hashcash.

https://github.com/factor/factor/blob/master/extra/hashcash/...

https://bitcoinwiki.org/wiki/hashcash

loloquwowndueo9mo ago

Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
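A minimal sketch of that asymmetry, assuming SHA-256 and a "leading zero bits" target. The `challenge:nonce` string format and the function names are made up for illustration and are not Anubis's actual protocol:

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # For a nonzero byte, 8 - bit_length() is its number of leading zeros.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Requester side: brute-force a nonce, about 2**difficulty hashes on average."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: exactly one hash, whatever the difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty
```

Each extra bit of difficulty doubles the expected work in `solve`, while `verify` stays at a single hash.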

littlecranky679mo ago

We need bitcoin-based lightning nano-payments for such things. Like visiting the website will cost $0.0001 cent, the lightning invoice is embedded in the header and paid for after single-click confirmation or if threshold is under a pre-configured value. Only way to deal with AI crawlers and future AI scams.

With the current approach we just waste the energy; if you use bitcoin already mined (=energy previously wasted) it becomes sustainable.

grahar649mo ago

I wrote about something similar a while back https://maori.geek.nz/proof-of-human-2ee5b9a3fa28

It's about the difficulty of proving you are human, especially when every test built has so much incentive to be broken. I don't think it will be solved, or could ever be solved.

fluoridation9mo ago

Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?

jsnell9mo ago

No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?

That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
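A back-of-the-envelope version of that argument, using the $0.01 per CPU-hour figure from the comment. The 1-second challenge and the "one challenge per page" worst case are assumptions picked for illustration, not measured Anubis numbers:

```python
CPU_HOUR_USD = 0.01  # figure quoted in the comment above


def challenge_cost_usd(challenge_seconds: float, challenges: int) -> float:
    """Dollar cost of the CPU time spent solving `challenges` challenges."""
    return CPU_HOUR_USD * challenge_seconds * challenges / 3600


# Assumed worst case for the scraper: a full 1-second challenge on every one
# of a million page loads.
print(f"${challenge_cost_usd(1.0, 1_000_000):.2f}")  # $2.78
```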

pavon9mo ago

But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down a scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.

fluoridation9mo ago

>An hour of a server CPU costs $0.01. How much is an hour of your time worth?

That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?

>Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.

No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
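The slowdown can be worked out directly from those two numbers. This is only a sketch: `requests_per_challenge` is a hypothetical knob for how many requests a single solved challenge covers, since how often a client has to re-solve depends on the deployment:

```python
def slowdown(challenge_ms: float, serve_ms: float, requests_per_challenge: int = 1) -> float:
    """Factor by which total time grows once the challenge cost is included."""
    baseline = serve_ms * requests_per_challenge
    return (challenge_ms + baseline) / baseline


print(slowdown(250, 25))       # 11.0: a challenge on every request
print(slowdown(250, 25, 100))  # 1.1: one challenge amortized over 100 requests
```

So the roughly tenfold figure only holds for a client that has to solve again on each request; a client that keeps its cookie pays the cost once and amortizes it.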

VMG9mo ago

crawlers can run JS, and also invest into running the Proof-Of-JS better than you can

tjhorner9mo ago

Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.

fluoridation9mo ago

If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.

trostaft9mo ago

I actually really liked seeing the mascot. Brought a sense of whimsy to the Internet that I've missed for a long time.

andromaton9mo ago

Hug of death https://archive.ph/BSh1l

herf9mo ago

We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.

anonfordays9mo ago

Just use Anubis Bypass: https://addons.mozilla.org/en-US/android/addon/anubis-bypass...

Haven't seen dumb anime characters since.

jchw9mo ago

A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.

Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.

Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straightforward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases further the more sites use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than they do for a specific random visitor.

To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.

If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.

In the long term, I think the success of this class of tools will stem from two things:

1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.

2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.

I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.

o11c9mo ago

> A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.

... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
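For anyone who hasn't seen that trick, a framework-free sketch. The in-memory `SESSIONS` dict stands in for the database-backed session table, and the cookie names are hypothetical; this is not phpBB code:

```python
import secrets

SESSIONS: dict[str, dict] = {}  # stand-in for the session table in the database
PROBE_COOKIE = "cookie_probe"   # hypothetical cookie names
SESSION_COOKIE = "sid"


def handle_request(cookies: dict[str, str]) -> dict[str, str]:
    """Return the cookies to set on the response for this request."""
    sid = cookies.get(SESSION_COOKIE)
    if sid in SESSIONS:
        return {}  # known session, nothing new to store
    if PROBE_COOKIE in cookies:
        # The client sent the probe back, so it keeps cookies:
        # only now pay for a stored session.
        sid = secrets.token_urlsafe(16)
        SESSIONS[sid] = {}
        return {SESSION_COOKIE: sid}
    # First visit, or a client that drops cookies: set a cheap probe, store nothing.
    return {PROBE_COOKIE: "1"}
```

A bot that never returns cookies can hit this a million times without creating a single session row. The trade-off is that clients which refuse cookies never get a session at all.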

jchw9mo ago

phpBB supports browsers that don't support or accept cookies: if you don't have a cookie, the URL for all links and forms will have the session ID in it. Which would be great, but it seems like these bots are not picking those up either for whatever reason.

MikeDVB9mo ago

We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.

Personally I have no issue with AI bots that properly identify themselves scraping content: if the site operator doesn't want it to happen, they can easily block the offending bot(s).

We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.

I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...

It's definitely going to be cat-and-mouse.

The brutally honest truth is that if they throttled themselves so as not to totally crash whatever site they're trying to scrape, we'd probably have never noticed or gone through the trouble of writing our own proof-of-work challenge.

Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.

jchw9mo ago

> We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.

Yep. I noticed this too.

> That said they could even run headless versions of the browser engines...

Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.

That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.

Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.

yuumei9mo ago

> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.

> Anubis – confusingly – inverts this idea.

Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.

m-p-39mo ago

And Codeberg, even behind Anubis, is not immune from scrapers either

https://social.anoxinon.de/@Codeberg/115033782514845941

qwertytyyuu9mo ago

Isn’t Anubis a dog? So it should be anime dog/wolf girl rather than cat girl?

Twisol9mo ago

Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".

Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.

immibis9mo ago

The actual answer to how this blocks AI crawlers is that they just don't bother to solve the challenge. Once they do bother solving the challenge, the challenge will presumably be changed to a different one.

usbpoet9mo ago

I don't think I've ever actually seen Anubis once. Always interesting to see what's going on in parts of the internet you aren't frequenting.

dominick-cc9mo ago

I read hackernews on my phone when I'm bored and I've seen it a lot lately. I don't think I've ever seen it on my desktop.

anotherhue9mo ago

Surely the difficulty factor scales with the system load?

est9mo ago

I hope there's some kind of memory-hungry checker to replace the CPU cost.

A 2GB memory consumption won't stop them, but it will limit the parallelism of crawlers.
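One way this could look, sketched with scrypt as the memory-hard function via Python's standard `hashlib.scrypt`. Swapping scrypt in is my assumption and not something Anubis is known to do, the function names are made up, and the default parameters here use about 16 MiB per attempt, nowhere near 2GB:

```python
import hashlib
from itertools import count


def memory_bytes(n: int, r: int) -> int:
    """Approximate working memory scrypt needs per evaluation."""
    return 128 * n * r


def scrypt_digest(challenge: bytes, nonce: int, n: int, r: int) -> bytes:
    """One memory-hard hash of the nonce, salted with the server's challenge."""
    return hashlib.scrypt(
        str(nonce).encode(), salt=challenge, n=n, r=r, p=1,
        maxmem=2 * memory_bytes(n, r) + 1024 * 1024, dklen=32,
    )


def meets_target(digest: bytes, difficulty: int) -> bool:
    """True if the 256-bit digest starts with `difficulty` zero bits."""
    return int.from_bytes(digest, "big") >> (256 - difficulty) == 0


def solve_memory_hard(challenge: bytes, difficulty: int, n: int = 2**14, r: int = 8) -> int:
    """Client side: every attempt needs roughly memory_bytes(n, r) of RAM."""
    for nonce in count():
        if meets_target(scrypt_digest(challenge, nonce, n, r), difficulty):
            return nonce


def verify_memory_hard(challenge: bytes, nonce: int, difficulty: int,
                       n: int = 2**14, r: int = 8) -> bool:
    """Server side: one evaluation, but it needs the same memory as the client."""
    return meets_target(scrypt_digest(challenge, nonce, n, r), difficulty)
```

Note the downside compared with plain SHA-256: verification is no longer nearly free, because the server has to run the same memory-hungry function once per check.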

auggierose9mo ago

Would it not be more effective just to require payment for accessing your website? Then you don't need to care about bot or not.

deevus9mo ago

This seems like a good place to ask. How do I stop bots from signing up to my email list on my website without hosting a backend?

account429mo ago

Depending on your target audience you could require people signing up to send you an email first.

miohtama9mo ago

The solution is to make premium subscription service for those who do not want to solve CAPTCHAs.

Money is the best proof of humanity.

lock19mo ago

Doesn't that line of reasoning imply that companies with multiple billions of dollars in their war chest are much more "human" than a literal human with student loans?

00039mo ago

Soon any attempt to actually do it would indicate you're a bot.

spiritplumber9mo ago

For the same reason why cats sit on your keyboard. Because they can

whatevaa9mo ago

Site doesn't load, must be hit by AI crawlers.

raffraffraff9mo ago

HN hug of death

mr_toad9mo ago

I’m getting a black page. Not sure if it’s an ironic meta commentary, or just my ad blocker.

pluc9mo ago

Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?

abustamam9mo ago

It's a good question. Anime (like many media, but especially anime) is known to have gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to just entice viewers.

The reasoning is that because they aren't real people, it's okay to draw and view images of anime, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.

andai9mo ago

Probably depends on the person, but this stuff is mostly the cute instinct, same as videos of kittens. "Aww" and "I must protect it."

dominick-cc9mo ago

2D girls don't nag and I've never had to clear their clogged hair out of my shower drain.

SnuffBox9mo ago

I'd say it's partially a result of 4chan.

hollerith9mo ago

We live in a decadent society.

tonymet9mo ago

So it's a paywall with -- good intentions -- and even more accessibility concerns. Thus accelerating enshittification.

Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?

It's convoluted security theater mucking up an already bloated, flimsy and sluggish internet. It's frustrating enough to guess schoolbuses every time I want to get work done; now I have to see pornified kitty waifus.

(openwrt is another community plagued with this crap)

tonymet9mo ago

here is the community post with Anubis pro / con experiences https://forum.openwrt.org/t/trying-out-anubis-on-the-wiki/23...

lousken9mo ago

aren't you happy? at least you see catgirl

a-dub9mo ago

blame canada

verall9mo ago

It's posts like this that make me really miss the webshit weekly

a-dub9mo ago

i suppose one nice property is that it is trivially scalable. if the problem gets really bad and the scrapers have llms embedded in them to solve captchas, the difficulty could be cranked up and the lifetime could be cranked down. it would make the user experience pretty crappy (party like it's 1999) but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.

it does have arty political vibes though, the distributed and decentralized open source internet with guardian catgirls vs. late stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.

ge969mo ago

Oh I saw this recently on ffmpeg's site, pretty fun

senectus19mo ago

the action is great, anubis is a very clever idea i love it.

I'm not a huge fan of the anime thing, but i can live with it.

valiant559mo ago

I really don't understand the hostility towards the mascot. I can't think of a bigger red flag.

Borgz9mo ago

Funny to say this when the article literally says "nothing wrong with mascots!"

Out of curiosity, what did you read as hostility?

valiant559mo ago

Oh I totally reacted to the title. The last few times Anubis has been the topic there's always comments about "cringy" mascot and putting that front and center in the title just made me believe that anime catgirls was meant as an insult.

efilife9mo ago

This cartoon mascot has absolutely nothing to do with anime

If you disagree, please say why

KolmogorovComp9mo ago

Why does Anubis not leverage PoW from its users to do something useful (at best, distributed computing for science, at worst, a crypto-currency at least allowing the webmasters to get back some cash)

johnklos9mo ago

People are already complaining. Could you imagine how much fodder this'd give people who didn't like the work or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing).

rnhmjoj9mo ago

I don't understand: why do people resort to this tool instead of simply blocking by UA string or IP address? Are there so many people running these AI crawlers?

I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.

mnmalst9mo ago

Because that solution simply does not work for all. People tried and the crawlers started using proxies with residential IPs.

hooverd9mo ago

less savory crawlers use residential proxies and are indistinguishable from malware traffic

busterarm9mo ago

Lots of companies run these kinds of crawlers now as part of their products.

They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.

There are lots of companies around that you can buy this type of proxy service from.

WesolyKubeczek9mo ago

You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.

rnhmjoj9mo ago

Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.

[1]: https://pod.geraspora.de/posts/17342163

majorchord9mo ago

> AI companies use residential proxies

Source:

superkuh9mo ago

Kernel.org* just has to actually configure Anubis rather than deploying the default broken config. Enable the meta-refresh proof of work rather than relying on the corporate-browsers-only, bleeding-edge JavaScript application proof of work.

* or whatever site the author is talking about; his site is currently inaccessible due to the number of people trying to load it.

zaptrem9mo ago

If people are truly concerned about the crawlers hammering their 128MB Raspberry Pi website then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like Common Crawl).

If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using their service (with deltas or something to reduce bandwidth) I bet nobody would bother actually spending the time to automate cracking it since it’s basically negative value. You could even make it a torrent so most of the costs are paid by random large labs/universities.

I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.

sussmannbaka9mo ago

No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where instead of just fetching the repository every single history revision of every single file gets recursively hammered to death. The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.

lmm9mo ago

The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)

shiomiru9mo ago

> I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…

I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".

They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.

elsjaako9mo ago

There's a lot of people that really don't like AI, and simply don't want their data used for it.

msgodel9mo ago

I'm generally very pro-robot (every web UA is a robot really IMO) but these scrapers are exceptionally poorly written and abusive.

Plenty of organizations managed to crawl the web for decades without knocking things over. There's no reason to behave this way.

It's not clear to me why they've continued to run them like this. It seems so childish and ignorant.

jayrwren9mo ago

Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" https://lock.cmpxchg8b.com/anubis.html Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?

Macha9mo ago

The top link when you search the title of the article is the article itself?

I am shocked, shocked I say.
