The challenge is definitely figuring out whether this solution actually works at scale. I've played around with a Hashcash implementation myself, using WebCrypto, but I worry because even WebCrypto is quite a lot slower than hashing in native code. Seeing Anubis have some apparent success makes me hopeful, though. If it gains broad adoption, it might be just enough of a pain in the ass for scrapers, while still letting automation pass provided it can pay the compute toll (i.e., hopefully, anything that's not terribly abusive).
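For anyone unfamiliar with the mechanics: a Hashcash-style proof of work asks the client to find a nonce whose hash has some number of leading zero bits, which the server can verify with a single hash. A minimal sketch in Python (the challenge string and 12-bit difficulty here are illustrative, not Anubis's actual scheme):

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits in a byte string."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count

def mint_stamp(challenge: str, bits: int) -> int:
    """Client side: brute-force a nonce meeting the difficulty target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce

def verify_stamp(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: verification is a single hash -- cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits

nonce = mint_stamp("example.org/2024-06-01", 12)
assert verify_stamp("example.org/2024-06-01", nonce, 12)
```

Each extra bit of difficulty doubles the expected client work while verification stays constant, which is exactly the asymmetry a PoW wall relies on.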
On a lighter note, I've found the reception of Anubis, and in particular the anime-style mascot, to be predictably amusing.
https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...
(Note: I'd personally suggest not going and replying here. Don't want to encourage brigading of any sort, just found this mildly amusing.)
Some may not be aware that Hashcash's value is as a decentralized rate limiter that can be added to almost any protocol. Experience with Hashcash taught us that it's essential to have a dynamic pricing scheme based on the reputation of the contact originator. In the email context, when a message sender connects to a receiving server, the receiving server should be able to tell the sender the size of the stamp needed, based on the reputation established by previous messages.
From my perspective, the two main challenges in rate-limiting HTTP requests are communicating the required Hashcash stamp size and measuring the reputation of the request initiator. I think Anubis is a good first step, but it uses a fixed-size stamp (a small one), has no reputation database, and, from what I can tell, has no robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.
I'll spare you my design-note rambling, but I think that from three page requests one can gather enough information to determine the size of the Hashcash stamp for future requests.
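To make the dynamic-pricing idea concrete, here's a hypothetical sketch of mapping a client's behaviour over its first few page requests to a stamp difficulty for subsequent requests. The thresholds, field names, and baseline are my own invention, not anything from Anubis:

```python
# Hypothetical reputation-to-difficulty mapping. An unknown client starts
# at a 12-bit baseline; suspicious behaviour raises the price, and
# human-looking behaviour lowers it toward an 8-bit floor.

def stamp_bits(pages_fetched: int, avg_interval_s: float, errors_4xx: int) -> int:
    bits = 12  # baseline difficulty for an unknown client
    if avg_interval_s < 0.5:             # hammering the server
        bits += 6
    if errors_4xx > pages_fetched // 2:  # mostly probing dead URLs
        bits += 4
    if pages_fetched > 3 and avg_interval_s > 5 and errors_4xx == 0:
        bits -= 4                        # looks like a human reader
    return max(bits, 8)

assert stamp_bits(3, 0.1, 0) == 18    # fast scraper pays more
assert stamp_bits(10, 8.0, 0) == 8    # slow, clean client pays the floor
```

The point is that the server announces the price per client, so abusive traffic pays exponentially more without a hard block.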
That's what it's intended to be right now. I've been thinking through how to do a reputation database; I'm half considering using a DHT like BitTorrent's for cross-Anubis coordination (I haven't filed an issue about this because I'm still noodling it out into a spec). I'm also working on more advanced risk calculation, but this kind of exploded out of nowhere for me.
- Make it generate cryptocurrency, so that the work is not wasted: either to offset the server expenses of hosting the content, or for some noble non-profit cause, with all installations collecting the currency into a single account. Wasting the work is worse than either of these options.
- An easy way for good crawlers (like the Internet Archive) to authenticate themselves, e.g. TLS client-side authentication, or simply an HTTP request header containing a signature for the request (the signature in the header could be based, for example, on their domain name and the TLS cert for that domain).
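The signature-header idea could look something like the sketch below. A real deployment would use an asymmetric signature tied to the crawler's domain cert; since Python's stdlib has no Ed25519, HMAC with a shared secret stands in here purely to show the shape, and all names are hypothetical:

```python
import hashlib
import hmac

# Stand-in shared secret; the real scheme would use the crawler's keypair.
SECRET = b"shared-secret-established-out-of-band"

def sign_request(method: str, path: str, domain: str) -> str:
    """What a good crawler would put in, say, an X-Crawler-Signature header."""
    msg = f"{method} {path} {domain}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_request(method: str, path: str, domain: str, signature: str) -> bool:
    """Server side: constant-time comparison against the expected signature."""
    expected = sign_request(method, path, domain)
    return hmac.compare_digest(expected, signature)

sig = sign_request("GET", "/robots.txt", "crawler.archive.org")
assert verify_request("GET", "/robots.txt", "crawler.archive.org", sig)
```

A verified header would let the server waive the PoW for that request.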
The last time[0] we did this, everyone had a meltdown and blocked it.
[0] See Coinhive, which conspicuously lacks a Wikipedia page.
"proof of pay" or "proof of a transaction"
An advantage of this would be not wasting electricity on the proof-of-work computation.
These methods are not really accessibility-friendly, though.
A small scraper may be able to afford the extra CPU cycles, but for AI training bots, which sometimes send hundreds of browser instances at a time, the math becomes different.
From what I've read about the results, it seems like the approach is effective against the very worst scrapers and bots out there.
It'll do that too, but it's really more of a general-purpose anti-bot measure, right? A generic PoW wall.
You could use PKI: drop the PoW if the client provides a TLS client certificate chain asserting that <publicKey> corresponds to a private key controlled by <fullNamesAndAddressesOfThesePeople> (or just by, say, <peopleWhoControlThisUrl>, for Let's Encrypt-style automatable cert signing). This would be a slight hassle for good bot operators to set up, but not a very big deal. The result is that bad bots couldn't spoof good bots to get in.
(Obviously this strategy generalises to handling human users too -- but in that case, the loss of privacy, as well as admin inconvenience, makes it much less palatable.)
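On the server side, requesting (but not requiring) a client certificate is enough to support this: identified good bots present a cert and skip the PoW, while anonymous clients still get the wall. A minimal sketch with Python's `ssl` module; the file paths are placeholders:

```python
import ssl

# Server-side TLS context that asks for a client certificate but still
# accepts connections without one (CERT_OPTIONAL). After the handshake,
# conn.getpeercert() is non-empty only for clients that presented a cert
# signed by one of the trusted CAs -- those could bypass the PoW wall.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.verify_mode = ssl.CERT_OPTIONAL
# ctx.load_cert_chain("server.crt", "server.key")    # server identity (placeholder paths)
# ctx.load_verify_locations("trusted_bot_cas.pem")   # CAs that sign good-bot certs
```

The policy check then reduces to "did this connection present a verified cert?", with everything else falling through to the PoW challenge.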
If I want to create a good bot tomorrow, where do I publish its IP addresses? IOW, how can I ensure that the world "notices"?
Hydrocarbons should cost more. Vote for a pollution tax.
It'll suck for battery life if you're often browsing random website after random website, or if you're a bot farm, but in practice I don't think most welcome users will notice a thing.
Since that would cost the bad guys money, they won't actually do it much.
TTBOMK there's nothing here that "detects botness" of an individual request, because in the limit, that's impossible -- if an attacker has access to many different IPs to make requests from (and many do), then any sequence of bot-generated requests from different IPs is indistinguishable from the same set of requests made by actual living, breathing humans (and vice versa).
So how does Anubis work against bots if it can't actually detect them? Because of the economics behind them: to justify building a bot in the first place, you need to scrape a lot of pages, so even a small electricity cost per page adds up to a large total. Humans pay this cost too, but because we request far fewer pages, the overall cost is negligible.
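The asymmetry is easy to see with a back-of-the-envelope calculation (the per-page solve time and page counts below are made-up round numbers, not measurements):

```python
# Illustrative cost asymmetry: the same per-page PoW price is negligible
# for a human but substantial for a large-scale scraper.

solve_seconds = 2           # assumed PoW cost per first visit to a site
human_pages   = 50          # sites a person might hit in a day
bot_pages     = 10_000_000  # pages a large scraper wants per day

human_cost_hours = human_pages * solve_seconds / 3600
bot_cost_hours   = bot_pages * solve_seconds / 3600

assert human_cost_hours < 0.1   # well under six minutes, spread over a day
assert bot_cost_hours > 5_000   # thousands of CPU-hours the scraper must buy
```

The per-request price is identical for everyone; only the volume makes it a deterrent.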
https://news.ycombinator.com/item?id=43426074
https://news.ycombinator.com/item?id=43422797
https://news.ycombinator.com/item?id=43422170
https://news.ycombinator.com/item?id=43422160
We have to ban accounts that keep doing that, so if you'd please review https://news.ycombinator.com/newsguidelines.html and fix this, we'd appreciate it.