The challenge is definitely figuring out whether this solution actually works at scale. I've played around with a Hashcash implementation myself, using WebCrypto, but I worry because even WebCrypto is quite a lot slower than hashing in native code. Seeing Anubis have some apparent success makes me hopeful, though. If it gains broad adoption, it might be just enough of a pain in the ass for scrapers, while still letting automation pass provided it can pay the compute toll (i.e., hopefully, anything that's not terribly abusive).
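For anyone unfamiliar with the mechanics: a Hashcash-style proof of work asks the client to find a nonce whose hash has some number of leading zero bits, which the server can verify with a single hash. A minimal sketch in Python (the challenge string and 12-bit difficulty here are illustrative, not Anubis's actual scheme):

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits in a byte string."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count

def mint_stamp(challenge: str, bits: int) -> int:
    """Client side: brute-force a nonce meeting the difficulty target."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce

def verify_stamp(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: verification is a single hash -- cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= bits

nonce = mint_stamp("example.org/2024-06-01", 12)
assert verify_stamp("example.org/2024-06-01", nonce, 12)
```

Each extra bit of difficulty doubles the expected client work while verification stays constant, which is exactly the asymmetry a PoW wall relies on.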
On a lighter note, I've found the reception of Anubis, and in particular the anime-style mascot, to be predictably amusing.
https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...
(Note: I'd personally suggest not going and replying here. Don't want to encourage brigading of any sort, just found this mildly amusing.)
Some may not be aware that Hashcash's value is as a decentralized rate limiter that can be added to almost any protocol. Experience with Hashcash taught us that it's essential to have a dynamic pricing scheme based on the reputation of the contact originator. In the email context, when a message sender connects to a receiving server, the receiving server should be able to tell the sender the size of the stamp needed, based on the reputation established by previous messages.
From my perspective, the two main challenges in rate-limiting HTTP requests are communicating the required Hashcash stamp size and measuring the reputation of the request initiator. I think Anubis is a good first step, but it uses a fixed-size stamp (a small one), has no reputation database, and, from what I can tell, has no robust detector of good versus bad players. These shortcomings will make it challenging to provide adequate protection without interfering with good players.
I'll spare you my design-note rambling, but I think that from three page requests one can gather enough information to determine the size of the Hashcash stamp for future requests.
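To make the dynamic-pricing idea concrete, here's a hypothetical sketch of mapping a client's behaviour over its first few page requests to a stamp difficulty for subsequent requests. The thresholds, field names, and baseline are my own invention, not anything from Anubis:

```python
# Hypothetical reputation-to-difficulty mapping. An unknown client starts
# at a 12-bit baseline; suspicious behaviour raises the price, and
# human-looking behaviour lowers it toward an 8-bit floor.

def stamp_bits(pages_fetched: int, avg_interval_s: float, errors_4xx: int) -> int:
    bits = 12  # baseline difficulty for an unknown client
    if avg_interval_s < 0.5:             # hammering the server
        bits += 6
    if errors_4xx > pages_fetched // 2:  # mostly probing dead URLs
        bits += 4
    if pages_fetched > 3 and avg_interval_s > 5 and errors_4xx == 0:
        bits -= 4                        # looks like a human reader
    return max(bits, 8)

assert stamp_bits(3, 0.1, 0) == 18    # fast scraper pays more
assert stamp_bits(10, 8.0, 0) == 8    # slow, clean client pays the floor
```

The point is that the server announces the price per client, so abusive traffic pays exponentially more without a hard block.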
That's what it's intended to be right now. I've been thinking through how to do a reputation database; I'm half considering using a DHT like BitTorrent's for cross-Anubis coordination (I haven't filed an issue about this because I'm still noodling it out into a spec). I'm also working on more advanced risk calculation, but this kind of exploded out of nowhere for me.
- Make it generate cryptocurrency, so that the work is not wasted: either to offset the server expenses of hosting the content, or for some noble non-profit cause, with all installations collecting the currency into a single account. Wasting the work is worse than either of these options.
- An easy way for good crawlers (like the Internet Archive) to authenticate themselves, e.g. TLS client-side authentication, or simply an HTTP request header containing a signature for the request (the signature in the header could be based, for example, on their domain name and the TLS cert for that domain).
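The signature-header idea could look something like the sketch below. A real deployment would use an asymmetric signature tied to the crawler's domain cert; since Python's stdlib has no Ed25519, HMAC with a shared secret stands in here purely to show the shape, and all names are hypothetical:

```python
import hashlib
import hmac

# Stand-in shared secret; the real scheme would use the crawler's keypair.
SECRET = b"shared-secret-established-out-of-band"

def sign_request(method: str, path: str, domain: str) -> str:
    """What a good crawler would put in, say, an X-Crawler-Signature header."""
    msg = f"{method} {path} {domain}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def verify_request(method: str, path: str, domain: str, signature: str) -> bool:
    """Server side: constant-time comparison against the expected signature."""
    expected = sign_request(method, path, domain)
    return hmac.compare_digest(expected, signature)

sig = sign_request("GET", "/robots.txt", "crawler.archive.org")
assert verify_request("GET", "/robots.txt", "crawler.archive.org", sig)
```

A verified header would let the server waive the PoW for that request.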
The last time[0] we did this, everyone had a meltdown and blocked it.
[0] See Coinhive, which conspicuously lacks a Wikipedia page.
"proof of pay" or "proof of a transaction"
An advantage of this would be not wasting electricity on the proof-of-work computation.
These methods are not really accessibility-friendly, though.
A small scraper may be able to afford the extra CPU cycles, but for AI training bots, which sometimes send hundreds of browser instances at a time, the math becomes different.
From what I've read about the results, it seems like the approach is effective against the very worst scrapers and bots out there.
It'll do that too, but it's really more of a general-purpose anti-bot measure, right? A generic PoW wall.
You could use PKI: drop the PoW if the client provides a TLS client certificate chain asserting that <publicKey> corresponds to a private key controlled by <fullNamesAndAddressesOfThesePeople> (or just by, say, <peopleWhoControlThisUrl>, for Let's Encrypt-style automatable cert signing). This would be a slight hassle for good bot operators to set up, but not a very big deal. The result is that bad bots couldn't spoof good bots to get in.
(Obviously this strategy generalises to handling human users too -- but in that case, the loss of privacy, as well as admin inconvenience, makes it much less palatable.)
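On the server side, requesting (but not requiring) a client certificate is enough to support this: identified good bots present a cert and skip the PoW, while anonymous clients still get the wall. A minimal sketch with Python's `ssl` module; the file paths are placeholders:

```python
import ssl

# Server-side TLS context that asks for a client certificate but still
# accepts connections without one (CERT_OPTIONAL). After the handshake,
# conn.getpeercert() is non-empty only for clients that presented a cert
# signed by one of the trusted CAs -- those could bypass the PoW wall.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.verify_mode = ssl.CERT_OPTIONAL
# ctx.load_cert_chain("server.crt", "server.key")    # server identity (placeholder paths)
# ctx.load_verify_locations("trusted_bot_cas.pem")   # CAs that sign good-bot certs
```

The policy check then reduces to "did this connection present a verified cert?", with everything else falling through to the PoW challenge.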
If I want to create a good bot tomorrow, where do I publish its IP addresses? IOW, how can I ensure that the world "notices"?
Hydrocarbons should cost more. Vote for a pollution tax.
It'll suck for battery life if you're often browsing random website after random website, or if you're a bot farm, but in practice I don't think most welcome users will notice a thing.
Since that would cost the bad guys money, they won't actually do it much.
TTBOMK there's nothing here that "detects botness" of an individual request, because in the limit, that's impossible -- if an attacker has access to many different IPs to make requests from (and many do), then any sequence of bot-generated requests from different IPs is indistinguishable from the same set of requests made by actual living, breathing humans (and vice versa).
So how does Anubis work against bots if it can't actually detect them? Because of the economics behind them: to justify building a bot in the first place, you need to scrape a lot of pages, so even a small electricity cost per page adds up to a large total. Humans pay this cost too, but because we request far fewer pages, the overall cost is negligible.
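The asymmetry is easy to see with a back-of-the-envelope calculation (the per-page solve time and page counts below are made-up round numbers, not measurements):

```python
# Illustrative cost asymmetry: the same per-page PoW price is negligible
# for a human but substantial for a large-scale scraper.

solve_seconds = 2           # assumed PoW cost per first visit to a site
human_pages   = 50          # sites a person might hit in a day
bot_pages     = 10_000_000  # pages a large scraper wants per day

human_cost_hours = human_pages * solve_seconds / 3600
bot_cost_hours   = bot_pages * solve_seconds / 3600

assert human_cost_hours < 0.1   # well under six minutes, spread over a day
assert bot_cost_hours > 5_000   # thousands of CPU-hours the scraper must buy
```

The per-request price is identical for everyone; only the volume makes it a deterrent.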
https://news.ycombinator.com/item?id=43426074
https://news.ycombinator.com/item?id=43422797
https://news.ycombinator.com/item?id=43422170
https://news.ycombinator.com/item?id=43422160
We have to ban accounts that keep doing that, so if you'd please review https://news.ycombinator.com/newsguidelines.html and fix this, we'd appreciate it.