Any access will fall into either of the following categories:
- client with JS and cookies. In this case the server now has an identity to apply rate limiting to, from the cookie. Humans should never hit it, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.
- amnesiac (no cookies) clients with JS. Each access is now expensive.
(- no JS - no access.)
The point is to prevent parallel crawling and overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis is making prohibitively expensive.
I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding that the math shakes out that way should, perhaps, lead one to question their starting point / assumptions[1].
In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.
[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.
This is a nice explanation. It's much clearer than anything I've seen offered by Anubis’s authors, in terms of why or how it could be effective at preventing a site from being ravaged by hordes of ill-behaved bots.
That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times" because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation which is likely just not worth it.
You can actually see this in real life if you google web scraping services and which targets they claim to bypass - all of them bypass generic anti-bots like Cloudflare, Akamai etc. but struggle with custom and rare stuff like Chinese websites or small forums because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.
Isn't this what Microsoft is trying to do with their sliding puzzle piece and choose the closest match type systems?
Also, if you come in on a mobile browser it could ask you to lay your phone flat and then shake it up and down for a second or something similar that would be a challenge for a datacenter bot pretending to be a phone.
I usually just sit there on my phone pressing the "I am not a robot" box when it triggers.
These are trivial for an AI agent to solve though, even with very dumb watered down models.
>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if it has that. For myself I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.
(Why do I do it? For most of them I don't enable JS or cookies, so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
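To make the claim concrete, here is a minimal sketch of the check as I described it. This is an illustration of the behavior, not Anubis's actual code; the function name `gets_challenge` is made up, and real deployments can be configured with other rules.

```python
def gets_challenge(user_agent: str) -> bool:
    """Hypothetical gate: challenge only clients whose UA contains "Mozilla".

    This mirrors the behavior described in the comment above; it is an
    assumption about the default policy, not code taken from Anubis.
    """
    return "Mozilla" in user_agent


# A typical browser UA is challenged; a UA without "Mozilla" is waved through.
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
custom_ua = "my-own-browser/1.0"
print(gets_challenge(browser_ua), gets_challenge(custom_ua))
```

Which is also why overriding the header with an extension is enough to skip the page entirely.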
Browser fingerprinting works best against people with unique headers. There's probably millions of people using an untouched safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.
Hm. If your site is "sticky", can it mine Monero or something in the background?
We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.
This anime girl is not Anubis. It's a modern cartoon character that simply borrows the name because it sounds cool, without caring anything about the history or meaning behind it.
Anime culture does this all the time, drawing on inspiration from all cultures but nearly always only paying the barest lip service to the original meaning.
I don't have an issue with that, personally. All cultures and religions should be fair game as inspiration for any kind of art. But I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source just because they share a name and some other very superficial characteristics.
Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.
As the botmakers circumvent, new methods of proof-of-notbot will be made available.
It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.
This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
I actually find the featured article very interesting. It doesn't feel dismissive of Anubis, but rather it questions whether this particular solution makes sense or not in a constructive way.
If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.
Consider:
An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.
A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.
And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
This is also how the Blu-Ray BD+ system worked.
The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.
1) scrapers just run a full browser and wait for the page to stabilize. They did this before this thing launched, so it probably never worked.
2) The AI reading the page needs something like 5 seconds * 1600W to process it. Assuming my phone can even perform that much compute as efficiently as a server class machine, it’d take a large multiple of five seconds to do it, and get stupid hot in the process.
Note that (2) holds even if the AI is doing something smart like batch processing 10-ish articles at once.
That's the opposite of being dismissive. The author has taken the time to deeply understand both the problem and the proposed solution, and has taken the time to construct a well-researched and well-considered argument.
This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.
> This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.
You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.
It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
That's what I was hoping to get from the "Numbers" section.
I generally don't look up the logs or numbers on my tiny, personal web spaces hosted on my server, and I imagine I could, at some point, become the victim of aggressive crawling (or maybe I have without noticing because I've got an oversized server on a dual link connection).
But the numbers actually only show the performance of doing the PoW, not the effect it has had on any site — I am just curious, and I'd love it if someone has done the analysis, ideally grouped by the bot type ("OpenAI bot was responsible for 17% of all requests, this got reduced from 900k requests a day to 0 a day"...). Search, unfortunately, only gives me all the "Anubis is helping fight aggressive crawling" blog articles, nothing with substance (I haven't tried hard, I admit).
Edit: from further down the thread there's https://dukespace.lib.duke.edu/server/api/core/bitstreams/81... but no analysis of how many real customers were denied — more data would be even better
It might be a tool in the box. But it’s still cat and mouse.
In my place we quickly concluded the scrapers have tons of compute and the “proof-of-work” aspect was meaningless to them. It’s simply the “response from site changed, need to change our scraping code” aspect that helps.
Yes, for these human-based challenges. But this challenge is defined in code. It's not like crawlers don't run JavaScript. It's 2025, they all use headless browsers, not curl.
Would I do that again? Probably not. These days I’d require a weekly mDL or equivalent credential presentation.
I have to disagree that an anti-bot measure that only works globally for a few weeks until bots trivially bypass it is effective. In an arms race against bots the bots win. You have to outsmart them by challenging them to do something that only a human can do or is actually prohibitively expensive for bots to do at scale. Anubis doesn't pass that test. And now it’s littered everywhere defunct and useless.
Yes, but the fundamental problem is that the AI crawler does the same amount of work as a legitimate user, not more.
So if you design the work such that it takes five seconds on a five year old smartphone, it could inconvenience a large portion of your user base. But once that scheme is understood by the crawler, it will delay the start of their aggressive crawling by... well-under five seconds.
An open source javascript challenge as a crawler blocker may work until it gets large enough for crawlers to care, but then they just have an engineer subscribe to changes on GitHub and have new challenge algorithms implemented before the majority of the deployment base migrates.
For some sites Anubis might be fitting, but it should be mindfully deployed.
Sure the program itself is jank in multiple ways but it solves the problem well enough.
- Everything is pwned
- Security through obscurity is bad
Without taking to heart:
- What a threat model is
And settle on a kind of permanent contrarian nihilist doomerism.
Why eat greens? You'll die one day anyway.
It's quite an interesting piece, I feel like you projected something completely different onto it.
Your point is valid, but completely adjacent.
It’s ineffective. (And furry sex-subculture propaganda pushed by its author, which is out of place in such software.)
Everything got so corporate and sterile.
ComfyUI has what I think is a foxgirl as its official mascot, and that's the de-facto primary UI for generating Stable Diffusion or related content.
Counterpoint - it seems to work. People use Anubis because it's the best of bad options.
If theory and reality disagree, it means either you are missing something or your theory is wrong.
I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present"
- https://www.htmlcenter.com/blog/now-thats-an-annoying-captch...
- https://depressedprogrammer.wordpress.com/2008/04/20/worst-c...
- https://medium.com/xato-security/a-captcha-nightmare-f6176fa...
There are some browser extensions for it too, like NopeCHA, it works 99% of the time and saves me the hassle of doing them.
Any site using CAPTCHAs today is really only hurting their real customers and low hanging fruit.
Of course this assumes they can't solve the CAPTCHA themselves, with AI, which often they can.
Early 2000s captchas really were like that.
On that note, is kernel.org really using this for free and not the paid version without the anime? Is the Linux Foundation really that desperate for cash after they gas up all the BMWs?
If it makes sense for an organization to donate to a project they rely on, then they should just donate. No need to debrand if it's not strictly required, all that would do is give the upstream project less exposure. For design reasons maybe? But LKML isn't "designed" at all, it has always exposed the raw ugly interface of mailing list software.
Also, this brand does have trust. Sure, I'm annoyed by these PoW captcha pages, but I'm a lot more likely to enable Javascript if it's the Anubis character, than if it is debranded. If it is debranded, it could be any of the privacy-invasive captcha vendors, but if it's Anubis, I know exactly what code is going to run.
It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?
Thing is, the actual lived experience of webmasters tells that the bots that scrape the internets for LLMs are nothing like crafted software. They are more like your neighborhood shit-for-brain meth junkies competing with one another over who commits more robberies in a day, no matter the profit.
Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
either way the result is the same: they induce massive load
well written crawlers will:
- not hit a specific ip/host more frequently than say 1 req/5s
- put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
- limit crawling depth based on crawled page quality and/or response time
- respect robots.txt
- make it easy to block them
Meanwhile AI farms will just run their own nuclear reactors eventually and be unaffected.
I really don't understand why someone thought this was a good idea, even if well intentioned.
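Going back to the list of what well-written crawlers do: the first two items (a per-host delay and a global queue instead of per-domain DFS) are not much code. A rough sketch, assuming a single process and a caller-supplied clock; the class name `PoliteFrontier` and the example URLs are invented for illustration, and robots.txt handling is left out.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """FIFO URL frontier that never hits one host more often than min_delay."""

    def __init__(self, min_delay: float = 5.0):
        self.min_delay = min_delay   # seconds between requests to the same host
        self.queue = deque()         # newly discovered URLs go to the back
        self.seen = set()            # avoid re-queueing the same URL
        self.last_hit = {}           # host -> time of the last request

    def add(self, url: str) -> None:
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_ready(self, now: float):
        """Return the first queued URL whose host has cooled down, else None."""
        for _ in range(len(self.queue)):
            url = self.queue.popleft()
            host = urlsplit(url).hostname
            last = self.last_hit.get(host)
            if last is None or now - last >= self.min_delay:
                self.last_hit[host] = now
                return url
            self.queue.append(url)   # host still cooling down; retry later
        return None


frontier = PoliteFrontier(min_delay=5.0)
for u in ("https://a.example/1", "https://a.example/2", "https://b.example/1"):
    frontier.add(u)
print(frontier.next_ready(0.0))  # first URL for a.example
print(frontier.next_ready(1.0))  # a.example is cooling down, so b.example goes next
```

A distributed crawler would need shared state for `last_hit`, but the point stands that none of this is hard to implement.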
It seems there is a large number of operations crawling the web to build models that aren't directly using infrastructure hosted on AI farms, but rather botnets running on commodity hardware and residential networks to keep their IP ranges from being blacklisted. Anubis's point is to block those.
Because I've got the same model line but about 3 or 4 years older and it usually just flashes by in the browser Lightning from F-droid which is an OS webview wrapper. On occasion a second or maybe two, I assume that's either bad luck in finding a solution or a site with a higher difficulty setting. Not sure if I've seen it in Fennec (firefox mobile) yet but, if so, it's the same there
I've been surprised that this low threshold stops bots but I'm reading in this thread that it's rather that bot operators mostly just haven't bothered implementing the necessary features yet. It's going to get worse... We've not even won the battle let alone the war. Idk if this is going to be sustainable, we'll see where the web ends up...
I've certainly seen Anubis take a few seconds (three or four maybe) but that was on a very old phone that barely loaded any website more complex than HN.
Maybe there's going to be some form of pay per browse system? even if it's some negligible cost on the order of 1$ per month (and packaged with other costs), I think economies of scale would allow servers to perform a lifetime of S24 captchas in a couple of seconds.
This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.
Although the long term problem is the business model of servers paying for all network bandwidth.
Actual human users have consumed a minority of total net bandwidth for decades:
https://www.atom.com/blog/internet-statistics/
Part 4 shows bots out-using humans in 1996 8-/
What are "bots"? This needs to include googleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress, ownership is refusing to acknowledge the problem, and insisting on business as usual.
Rational predictions are that it's not going to end well...
Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
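Spelling out that back-of-envelope estimate: the specific numbers below are my own stand-ins for "dozens" and "10 times", not measurements.

```python
def training_downloads_per_year(models_per_year: int, crawls_per_model: int) -> int:
    """Full-content downloads of one site per year under the worst-case guess."""
    return models_per_year * crawls_per_model


# Assumed inputs: 24 models a year ("dozens"), each crawling the site 10 times.
downloads = training_downloads_per_year(models_per_year=24, crawls_per_model=10)
per_day = downloads / 365
print(downloads, round(per_day, 2))  # a couple of hundred a year, under one a day
```

Even under deliberately pessimistic assumptions, that is less than one full crawl of a site per day.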
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
As long as this challenge remains obscure enough to be not worth implementing special handlers in the crawler, this sounds a neat idea.
But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.
I wonder if anyone tried building a crawler-firewall or even nginx script which will let the site admin plug their own challenge generator in lua or something, which would then create a minimum HTML form. Maybe even vibe code it :)
That is literally an anti-human filter.
Not for me, I have nothing but a hard time solving CAPTCHAs, about 50% of the time I give up after 2 tries.
1. Anubis makes you calculate a challenge.
2. You get a "token" that you can use for a week to access the website.
3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.
The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.
- https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`
- https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)
The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
But still enough to prevent a billion request DDoS
These sites have been scraped by search engines forever. It’s not about blocking bots entirely, just about this new wave of fuck-you-I-don’t-care-if-your-host-goes-down quasi-malicious scrapers.
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
This is probably intentional. They offer a paid unbranded version. If they had a corporate friendly brand on the free offering, then there would be fewer people paying for the unbranded one.
Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.
So, I don't see an error code + something fun to be that bad.
People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today, so I don't see how having fun error pages is such an issue.
Where does one even find a VPS with such small memory today?
1. Store the nonce (or some other identifier) of each jwt it passes out in the data store
2. Track the number or rate of requests from each token in the data store
3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)
Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.
It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
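The three steps above can be sketched as a small in-memory tracker. This is only an illustration of the idea, not how Anubis is implemented: the class name `TokenLimiter`, the sliding-window policy, and the thresholds are all assumptions, and a real deployment would keep this state in a shared data store.

```python
from collections import defaultdict, deque


class TokenLimiter:
    """Track requests per issued token and revoke tokens that go too fast."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # token id -> timestamps of recent requests
        self.revoked = set()            # token ids that exceeded the limit

    def allow(self, token_id: str, now: float) -> bool:
        if token_id in self.revoked:
            return False
        hits = self.hits[token_id]
        # Drop timestamps that have fallen out of the sliding window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Step 3: revoke, so the client has to solve a new challenge.
            self.revoked.add(token_id)
            del self.hits[token_id]
            return False
        return True


limiter = TokenLimiter(max_requests=3, window_seconds=10.0)
print([limiter.allow("token-a", t) for t in (0.0, 1.0, 2.0, 3.0)])
```

Revocation is the harshest option here; swapping the `revoked.add` branch for a throttle or tarpit would be a small change.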
I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.
Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times but I think there are far too many people who act obnoxiously (as part of what can only be described as a new subculture) in open source software today and I wish there were better terms to describe it.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
Is it worth it? Millions of users wasting cpu and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.
Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?
The internet became a pain to use... back in the time, you opened the website and saw the content. Now you open it, get an antibot check, click, forward to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down.. chat with AI bot popup... a bit further down login here to see full article...
Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
Seriously though, does anything of Apple's work without JS, like iCloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?
Doing the proof-of-work for every request is apparently too much work for them.
Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.
If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.
Anubis usually clears with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.
Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.
I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?
As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.
i.e. it's DDoS protection.
The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
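For anyone who hasn't seen this kind of puzzle, here is a minimal Hashcash-style sketch of "brute force a math problem". It is not Anubis's actual algorithm: the challenge string, the use of SHA-256, and counting difficulty in leading hex zeros are assumptions made for the example.

```python
import hashlib


def _digest(challenge: str, nonce: int) -> str:
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the hash starts with `difficulty` hex zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


nonce = solve("example-challenge", difficulty=3)
print(verify("example-challenge", nonce, difficulty=3))
```

Each extra hex digit of difficulty multiplies the expected client work by 16, while the server's check stays at one hash. That asymmetry is the whole mechanism.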
It's not perfect, but much much better than putting everything behind Cloudflare.
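For anyone who wants to see the "brute force a math problem" idea concretely, here is a minimal Hashcash-style sketch. It is not Anubis's actual code: the challenge string, the SHA-256 choice and the "leading zero hex digits" difficulty rule are assumptions made for illustration.

```python
import hashlib
from itertools import count


def _digest(challenge: str, nonce: int) -> str:
    # Hash the server-issued challenge together with the candidate nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: try nonces until the digest starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    for nonce in count():
        if _digest(challenge, nonce).startswith(target):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check the client's answer."""
    return _digest(challenge, nonce).startswith("0" * difficulty)


def expected_hashes(difficulty: int) -> int:
    """Average number of attempts: each extra zero hex digit multiplies the work by 16."""
    return 16 ** difficulty
```

The asymmetry is the point: the server pays one hash to verify, while each solver pays roughly `expected_hashes(difficulty)` hashes per challenge, which only hurts if you have to solve a lot of them.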
Sure, if you ignore that humans click on one page while the problematic scrapers (not the normal search engine volume, but the level we see nowadays, where misconfigured crawlers go insane on your site) request many thousands to millions of times more pages per minute. So they'll need many, many times the compute to keep hammering your site, whereas a normal user can easily afford to load the one page from the search results that they were interested in.
It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit access by requiring client-side compute that is light enough for legitimate human users and responsible crawlers, but taxing enough to be costly for indiscriminate crawlers that request host resources excessively.
It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.
No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.
Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that by the page size (images etc.) and you probably get a few megabits per second of bandwidth, no problem even for a Raspberry Pi 1 if the SD card can read fast enough or the files are mapped to RAM by the kernel.
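To put rough numbers on the "requests per second times page size" estimate above, here is a back-of-the-envelope helper. The request rate and page size in the usage note are made-up illustrative values, not measurements of any real site.

```python
def bandwidth_mbit_per_s(requests_per_second: float, page_size_kb: float) -> float:
    """Average outbound bandwidth in megabits per second (1 KB = 1000 bytes)."""
    bytes_per_second = requests_per_second * page_size_kb * 1000
    return bytes_per_second * 8 / 1_000_000
```

For example, assuming 3 requests per second and 200 KB per page load, `bandwidth_mbit_per_s(3, 200)` comes out at 4.8 Mbit/s, which is in the "few megabits" range.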
With the current approach we just waste the energy; if you use bitcoin that has already been mined (i.e. energy previously wasted), it becomes sustainable.
About the difficulty of proving you are human especially when every test built has so much incentive to be broken. I don't think it will be solved, or could ever be solved.
That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark of the cost imposed on legitimate users. So there's no point in theorizing about an attacker solving the challenges more cheaply than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
Haven't seen dumb anime characters since.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straightforward cat-and-mouse game. And that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. The hurdle grows as more sites adopt this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than they do for any one random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
... has phpBB not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
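A minimal sketch of that trick, for illustration only. This is not phpBB code: the cookie names, the `handle_request` function and the in-memory `sessions` dict (standing in for the database table) are all hypothetical.

```python
import secrets

PROBE_COOKIE = "cookie_probe"   # hypothetical name: stateless "do you keep cookies?" marker
SESSION_COOKIE = "session_id"   # hypothetical name: the real session cookie


def handle_request(cookies: dict[str, str], sessions: dict[str, dict]) -> dict[str, str]:
    """Return the cookies to set on the response.

    A session row is only created once the client has echoed the probe cookie,
    i.e. on the second visit at the earliest.
    """
    session_id = cookies.get(SESSION_COOKIE)
    if session_id in sessions:
        return {}  # known session, nothing new to set
    if PROBE_COOKIE in cookies:
        # The client proved it stores cookies, so a session is worth persisting.
        session_id = secrets.token_hex(16)
        sessions[session_id] = {}
        return {SESSION_COOKIE: session_id}
    # First visit, or a cookieless bot: hand out the probe and store nothing.
    return {PROBE_COOKIE: "1"}
```

With this shape, a bot that never sends cookies back can hit the site as often as it likes without ever adding a row to the session store.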
Personally I have no issue with AI bots that properly identify themselves scraping content, because if the site operator doesn't want it to happen they can easily block the offending bot(s).
We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.
I mean honestly it wouldn't be _that_ hard to enable them to run JavaScript or to emulate a real/accurate User-Agent. They could even run headless versions of the browser engines...
It's definitely going to be cat-and-mouse.
The brutally honest truth is that if they throttled themselves so as not to totally crash whatever site they're trying to scrape, we'd probably never have noticed or gone through the trouble of writing our own proof-of-work challenge.
Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
A 2GB memory requirement won't stop them, but it will limit the parallelism of crawlers.
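The parallelism limit is just division. A tiny sketch, where the 2 GB default comes from the comment above and the machine size in the usage note is a hypothetical example:

```python
def max_parallel_challenges(total_ram_gb: float, per_challenge_gb: float = 2.0) -> int:
    """How many memory-bound challenges fit in RAM at the same time."""
    return int(total_ram_gb // per_challenge_gb)
```

On an assumed 64 GB crawler box, `max_parallel_challenges(64)` gives 32 concurrent solves, however many CPU cores the box has.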
Money is the best proof of humanity.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy and sluggish internet. It's frustrating enough to guess school buses every time I want to get work done; now I have to see pornified kitty waifus.
(openwrt is another community plagued with this crap)
It does have arty political vibes though: the distributed and decentralized open source internet with guardian catgirls vs. late-stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.
I'm not a huge fan of the anime thing, but I can live with it.
If you disagree, please say why.
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
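Blackholing is normally done at the router or firewall, but the same idea can be sketched at the application level with the standard `ipaddress` module. The CIDR ranges below are reserved documentation ranges used as placeholders, not the real address blocks of any company.

```python
import ipaddress

# Placeholder networks (documentation ranges), NOT real crawler address blocks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """True if the client address falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # Membership across IP versions (v4 address vs. v6 network) is simply False.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A request handler would call `is_blocked()` on the peer address and drop the connection early, before doing any expensive work.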
* or whatever site the author is talking about; his site is currently inaccessible due to the number of people trying to load it.
If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using their service (with deltas or something to reduce bandwidth), I bet nobody would bother actually spending the time to automate cracking it, since it's basically negative value. You could even make it a torrent so most of the costs are paid by random large labs/universities.
I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.