Anubis Works (opens in new tab)

(xeiaso.net)

319 pointsevacchi1y ago208 comments

208 comments

Related Anubis: Proof-of-work proxy to prevent AI crawlers (100 points, 23 days ago, 58 comments) https://news.ycombinator.com/item?id=43427679

raggi1y ago

It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

Unesco surprised me some, the sub-site in question is pretty big, it has thousands of documents of content, but the content is static - this should be trivial to serve, so what's going on? Well it looks like it's a poorly deployed Wordpress on top of Apache, with no caching enabled, no content compression, no HTTP 2/3. It would likely be fairly easy to get this serving super cheap on a very small machine, but of course doing so requires some expertise, and expertise still isn't cheap.

Sure you could ask an LLM, but they still aren't good at helping when you have no clue what to ask - if you don't even really know the site is slower than it should be, why would you even ask? You'd just hear about things getting crushed and reach for the furry defender.

adrian171y ago

> but of course doing so requires some expertise, and expertise still isn't cheap

Sure, but at the same time, the number of people with expertise to set up Anubis (not that it's particularly hard, but I mean: even be aware that it exists) is surely even lower than of people with Wordpress administration experience, so I'm still surprised.

If I were to guess, the reasons for not touching Wordpress were unrelated, like: not wanting to touch a brittle instance, or organization permissions, or maybe the admins just assumed that WP is configured well already.

raggi1y ago

I have trouble with that because it’s brimming full of plugins too (see them all disorganized all over the source), and failing to keep such a system up to date ends in tears rapidly in that ecosystem.

jtbayly1y ago

My site that I’d like this for has a lot of posts, but there are links to a faceted search system based on tags that produces an infinite number of possible combinations and pages for each one. There is no way to cache this, and the bots don’t respect the robots file, so they just constantly request URLs, getting the posts over and over in different numbers and combinations. It’s a pain.

mrweasel1y ago

> I am kind of surprised how many sites seem to want/need this.

The AI scrapers are not only poorly written, they also go out of their way to do cache busting. So far I've seen a few solutions, CloudFlare, require a login, Anubis, or just insane amounts of infrastructure. Some site have reported 60% of their traffic coming from bots not, smaller sites is probably much higher.

MrJohz1y ago

Fwiw, I run a pretty tiny site and see relatively minimal traffic coming from bots. Most of the bot traffic, when it appears, is vulnerability scanners (the /wp-admin/ requests on a static site), and has little impact on my overall stats.

My guess is that these tools tend to be targeted at mid-sized sites — the sorts of places that are large enough to have useful content, but small enough that there probably won't be any significant repercussions, and where the ops team is small enough (or plain nonexistent) that there's not going to be much in the way of blocks. That's why a site like SourceHut gets hit quite badly, but smaller blogs stay largely out of the way.

But that's just a working theory without much evidence trying to justify why I'm hearing so many people talking about struggling with AI bot traffic and not seeing it myself.

nicolapcweek941y ago

Well, we just spun up anubis in front of a two user private (as in publicly accessible but with almost all content set to private/login protected) forgejo instance after it started getting hammered (mostly by amazon ips presenting as amazonbot) earlier in the week, resulting in a >90% traffic reduction. From what we’ve seen (and Xe’s own posts) it seems git forges are getting hit harder than most other sites, though, so YMMV i guess.

mrweasel1y ago

I actually have a theory, based on the last episode of the 2.5 admins podcast. Try spinning up a MediaWiki site. I have a feeling that wiki installation are being targeted to a much higher degree. You could also do a Git repo of some sort. Either two could give the impression that content is changed frequently.

2 more replies

cedws1y ago

PoW anti-bot/scraping/DDOS was already being done a decade ago, I’m not sure why it’s only catching on now. I even recall a project that tried to make the PoW useful.

xena1y ago

Xe here. If I had to guess in two words: timing and luck. As the G-man said: the right man in the wrong place can make all the difference in the world. I was the right shitposter in the right place at the right time.

And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

underdeserver1y ago

Squeeze that lemon as far as it'll go mate, god speed and may the good luck continue.

gyomu1y ago

If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works

This is pretty cool, I have a project or two that might benefit from it.

x3haloed1y ago

I’ve been wondering to myself for many years now whether the web is for humans or machines. I personally can’t think of a good reason to specifically try to gate bots when it comes to serving content. Trying to post content or trigger actions could obviously be problematic under many circumstances.

But I find that when it comes to simple serving of content, human vs. bot is not usually what you’re trying to filter or block on. As long as a given client is not abusing your systems, then why do you care if the client is a human?

xboxnolifes1y ago

> As long as a given client is not abusing your systems, then why do you care if the client is a human?

Well, that's the rub. The bots are abusing the systems. The bots are accessing the contents at rates thousands of times faster and more often than humans. The bots also have access patterns unlike your expected human audience (downloading gigabytes or terabytes of data multiples times, over and over).

And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.

immibis1y ago

Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

Actual ISPs and hosting providers take abuse reports extremely seriously, mostly because they're terrified of getting kicked off by their ISP. And there's no end to that - just a chain of ISPs from them to you and you might end with convincing your ISP or some intermediary to block traffic from them. However, as we've seen recently, rules don't apply if enough money is involved. But I'm not sure if these shitty interim solutions come from ISPs ignoring abuse when money is involved, or from not knowing that abuse reporting is taken seriously to begin with.

Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?

4 more replies

bbor1y ago

Well, that's the meta-rub: if they're abusing, block abuse. Rate limits are far simpler, anyway!

In the interest of bringing the AI bickering to HN: I think one could accurately characterize "block bots just in case they choose to request too much data" as discrimination! Robots of course don't have any rights so it's not wrong, but it certainly might be unwise.

1 more reply

t-writescode1y ago

> I personally can’t think of a good reason to specifically try to gate bots

There's been numerous posts on HN about people getting slammed, to the tune of many, many dollars and terabytes of data from bots, especially LLM scrapers, burning bandwidth and increasing server-running costs.

ronsor1y ago

I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

4 more replies

praptak1y ago

The good thing about proof of work is that it doesn't specifically gate bots.

It may have some other downsides - for example I don't think that Google is possible in a world where everyone requires proof of work (some may argue it's a good thing) but it doesn't specifically gate bots. It gates mass scraping.

fc417fc8021y ago

Things like google are still possible. Operators would need to whitelist services.

Alternatively shared resources similar in spirit to common crawl but scaled up could be used. That would have the benefit of democratizing the ability to create and operate large scale search indexes.

brikym1y ago

As both a website host and website scraper I can see both sides of it. The website owners have very little interest in opening their data up; if they did they'd have made an API for it. In my case it's scraping supermarket prices so obviously big-grocery doesn't want a spot light on their arbitrary pricing patterns. It's frustrating for us scrapers but from their perspective opening up to bots is just a liability. Besides bots just spamming the servers getting around rate limits with botnets and noise any new features added by bots probably won't benefit them. If I made a bot service that would split your orders over multiple supermarkets, or buy items temporally as prices drop that wouldn't benefit the companies. All the work they've put into their site is to bring them to the status quo and they want to keep it that way. The companies don't want an open internet, only we do. I'd like to see some transparency laws so that large companies need to publish their pricing.

gbear6051y ago

The issue is not whether it’s a human or a bot. The issue is whether you’re sending thousands of requests per second for hours, effectively DDOSing the site, or if you’re behaving like a normal user.

laserbeam1y ago

The reason is: bots DO spam you repeatedly and increase your network costs. Humans don’t abuse the same way.

starkrights1y ago

Example problem that I’ve seen posted about a few times on HN: LLM scrapers (or at least, an explosion of new scrapers) exploding and mindlessly crawling every singly HTTP endpoint of a hosted git-service, instead of just cloning the repo. (entirely ignoring robots.txt)

The point of this is that there has recently been a massive explosion in the amount of bots that blatantly, aggressively, and maliciously ignore and attempt to bypass (mass ip/VPN switching, user agent swapping, etc) anti-abuse gates.

1 more reply

mieses1y ago

There is hope for misguided humans.

namanyayg1y ago

"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

xena1y ago

OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.

didgeoridoo1y ago

I didn’t even blink at this, my inner monologue just did a little “well, naturally” in a Redditor voice and kept reading.

mkl1y ago

BTW Xe, https://xeiaso.net/pronouns is 404 since sometime last year, but it is still linked to from some places like https://xeiaso.net/blog/xe-2021-08-07/ (I saw "his" above and went looking).

xena1y ago

I'm considering making it come back, but it's just gotten me too much abuse so I'm probably gonna leave it 404-ing until society is better.

2 more replies

pie_flavor1y ago

Unfortunately it is also false (if taken out of context; Anubis rounds the time to the nearest week, which is probably good enough if the next-nearest week is valid too). Clock desync for a variety of reasons is pervasive - you can't expect the 10th percentile of your users to be accurate even to the day, and even the 25th percentile will be five minutes or so off.

AnonC1y ago

Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

Tangentially, I was wondering how this would impact common search engines (not AI crawlers) and how this compares to Cloudflare’s solution to stop AI crawlers, and that’s explained on the GitHub page. [1]

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

[1]: https://github.com/TecharoHQ/anubis/

JsonCameron1y ago

Yeah. Unfortunately at the current moment it does prevent indexing. Perhaps down the line we can whitelist search engines ips. However some like google, use the same for the AI and search indexing.

We are still making some improvements like passing open graph tags through so at least rich previews work!

snvzz1y ago

>Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

Love them too, and abhor knowing that someone is bound to eventually remove them because found to be "problematic" in one way or another.

pohuing1y ago

There's this funny instance[1] of someone afraid their their gf might see them and think they're into anime. But anyhow using an image and the image itself is up to the site since Anubis let's you configure it.

[1] https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...

prologic1y ago

I've read about Anubis, cool project! Unfortunately, as pointed out in the comments, requires your site's visitors to have Javascript™ enabled. This is totally fine for sites that require Javascript™ anyway to enhance the user experience, but not so great for static sites and such that require no JS at all.

I built my own solution that effectively blocks these "Bad Bots" at the network level. I effectively block the entirety of several large "Big Tech / Big LLM" networks entirely at the ASN (BGP) by utilizing MaxMind's database and a custom WAF and Reverse Proxy I put together.

xyzzy_plugh1y ago

A significant portion of the bot traffic TFA is designed to handle originates from consumer/residential space. Sure, there are ASN games being played alongside reputation fraud, but it's very hard to combat. A cursory investigation of our logs showed these bots (which make ~1 request from a given residential IP) are likely in ranges that our real human users occupy as well.

Simply put you risk blocking legitimate traffic. This solution does as well but for most humans the actual risk is much lower.

As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

It is an incredible vocal minority who disapprove of requiring JavaScript, the majority of whom, upon encountering a site for which JavaScript is required, simply enable it. I'd speculate that, even then, only a handful ever release a defeated sigh.

prologic1y ago

This is true. I had some bad actors from the ComCast Network at one point. And unfortunately also valid human users of some of my "things". So I opted not to block the ComCast ASN at that point.

xyzzy_plugh1y ago

Exactly. We've all been down this rabbit hole, collectively, and that's why Anubis has taken off. It works shockingly well.

1 more reply

prologic1y ago

I would be interested to hear of any other solutions that guarantee to either identity or block non-Human traffic. In the "small web" and self-hosting, we typically don't really want Crawlers, and other similar software hitting our services, because often the software is either buggy in the first place (Example: Runaway Claude Bot) or you don't want your sites indexed by them in the first place.

Cyphase1y ago

For anyone wondering, Oracle holds the trademark for "JavaScript": https://javascript.tm/

prologic1y ago

Which arguably they should let go of

jadbox1y ago

How do you know it's an LLM and not a VPN? How do you use this MaxMind's database to isolate LLMs?

prologic1y ago

I don't distinguish actually. There are two things I do normally:

- Block Bad Bots. There's a simple text file called `bad_bots.txt` - Block Bad ASNs. There's a simple text file called `bad_asns.txt`

There's also another for blocking IP(s) and IP-ranges called `bad_ips.txt` but it's often more effective to block an much larger range of IPs (At the ASN level).

To give you an concrete idea, here's some examples:

$ cat etc/caddy/waf/bad_asns.txt # CHINANET-BACKBONE No.31,Jin-rong Street, CN # Why: DDoS 4134

# CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN # Why: DDoS 4837

# CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN # Why: DDoS 9808

# FACEBOOK, US # Why: Bad Bots 32934

# Alibaba, CN # Why: Bad Bots 45102

# Why: Bad Bots 28573

runxiyu1y ago

Do you have a link to your own solution?

JsonCameron1y ago

I have a pretty similar one. (Works off of the same concept) https://github.com/JasonLovesDoggo/caddy-defender if you're curious. Keep in mind this will not protect you against residential IP scraping.

prologic1y ago

Not yet unfortunately. But if you're interested, please reach out! I currently run it in a 3-region GeoDNS setup with my self-hosted infra.

roenxi1y ago

I like the idea but this should probably be something that is pulled down into the protocol level once the nature of the challenge gets sussed out. It'll ultimately be better for accessibility if the PoW challenge is closer to being part of TCP than implemented in JavaScript individually by each website.

pona-a1y ago

There's Cloudflare PrivacyPass that became an IETF standard [0], but it's rather weird, and the reference implementation is a bug nest.

[0] https://datatracker.ietf.org/wg/privacypass/about/

fc417fc8021y ago

Ship an arbitrary challenge as a SPIR-V or MLIR black box. Integrate the challenge-response exchange with HTTP. That should permit broad support and flexible hardware acceleration.

The "good enough" solution is the existing and widely used SHA( seed, nonce ). That could easily be integrated into a lower level of the stack if the tech giants wanted it.

tripdout1y ago

The bot detection takes 5 whole seconds to solve on my phone, wow.

bogwog1y ago

I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around ~8 seconds at difficulty 4.

Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.

Hakkin1y ago

Much better than infinite Cloudflare captcha loops.

gruez1y ago

I've never had that, even with something like tor browser. You must be doing something extra suspicious like an user agent spoofer.

praisewhitey1y ago

Firefox with Enhanced Tracking Protection turned on is enough to trigger it.

2 more replies

megous1y ago

Proper response here is "fuck cloudflare", instead of blaming the user.

1 more reply

xena1y ago

Apparently user-agent switchers don't work for fetch() requests, which means that Anubis can't work with people that do that. I know of someone that set up a version of brave from 2022 with a user-agent saying it's chrome 150 and then complaining about it not working for them.

oynqr1y ago

Lucky. Took 30s for me.

nicce1y ago

For me it is like 0.5s. Interesting.

cookiengineer1y ago

I am currently building a prototype of what I call the "enigma webfont" where I want to implement user sessions with custom seeds / rotations for a served and cached webfont.

The goal is to make web scraping unfeasible because of computational costs for OCR. It's a cat and mouse game right now and I want to change the odds a little. The HTML source would be effectively void without the user session, meaning an OTP like behavior could also make web pages unreadable once the assets go uncached.

This would allow to effectively create a captcha that would modify the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrott", for example.

I sure would love to hear your input, Xe. Maybe we can combine our efforts?

My tech stack is go, though, because it was the only language where I could easily change the webfont files directly without issues.

lifthrasiir1y ago

Besides from the obvious accessibility issue, wouldn't that be a substitution cipher at best? Enough corpus should render its cryptanalysis much easier.

cookiengineer1y ago

Well, the idea is basically the same as using AES-CBC. CBC is useless most of the time because of static rotations, but it makes cracking it more expensive.

With the enigma webfont idea you can even just select a random seed for each user/cache session. If you map the URLs based on e.g. SHA512 URLs via the Web Crypto API, there's no cheap way of finding that out without going full in cracking mode or full in OCR/tesseract mode.

And cracking everything first, wasting gigabytes of storage for each amount of rotations and seeds...well, you can try but at this point just ask the admin for the HTML or dataset instead of trying to scrape it, you know.

In regards to accessibility: that's sadly the compromise I am willing to do, if it's a technology that makes my specific projects human eyes only (Literally). I am done taking the costs for hundreds of idiots that are too damn stupid to clone my website from GitHub, letting alone violating every license in each of their jurisdictions. If 99% of traffic is bots, it's essentially DDoSing on purpose.

We have standards for data communication, it's just that none of these vibe coders gives a damn about building semantically correct HTML and parsers for RDF, microdata etc.

lifthrasiir1y ago

No, I was talking about generated fonts themselves; each glyph would have an associated set of control points which can be used to map a glyph to the correct letter. No need to run the full OCR, you need a single small OCR job per each glyph. You would need quite elaborate distortions to avoid this kind of attack, and such distortions may affect the reading experience.

1 more reply

creata1y ago

There's probably something horrific you can do with TrueType to make it more complex than a substitution cipher.

lifthrasiir1y ago

GSUB rules are inherently local, so for example the same cryptanalysis approach should work for space-separated words instead of letters. A polyalphabetic cipher would work better but that means you can't ever share the same internal glyph for visually same but differently encoded letters.

cookiengineer1y ago

The hint I want to give you is: unicode and ligatures :) they're awesome in the worst sense. Words can be ligatures, too, btw.

rollcat1y ago

The problem isn't as much that the websites are scraped (search engines have been doing this for over three decades), it's the request volume that brings the infrastructure down and/or costs up.

I don't think mangling the text would help you, they will just hit you anyway. The traffic patterns seem to indicate that whoever programmed these bots, just... <https://www.youtube.com/watch?v=ulIOrQasR18>

> I sure would love to hear your input, Xe. Maybe we can combine our efforts?

From what I've gathered, they need help in making this project more sustainable for the near and far future, not to add more features. Anubis seems to be doing an excellent job already.

pabs31y ago

It works to block users who have JavaScript disabled, that is for sure.

udev40961y ago

Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for nojs, they are the same as "AI" scrapers on enshittyfying the web

throwaway1501y ago

Looks cool. But please help me understand. What's to stop AI companies from solving the challenge, completing the proof of work and scrape websites anyway?

crq-yml1y ago

It's a strategy to redefine the doctrine of information warfare on the public Internet from maneuver(leveraged and coordinated usage of resources to create relatively greater effects) towards attrition(resources are poured in indiscriminately until one side capitulates).

Individual humans don't care about a proof-of-work challenge if the information is valuable to them - many web sites already load slowly through a combination of poor coding and spyware ad-tech. But companies care, because that changes their ability to scrape from a modest cost of doing business into a money pit.

In the earlier periods of the web, scraping wasn't necessarily adversarial because search engines and aggregators were serving some public good. In the AI era it's become belligerent - a form of raiding and repackaging credit. Proof of work as a deterrent was proposed to fight spam decades ago(Hashcash) but it's only now that it's really needed to become weaponized.

marginalia_nu1y ago

The problem with scrapers in general is the asymmetry of compute resources involved in generating versus requesting a website. You can likely make millions of HTTP requests with the compute required in generating the average response.

If you make it more expensive to request a documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah it really does matter. Even if you have a big VC budget.

Nathanba1y ago

Yes but the scraper only has to solve it once and it gets cached too right? Surely it gets cached, otherwise it would be too annoying for humans on phones too? I guess it depends on whether scrapers are just simple curl clients or full headless browsers but I seriously doubt that Google tier LLM scrapers rely on site content loading statically without js.

ndiddy1y ago

AI companies have started using a technique to evade rate limits where they will have a swarm of tens of thousands of scraper bots using unique residential IPs all accessing your site at once. It's very obvious in aggregate that you're being scraped, but when it's happening, it's very difficult to identify scraper vs. non-scraper traffic. Each time a page is scraped, it just looks like a new user from a residential IP is loading a given page.

Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.

2 more replies

Hakkin1y ago

It sets a cookie with a JWT verifying you completed the proof-of-work along with metadata about the origin of the request, the cookie is valid for a week. This is as far as Anubis goes, once you have this cookie you can do whatever you want on the site. For now it seems like enough to stop a decent portion of web crawlers.

You can do more underneath Anubis using the JWT as a sort of session token though, like rate limiting on a per proof-of-work basis, if a client using X token makes more than Y requests in a period of time, invalidate the token and force them to generate a new one. This would force them to either crawl slowly or use many times more resources to crawl your content.

FridgeSeal1y ago

It seems a good chunk of the issue with these modern LLM scrapers is that they are doing _none_ of the normal “sane” things. Caching content, respecting rate limits, using sitemaps, bothering to track explore depth properly, etc.

charcircuit1y ago

If you make it prohibitively expensive almost no regular user will want to wait for it.

xboxnolifes1y ago

Regular users usually aren't page hopping 10 pages per second. A regular user is usually 100 times less than that.

1 more reply

bobmcnamara1y ago

Exponential backoff!

ndiddy1y ago

This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works. When GNOME deployed it on their Gitlab, they found that around 97% of the traffic in a given 2.5 hour period was blocked by Anubis. https://social.treehouse.systems/@barthalion/114190930216801...

dragonwriter1y ago

> This makes it much more expensive for them to scrape because they have to run full web browsers instead of limited headless browsers without full Javascript support like they currently do. There's empirical proof that this works.

It works in the short term, but the more people that use it, the more likely that scrapers start running full browsers.

SuperNinKenDo1y ago

That's the point. An individual user doesn't lose sleep over using a full browser, that's exactly how they use the web anyway, but for an LLM scraper or similar, this greatly increases costs on their end and partially thereby rebalances the power/cost imbalance, and at the very least, encourages innovations to make the scrapers externalise costs less by not rescraping things over and over again just because you're too lazy, and the weight of doing so is born by somebody else, not you. It's an incentive correction for the commons.

sadeshmukh1y ago

Which are more expensive - you can't run as many especially with Anubis

perching_aix1y ago

Nothing. The idea instead that at scale the expenses of solving the challenges becomes too great.

userbinator1y ago

This is basically the DRM wars again. Those who have vested interests in mass crawling will have the resources to blast through anything, while the legit users get subjected to more and more draconian measures.

SuperNinKenDo1y ago

I'll take this over a Captcha any day.

userbinator1y ago

CAPTCHAs don't need JS, nor does asking a question that an LLM can't answer but a human can.

Proof-of-work selects for those with the computing power and resources to do it. Bitcoin and all the other cryptocurrencies show what happens when you place value on that.

1 more reply

ronsor1y ago

I know companies that already solve it.

wredcoll1y ago

I mean... knowing how to solve it isn't the trick, it's doing it a million times a minute for your firehose scraper.

udev40961y ago

Anubis adds a cookie name `within.website-x-cmd-anubis-auth` which can be used by scrapers for not solving it more than once. Just have a fleet of servers whose sole purpose is to extract the cookie after solving the challenges and make sure all of them stay valid. It's not a big deal

1 more reply

creata1y ago

Why is spending all that CPU time to scrape the handful of sites that use Anubis worth it to them?

vhcr1y ago

Because it's not a lot of CPU, you only have to solve it once per website, and the default policy difficulty of 16 for bots is worthless because you can just change your user agent so you get a difficulty of 4.

mushufasa1y ago

I looked through the documentation and I've come across a couple sites using this already.

Genuine question: why not leverage the proof-of-work challenge literally into mining that generates some revenue for a website? Not a new idea, but when I looked at the docs it didn't seem like this challenge was tied to any monetary coin value.

This is coming from someone who is NOT a big crypto person, but it strikes me that this would be a much better way to monetize organic high quality content in this day and age. Basically the idea that Brave browser started with, meeting it's moment.

I'm sure Xe has already considered this. Do they have a blog post about this anywhere?

snvzz1y ago

My Amiga 1200 hates these tools.

It is really sad that the worldwide web has been taken to the point where this is needed.

pabs31y ago

Recently I heard of a site blocking bot requests with a message telling the bot to download the site via Bittorrent instead.

Seems like a good solution to the badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like Bittorrent anyway.

seba_dos11y ago

Even if these stupid bots would just learn to clone git repos instead of crawling through GitLab UI pages it would already be helpful.

deknos1y ago

I wish, there was also an tunnel software (client+server) where

* the server appears on the outside as an https server/reverse proxy * the server supports self-signed-certificates or letsencrypt * when a client goes to a certain (sub)site or route, http auth can be used * after http auth, all traffic tunnel over that subsite/route is protected against traffic analysis, for example like the obfsproxy does it.

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but i do not think his tool is meant for that...

rollcat1y ago

Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

> his

I believe it's their.

deknos1y ago

ups, yes, sorry, their.

immibis1y ago

Tor's Webtunnel?

deknos1y ago

but i do not want to go OVER tor, i just want a service over clearnet? or is this something else? do you have an URL?

immibis1y ago

I presume the protocol can be separated from Tor itself and I also presume this standalone thing doesn't exist yet.

In any situation, you're going to need some custom client code to route your traffic through the tunnel you opened, so I'm not sure why the login page that opens the tunnel needs to be browser-compatible?

dmtfullstack1y ago

Humans are served by bots. Any bot requesting traffic is doing so on behalf of a human somewhere.

What is the problem with bots asking for traffic, exactly?

Context of my perspective: I am a contractor for a team that hosts thousands of websites on a Kubernetes cluster. All of the websites are on a storage cluster (combination of ZFS and Ceph) with SATA and NVMe SSDs. The machines in the storage cluster and also the machines the web endpoints run on have tons of RAM.

We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.

Tarq0n1y ago

Ok? Not everyone has the same resources or technical sophistication.

udev40961y ago

PoW captchas are not new. What's different with Anubis? How can it possibly prevent "AI" scrapers if the bots have enough compute to solve the PoW challenge? AI companies have quite a lot of GPUs at their disposal and I wouldn't be surprised if they used it for getting around PoW captchas

relistan1y ago

The point is to make it expensive to crawl your site. Anyone determined to do so is not blocked. But why would they be determined to do so for some random site? The value to the AI crawler likely does not match the cost to crawl it. It will just move on to another site.

So the point is not to be faster than the bear. It’s to be faster than your fellow campers.

genewitch1y ago

Why not have them hash pow for btc then?

sprremix1y ago

Why must everything involve $'s?

1 more reply

matt32101y ago

A package which includes the cool artwork would be awesome

xena1y ago

You mean with the art assets extracted?

  $ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static

appleaday11y ago

Nice will try to deploy to my sites after I eat some mac and cheese

matt32101y ago

Very nice work!

anubiskhan1y ago

I approve.

perching_aix1y ago

> Sadly, you must enable JavaScript to get past this challenge. This is required because AI companies have changed the social contract around how website hosting works. A no-JS solution is a work-in-progress.

Will be interested to hear of that. In the meantime, at least I learned of JShelter.

Edit:

Why not use the passage of time as the limiter? I guess it would still require JS though, unless there's some hack possible with CSS animations, like request an image with certain URL params only after an animation finishes.

This does remind me how all of these additional hoops are making web browsing slow.

Edit #2:

Thinking even more about it, time could be made a hurdle by just.. slowly serving incoming requests. No fancy timestamp signing + CSS animations or whatever trickery required.

I'm also not sure if time would make at-scale scraping as much more expensive as PoW does. Time is money, sure, but that much? Also, the UX of it I'm not sold on, but could be mitigated somewhat by doing news website style "I'm only serving the first 20% of my content initially" stuff.

So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

Edit #3:

Turning reasoning on for a moment, this whole thing is a bit iffy.

First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training. In principle, this is nonsensical. The goal of sharing information with the general public (so, people) involves said information eventually traversing through a non-technological medium (air, as light), to reach a non-technological entity (a person). This means that any technological measure will be limited to before that medium, and won't be able to affect said target either. Put differently, I can rote copy your website out into a text editor, or hold up a camera with OCR and scan the screen, if scale is needed.

So in principle we're definitely hosed, but in practice you can try to hold onto the modality of "scraping for AI training" by leveraging the various technological fingerprints of such activity, which is how we get to at-scale PoW. But then this also combats any other kind of at-scale scraping, such as search engines. You could whitelist specific search engines, but then you're engaging in anti-competitive measures, since smaller third party search engines now have to magically get themselves on your list. And even if they do, they might be lying about being just a search engine, because e.g. Google may scrape your website for search, but will 100% use it for AI training then too.

So I don't really see any technological modality that would be able properly discriminate AI training purposed scraping traffic for you to use PoW or other methods against. You may decide to engage in this regardless based on statistical data, and just live with the negative aspects of your efforts, but then it's a bit iffy.

Finally, what about the energy consumption shaped elephant in the room? Using PoW for this is going basically exactly against the spirit of wanting less energy to be spent on AI and co. That said, this may not be a goal for the author.

The more I think about this, the less sensible and agreeable it is. I don't know man.

Philpax1y ago

> First of all, the goal is that a website operator would be able to control the use of information they disseminate to the general public via their website, such that it won't be used specifically for AI training.

This isn't the goal; the goal is to punish/demotivate poorly-behaved scrapers that hammer servers instead of moderating their scraping behaviour. At least a few of the organisations deploying Anubis are fine with having their data scraped and being made part of an AI model.

They just don't like having their server being flooded with non-organic requests because the people making the scrapers have enough resources that they don't have to care that they're externalising the costs of their malfeasance on the rest of the internet.

perching_aix1y ago

Ah, thanks for the clarification. I guess then it pulling a double duty against all scraping in general is not a flaw either.

abetusk1y ago

Time delay, as you proposed, is easily defeated by concurrent connections. In some sense, you're sacrificing latency without sacrificing throughput.

A bot network can make many connections at once, waiting until the timeout to get the entirety of their (multiple) request(s). Every serial delay you put in is a minor inconvenience to a bot network, since they're automated anyway, but a degrading experience for good faith use.

Time delay solutions get worse for services like posting, account creation, etc. as they're sidestepped by concurrent connections that can wait out the delay to then flood the server.

Requiring proof-of-work costs the agent something in terms of resources. The proof-of-work certificate allows for easy verification (in terms of compute resources) relative to the amount of work to find the certificate in the first place.

A small resource tax on agents has minimal effect on everyday use but has compounding effect for bots, as any bot crawl now needs resources that scale linearly with the number of pages that it requests. Without proof-of-work, the limiting resource for bots is network bandwidth, as processing page data is effectively free relative to bandwidth costs. By requiring work/energy expenditure to requests, bots now have a compute as a bottleneck.

As an analogy, consider if sending an email would cost $0.01. For most people, the number of emails sent over the course of a year could easily cost them less than $20.00, but for spam bots that send email blasts of up to 10k recipients, this now would cost them $100.00 per shot. The tax on individual users is minimal but is significant enough so that mass spam efforts are strained.

It doesn't prevent spam, or bots, entirely, but the point is to provide some friction that's relatively transparent to end users while mitigating abusive use.

marginalia_nu1y ago

You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so is not a deterrent.

It's a shitty solution to an even shittier reality.

xena1y ago

Main author of Anubis here:

Basically what they said. This is a hack, and it's specifically designed to exploit the infrastructure behind industrial-scale scraping. They usually have a different IP address do the scraping for each page load _but share the cookies between them_. This means that if they use headless chrome, they have to do the proof of work check every time, which scales poorly with the rates I know the headless chrome vendors charge for compute time per page.

ArinaS1y ago

Is there any particular date/time you'll introduce a no-JS solution?

And are you going to support older browsers? I tested Anubis with https://www.browserling.com with its (I think) standard configuration at https://git.xeserv.us/xe/anubis-test/src/branch/main/README.... and apparently it doesn't work with Firefox versions before 74 and Chromium versions before 80.

I wonder if it works with something like Pale Moon.

1 more reply

vhcr1y ago

I used to have an ISP that would load balance your connection between different providers, this meant that pretty much every single request would use a different IP. I know it's not that common, but that would mean real users would find pages using anubis unusable.

lifthrasiir1y ago

Do you think that, if this behavior of Anubis gets well-known and Anubis cookies are specifically handled to avoid pathological PoW checks, does Anubis need a significant rework? Because if it's indeed true this hack wouldn't last much longer and I have no further idea to avoid user-visible annoyances.

1 more reply

specialist1y ago

> Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

Based on the comments here, it seems like many people are struggling with the concept.

Would calling Anubis a "client-side rate limiter" be accurate (enough)?

1 more reply

1oooqooq1y ago

tries to block abusive companies using infinite money glitch from clueless investors, by making every request cost a few fractions of a cent more.

... yeah, that will totally work.

apt-apt-apt-apt1y ago

Since Anubis is related to AI, the part below read as contradictory at first. As if too many donations would cause the creator to disappear off to Tahiti along with the product development.

"If you are using Anubis .. please donate on Patreon. I would really love to not have to work in generative AI anymore..."

mentalgear1y ago

Seems like a great idea, but I'd be nice if the project had a simple description. (and not use so much anime, as it gives an unprofessional impression)

This is what it actually does: Instead of only letting the provider bear the cost of content hosting (traffic, storage), the client also bears costs when accessing in form of computation. Basically it runs additional expansive computation on the client, which makes accessing 1000s of your webpages at high interval expansive for crawlers.

> Anubis uses a proof of work in order to validate that clients are genuine. The reason Anubis does this was inspired by Hashcash, a suggestion from the early 2000's about extending the email protocol to avoid spam. The idea is that genuine people sending emails will have to do a small math problem that is expensive to compute, but easy to verify such as hashing a string with a given number of leading zeroes. This will have basically no impact on individuals sending a few emails a week, but the company churning out industrial quantities of advertising will be required to do prohibitively expensive computation.

userbinator1y ago

Yes, it just worked to stop me, an actual human, from seeing what you wanted to say... and I'm not interested enough to find a way around it that doesn't involve cozying up to Big Browser. At least CloudFlare's discrimination can be gotten around without JS.

Wouldn't it be ironic if the amount of JS served to a "bot" costs even more bandwidth than the content itself? I've seen that happen with CF before. Also keep in mind that if you anger the wrong people, you might find yourself receiving a real DDoS.

If you want to stop blind bots, perhaps consider asking questions that would easily trip LLMs but not humans. I've seen and used such systems for forum registrations to prevent generic spammers, and they are quite effective.

userbinator1y ago

Looks like I struck a nerve. Big Browser, hello ;-)

1 more reply

j / k navigate · click thread line to collapse

208 comments

gnabgib1y ago

Related Anubis: Proof-of-work proxy to prevent AI crawlers (100 points, 23 days ago, 58 comments) https://news.ycombinator.com/item?id=43427679

raggi1y ago

It's amusing that Xe managed to turn what was historically mostly a joke/shitpost into an actually useful product. They did always say timing was everything.

I am kind of surprised how many sites seem to want/need this. I get the slow git pages problem for some of the git servers that are super deep, lack caches, serve off slow disks, etc.

adrian171y ago

> but of course doing so requires some expertise, and expertise still isn't cheap

raggi1y ago

jtbayly1y ago

mrweasel1y ago

> I am kind of surprised how many sites seem to want/need this.

MrJohz1y ago

But that's just a working theory without much evidence trying to justify why I'm hearing so many people talking about struggling with AI bot traffic and not seeing it myself.

nicolapcweek941y ago

mrweasel1y ago

2 more replies

cedws1y ago

PoW anti-bot/scraping/DDOS was already being done a decade ago, I’m not sure why it’s only catching on now. I even recall a project that tried to make the PoW useful.

xena1y ago

And then the universe blessed me with a natural 20. Never had these problems before. This shit is wild.

underdeserver1y ago

Squeeze that lemon as far as it'll go mate, god speed and may the good luck continue.

gyomu1y ago

If you’re confused about what this is - it’s to prevent AI scraping.

> Anubis uses a proof-of-work challenge to ensure that clients are using a modern browser and are able to calculate SHA-256 checksums

https://anubis.techaro.lol/docs/design/how-anubis-works

This is pretty cool, I have a project or two that might benefit from it.

x3haloed1y ago

xboxnolifes1y ago

> As long as a given client is not abusing your systems, then why do you care if the client is a human?

And these bots aren't some being with rights. They're tools unleashed by humans. It's humans abusing the systems. These are anti-abuse measures.

immibis1y ago

Then you look up their IP address's abuse contact, send an email and get them to either stop attacking you or get booted off the internet so they can't attack you.

And if that doesn't happen, you go to their ISP's ISP and get their ISP booted off the Internet.

Anyone know if it's legal to return a never-ending stream of /dev/urandom based on the user-agent?

4 more replies

bbor1y ago

Well, that's the meta-rub: if they're abusing, block abuse. Rate limits are far simpler, anyway!

1 more reply

t-writescode1y ago

> I personally can’t think of a good reason to specifically try to gate bots

ronsor1y ago

I'm genuinely skeptical that those are all real LLM scrapers. For one, a lot of content is in CommonCrawl and AI companies don't want to redo all that work when they can get some WARC files from AWS.

I'm largely suspecting that these are mostly other bots pretending to be LLM scrapers. Does anyone even check if the bots' IP ranges belong to the AI companies?

4 more replies

praptak1y ago

The good thing about proof of work is that it doesn't specifically gate bots.

fc417fc8021y ago

Things like google are still possible. Operators would need to whitelist services.

brikym1y ago

gbear6051y ago

laserbeam1y ago

The reason is: bots DO spam you repeatedly and increase your network costs. Humans don’t abuse the same way.

starkrights1y ago

1 more reply

mieses1y ago

There is hope for misguided humans.

namanyayg1y ago

"It also uses time as an input, which is known to both the server and requestor due to the nature of linear timelines"

A funny line from his docs

xena1y ago

OMG lol I forgot that I left that in. Hilarious. I think I'm gonna keep it.

didgeoridoo1y ago

I didn’t even blink at this, my inner monologue just did a little “well, naturally” in a Redditor voice and kept reading.

mkl1y ago

BTW Xe, https://xeiaso.net/pronouns is 404 since sometime last year, but it is still linked to from some places like https://xeiaso.net/blog/xe-2021-08-07/ (I saw "his" above and went looking).

xena1y ago

I'm considering making it come back, but it's just gotten me too much abuse so I'm probably gonna leave it 404-ing until society is better.

2 more replies

pie_flavor1y ago

AnonC1y ago

Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute! (I’ve always found all the art and the characters in Xe’s blog very beautiful)

> Installing and using this will likely result in your website not being indexed by some search engines. This is considered a feature of Anubis, not a bug.

> This is a bit of a nuclear response, but AI scraper bots scraping so aggressively have forced my hand.

[1]: https://github.com/TecharoHQ/anubis/

JsonCameron1y ago

Yeah. Unfortunately at the current moment it does prevent indexing. Perhaps down the line we can whitelist search engines ips. However some like google, use the same for the AI and search indexing.

We are still making some improvements like passing open graph tags through so at least rich previews work!

snvzz1y ago

>Those images on the interstitial page(s) while waiting for Anubis to complete its check are so cute!

Love them too, and abhor knowing that someone is bound to eventually remove them because found to be "problematic" in one way or another.

pohuing1y ago

[1] https://discourse.gnome.org/t/anime-girl-on-gnome-gitlab/276...

prologic1y ago

xyzzy_plugh1y ago

Simply put you risk blocking legitimate traffic. This solution does as well but for most humans the actual risk is much lower.

As much as I'd love to not need JavaScript and to support users who run with it disabled, I've never once had a customer or end user complain about needing JavaScript enabled.

prologic1y ago

This is true. I had some bad actors from the ComCast Network at one point. And unfortunately also valid human users of some of my "things". So I opted not to block the ComCast ASN at that point.

xyzzy_plugh1y ago

Exactly. We've all been down this rabbit hole, collectively, and that's why Anubis has taken off. It works shockingly well.

1 more reply

prologic1y ago

Cyphase1y ago

For anyone wondering, Oracle holds the trademark for "JavaScript": https://javascript.tm/

prologic1y ago

Which arguably they should let go of

jadbox1y ago

How do you know it's an LLM and not a VPN? How do you use this MaxMind's database to isolate LLMs?

prologic1y ago

I don't distinguish actually. There are two things I do normally:

- Block Bad Bots. There's a simple text file called `bad_bots.txt` - Block Bad ASNs. There's a simple text file called `bad_asns.txt`

There's also another for blocking IP(s) and IP-ranges called `bad_ips.txt` but it's often more effective to block an much larger range of IPs (At the ASN level).

To give you an concrete idea, here's some examples:

$ cat etc/caddy/waf/bad_asns.txt # CHINANET-BACKBONE No.31,Jin-rong Street, CN # Why: DDoS 4134

# CHINA169-BACKBONE CHINA UNICOM China169 Backbone, CN # Why: DDoS 4837

# CHINAMOBILE-CN China Mobile Communications Group Co., Ltd., CN # Why: DDoS 9808

# FACEBOOK, US # Why: Bad Bots 32934

# Alibaba, CN # Why: Bad Bots 45102

# Why: Bad Bots 28573

runxiyu1y ago

Do you have a link to your own solution?

JsonCameron1y ago

prologic1y ago

Not yet unfortunately. But if you're interested, please reach out! I currently run it in a 3-region GeoDNS setup with my self-hosted infra.

roenxi1y ago

pona-a1y ago

There's Cloudflare PrivacyPass that became an IETF standard [0], but it's rather weird, and the reference implementation is a bug nest.

[0] https://datatracker.ietf.org/wg/privacypass/about/

fc417fc8021y ago

Ship an arbitrary challenge as a SPIR-V or MLIR black box. Integrate the challenge-response exchange with HTTP. That should permit broad support and flexible hardware acceleration.

The "good enough" solution is the existing and widely used SHA( seed, nonce ). That could easily be integrated into a lower level of the stack if the tech giants wanted it.

tripdout1y ago

The bot detection takes 5 whole seconds to solve on my phone, wow.

bogwog1y ago

I'm using Fennec (a Firefox fork on F-Droid) and a Pixel 9 Pro XL, and it takes around ~8 seconds at difficulty 4.

Personally, I don't think the UX is that bad since I don't have to do anything. I definitely prefer it to captchas.

Hakkin1y ago

Much better than infinite Cloudflare captcha loops.

gruez1y ago

I've never had that, even with something like tor browser. You must be doing something extra suspicious like an user agent spoofer.

praisewhitey1y ago

Firefox with Enhanced Tracking Protection turned on is enough to trigger it.

2 more replies

megous1y ago

Proper response here is "fuck cloudflare", instead of blaming the user.

1 more reply

xena1y ago

oynqr1y ago

Lucky. Took 30s for me.

nicce1y ago

For me it is like 0.5s. Interesting.

cookiengineer1y ago

I am currently building a prototype of what I call the "enigma webfont" where I want to implement user sessions with custom seeds / rotations for a served and cached webfont.

This would allow to effectively create a captcha that would modify the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrott", for example.

I sure would love to hear your input, Xe. Maybe we can combine our efforts?

My tech stack is go, though, because it was the only language where I could easily change the webfont files directly without issues.

lifthrasiir1y ago

Besides from the obvious accessibility issue, wouldn't that be a substitution cipher at best? Enough corpus should render its cryptanalysis much easier.

cookiengineer1y ago

Well, the idea is basically the same as using AES-CBC. CBC is useless most of the time because of static rotations, but it makes cracking it more expensive.

We have standards for data communication, it's just that none of these vibe coders gives a damn about building semantically correct HTML and parsers for RDF, microdata etc.

lifthrasiir1y ago

1 more reply

creata1y ago

There's probably something horrific you can do with TrueType to make it more complex than a substitution cipher.

lifthrasiir1y ago

cookiengineer1y ago

The hint I want to give you is: unicode and ligatures :) they're awesome in the worst sense. Words can be ligatures, too, btw.

rollcat1y ago

The problem isn't as much that the websites are scraped (search engines have been doing this for over three decades), it's the request volume that brings the infrastructure down and/or costs up.

> I sure would love to hear your input, Xe. Maybe we can combine our efforts?

From what I've gathered, they need help in making this project more sustainable for the near and far future, not to add more features. Anubis seems to be doing an excellent job already.

pabs31y ago

It works to block users who have JavaScript disabled, that is for sure.

udev40961y ago

Exactly, it's a really poor attempt to make it appealing to the larger audience. Unless they roll out a version for nojs, they are the same as "AI" scrapers on enshittyfying the web

throwaway1501y ago

Looks cool. But please help me understand. What's to stop AI companies from solving the challenge, completing the proof of work and scrape websites anyway?

crq-yml1y ago

marginalia_nu1y ago

Nathanba1y ago

ndiddy1y ago

2 more replies

Hakkin1y ago

FridgeSeal1y ago

charcircuit1y ago

If you make it prohibitively expensive almost no regular user will want to wait for it.

xboxnolifes1y ago

Regular users usually aren't page hopping 10 pages per second. A regular user is usually 100 times less than that.

1 more reply

bobmcnamara1y ago

Exponential backoff!

ndiddy1y ago

dragonwriter1y ago

It works in the short term, but the more people that use it, the more likely that scrapers start running full browsers.

SuperNinKenDo1y ago

sadeshmukh1y ago

Which are more expensive - you can't run as many especially with Anubis

perching_aix1y ago

Nothing. The idea instead that at scale the expenses of solving the challenges becomes too great.

userbinator1y ago

SuperNinKenDo1y ago

I'll take this over a Captcha any day.

userbinator1y ago

CAPTCHAs don't need JS, nor does asking a question that an LLM can't answer but a human can.

Proof-of-work selects for those with the computing power and resources to do it. Bitcoin and all the other cryptocurrencies show what happens when you place value on that.

1 more reply

ronsor1y ago

I know companies that already solve it.

wredcoll1y ago

I mean... knowing how to solve it isn't the trick, it's doing it a million times a minute for your firehose scraper.

udev40961y ago

1 more reply

creata1y ago

Why is spending all that CPU time to scrape the handful of sites that use Anubis worth it to them?

vhcr1y ago

mushufasa1y ago

I looked through the documentation and I've come across a couple sites using this already.

I'm sure Xe has already considered this. Do they have a blog post about this anywhere?

snvzz1y ago

My Amiga 1200 hates these tools.

It is really sad that the worldwide web has been taken to the point where this is needed.

pabs31y ago

Recently I heard of a site blocking bot requests with a message telling the bot to download the site via Bittorrent instead.

Seems like a good solution to the badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like Bittorrent anyway.

seba_dos11y ago

Even if these stupid bots would just learn to clone git repos instead of crawling through GitLab UI pages it would already be helpful.

deknos1y ago

I wish, there was also an tunnel software (client+server) where

Does anyone know something like that? I am tempted to ask xeiaso to add such features, but i do not think his tool is meant for that...

rollcat1y ago

Your requirements are quite specific, and HTTP servers are built to be generic and flexible. You can probably put something together with nginx and some Lua, aka OpenResty: <https://openresty.org/>

> his

I believe it's their.

deknos1y ago

ups, yes, sorry, their.

immibis1y ago

Tor's Webtunnel?

deknos1y ago

but i do not want to go OVER tor, i just want a service over clearnet? or is this something else? do you have an URL?

immibis1y ago

I presume the protocol can be separated from Tor itself and I also presume this standalone thing doesn't exist yet.

dmtfullstack1y ago

Humans are served by bots. Any bot requesting traffic is doing so on behalf of a human somewhere.

What is the problem with bots asking for traffic, exactly?

We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.

Tarq0n1y ago

Ok? Not everyone has the same resources or technical sophistication.

udev40961y ago

relistan1y ago

So the point is not to be faster than the bear. It’s to be faster than your fellow campers.

genewitch1y ago

Why not have them hash pow for btc then?

sprremix1y ago

Why must everything involve $'s?

1 more reply

matt32101y ago

A package which includes the cool artwork would be awesome

xena1y ago

You mean with the art assets extracted?

  $ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static

appleaday11y ago

Nice will try to deploy to my sites after I eat some mac and cheese

matt32101y ago

Very nice work!

anubiskhan1y ago

I approve.

perching_aix1y ago

Will be interested to hear of that. In the meantime, at least I learned of JShelter.

Edit:

This does remind me how all of these additional hoops are making web browsing slow.

Edit #2:

Thinking even more about it, time could be made a hurdle by just.. slowly serving incoming requests. No fancy timestamp signing + CSS animations or whatever trickery required.

So yeah, will be curious to hear the non-JS solution. The easy way out would be a browser extension, but then it's not really non-JS, just JS compartmentalized, isn't it?

Edit #3:

Turning reasoning on for a moment, this whole thing is a bit iffy.

The more I think about this, the less sensible and agreeable it is. I don't know man.

Philpax1y ago

perching_aix1y ago

Ah, thanks for the clarification. I guess then it pulling a double duty against all scraping in general is not a flaw either.

abetusk1y ago

Time delay, as you proposed, is easily defeated by concurrent connections. In some sense, you're sacrificing latency without sacrificing throughput.

Time delay solutions get worse for services like posting, account creation, etc. as they're sidestepped by concurrent connections that can wait out the delay to then flood the server.

It doesn't prevent spam, or bots, entirely, but the point is to provide some friction that's relatively transparent to end users while mitigating abusive use.

marginalia_nu1y ago

You basically need proof-of-work to make this work. Idling a connection is not computationally expensive, so is not a deterrent.

It's a shitty solution to an even shittier reality.

xena1y ago

Main author of Anubis here:

ArinaS1y ago

Is there any particular date/time you'll introduce a no-JS solution?

I wonder if it works with something like Pale Moon.

1 more reply

vhcr1y ago

lifthrasiir1y ago

1 more reply

specialist1y ago

> Weigh the soul of incoming HTTP requests using proof-of-work to stop AI crawlers

Based on the comments here, it seems like many people are struggling with the concept.

Would calling Anubis a "client-side rate limiter" be accurate (enough)?

1 more reply

1oooqooq1y ago

tries to block abusive companies using infinite money glitch from clueless investors, by making every request cost a few fractions of a cent more.

... yeah, that will totally work.

apt-apt-apt-apt1y ago

Since Anubis is related to AI, the part below read as contradictory at first. As if too many donations would cause the creator to disappear off to Tahiti along with the product development.

"If you are using Anubis .. please donate on Patreon. I would really love to not have to work in generative AI anymore..."

mentalgear1y ago

Seems like a great idea, but I'd be nice if the project had a simple description. (and not use so much anime, as it gives an unprofessional impression)

userbinator1y ago

Looks like I struck a nerve. Big Browser, hello ;-)

1 more reply

j / k navigate · click thread line to collapse