¹: https://vercel.com/templates/vercel-firewall/block-ai-bots-f...
²: https://vercel.com/changelog/faster-transformations-and-redu...
1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.
I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.
2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.
We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
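For anyone curious, a minimal sketch of what that looks like on an SSR route. The header values below are illustrative assumptions, not recommendations; tune s-maxage to your deploy cadence:

```typescript
// Sketch: Cache-Control for SSR pages served through a CDN like Cloudflare.
// s-maxage: how long the CDN may cache the response.
// stale-while-revalidate: how long it may serve a stale copy while
// refetching in the background.
export const CDN_CACHE_CONTROL =
  "public, s-maxage=3600, stale-while-revalidate=86400";

// Works with anything exposing setHeader, e.g. the `res` object available
// in a Next.js getServerSideProps context.
export function setCdnCacheHeaders(res: {
  setHeader(name: string, value: string): void;
}): void {
  res.setHeader("Cache-Control", CDN_CACHE_CONTROL);
}
```

With this, the CDN (not ISR) controls revalidation, so a redeploy doesn't wipe the cached pages.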
I would recommend using Thumbor instead: https://thumbor.readthedocs.io/en/latest/. You could have ChatGPT write up a React image wrapper pretty quickly for this.
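For context, Thumbor encodes transformations into the URL path. The core of such a wrapper is just a URL builder; a sketch below, using Thumbor's unsigned "unsafe" mode for illustration (production setups should HMAC-sign the path instead, and the server URL is hypothetical):

```typescript
// Build a Thumbor URL in "unsafe" (unsigned) mode: /unsafe/{W}x{H}/{source}.
// In production you'd sign the path with your Thumbor security key
// rather than enabling /unsafe/.
export function thumborUrl(
  server: string, // e.g. "https://thumbor.example.com" (hypothetical)
  width: number,
  height: number,
  src: string
): string {
  return `${server}/unsafe/${width}x${height}/${encodeURIComponent(src)}`;
}

// A React <img> wrapper would then just compute its src via thumborUrl.
```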
> On Feb 18, 2025, just a few days after we published this blog post, Vercel changed their image optimization pricing. With the new pricing we'd not have faced a huge bill.
Works great for us
The lengths to which many devs will go to avoid learning server management (or SQL).
At my last job we resized a very large number of images every day, and did so for significantly cheaper (a fraction of a cent per thousand images).
Am I missing something here?
Millions of episodes; of course they will be visited and the optimization will run.
The site was originally secondary to our business and was built by a contractor. We didn't pay much attention to it until we added the episode pages and the bots discovered them.
I saw a lot of disparaging comments here. It's definitely our fault for not understanding the implications of what the code was doing. We didn't mention the contractor in the post, because we didn't want to throw them under the bus. The accountability is all ours.
Can't the `convert` CLI tool resize images? Can that not be used here instead?
You don't need one. You can fetch RSS feeds directly on mobile devices; it's faster, less work to maintain, and has a smaller attack surface for rogue bots.
I am curious: what do you do with the feeds that can't be done in a client-side app? An aggregation across all users or a recommendation system is one thing, but even that could be done via the clients sending analytics data back to the servers.
One of them was pretending to be a very specific version of Microsoft Edge, coming from an Alibaba datacenter. Suuuuuuuuuuuuuuuuuure. Blocked its IP range and about ten minutes later a different subnet was hammering away again. I ended up just blocking based off the first two octets; the client didn't care, none of their visitors are from China.
All of this was sailing right through Cloudflare.
This can be done in two lines in nginx, which is not just a common web server but is also widely used as an API gateway or reverse proxy.
You can rate limit by IP pretty aggressively without affecting human traffic.
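Roughly like this, assuming nginx's limit_req module (the zone size, rate, and burst values below are illustrative, not tuned):

```nginx
# Track request rate per client IP; a 10 MB zone holds on the order of 160k IPs.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    location / {
        # Allow short bursts from real users; reject the excess immediately.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
    }
}
```

Real users rarely sustain 5 requests/second, so a limit like this mostly hits the hammering bots.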
Interestingly, the OpenAI crawler visited over 1,000 times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches within ChatGPT. Not a single referred visitor, though. Makes me wonder whether it's at all beneficial for content publishers to allow bot crawls.
I ended up banning every SEO bot in robots.txt and a bunch of other bots
Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?
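Something like this, as a heuristic only: the probe assumes the standard navigator.gpu entry point, and absence of WebGPU has to be treated as a weak signal, since plenty of real browsers (older Safari, Firefox) don't expose it either.

```typescript
// Heuristic probe: headless scraping setups often lack GPU-backed APIs.
// Takes a navigator-like object so it can be exercised outside a browser.
type NavigatorLike = { gpu?: unknown };

export function lacksWebGpu(nav: NavigatorLike): boolean {
  // In a real browser you'd pass `navigator`; `gpu` is the WebGPU entry point.
  return nav.gpu === undefined;
}
```

You'd feed a signal like this into a challenge, not a hard block.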
For the crawl problem, I want to wait and see whether robots.txt proves enough to stop GenAI bots from crawling, since I confidently believe these GenAI companies are far too "well-behaved" to respect robots.txt.
User-agent: *
Crawl-Delay: 20
Clear enough. Google, Bing, and others respect the limits, and while about half my traffic is bots, they never DoS the site.
When a very well-known AI bot crawled my site in August, they fired up everything: fail2ban put them temporarily in jail multiple times, the nginx per-IP request limit was serving 426 and 444 to more than half of their requests (but they kept hammering the same URLs), and some human users contacted me to complain that the site was returning 503s. I had to block the bot's IPs at the firewall. They ignore robots.txt (if they even read it).
So Metacast generates bot traffic on other websites, presumably to "borrow" their content and serve it to their own users, but they don't like it when others do the same to them.
I'd encourage you to read up on how the podcast ecosystem works.
Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.
All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.
This is how podcasts have been distributed since Apple introduced them in the early 2000s, and it's why podcasting remains an open, decentralized ecosystem.
It's crazy how these companies fleece customers who don't know any better. Is there even a way to tell Vercel, "I only want to spend $10 a month max on this project; CUT ME OFF if I go past it"?
I spend $12 a month on BunnyCDN. And $9 a month on BunnyCDN's image optimizer that allows me to add HTTP params to the url to modify images.
1.33TB of CDN traffic. (PS: can't say enough good things about BunnyCDN; such a cool company, it does exactly what you pay for, nothing more, nothing less.)
This is nuts dude
Yes actually, there's a lot to complain about with Vercel but to their credit they do offer both soft and hard spending limits, unlike most other newfangled clouds.
OTOH god help you if you're on Netlify, there you're looking at $0.55/GB with unbounded billing...
(I work at Vercel). Yes, there are soft and hard spend limits. OP was using this feature, it's called "spend management": https://vercel.com/docs/spend-management
We sure do, though we were initially confused by the wording. We thought "stop deployment" meant that we wouldn't be able to deploy. So, we had it turned off initially.
@leerob helped us figure it out on the Vercel subreddit, then we turned it on.
And that's not much for simple thumbnails, either. So sad that the trend of "fullstack" engineers who are really just frontend JS/TS devs took off, leaving thousands of companies with no clue at all about how to serve websites or do backend and server engineering...
History repeats itself...
Your point is they shouldn't?
Source:
We're going to put Cloudflare in front of our Vercel site and control cache for SSR pages with Cache-Control headers.
At my last startup, we ran Astro.js behind CloudFront and were able to serve a pretty large volume of public-facing traffic from just two server nodes with three-tiered caching (Redis for data caching, application-level output caching, and CloudFront, with CloudFront doing a lot of the heavy lifting).
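The data-caching tier in a stack like that usually boils down to a read-through pattern. A minimal sketch, with an in-memory Map standing in for Redis:

```typescript
// Read-through cache: return a hit if present, otherwise fetch from the
// origin (database, upstream API) and populate the cache on the way out.
export class ReadThroughCache<T> {
  private store = new Map<string, T>();
  private misses = 0;

  constructor(private fetchOrigin: (key: string) => T) {}

  get(key: string): T {
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    this.misses++; // origin fetch: the expensive path the cache avoids
    const value = this.fetchOrigin(key);
    this.store.set(key, value);
    return value;
  }

  get missCount(): number {
    return this.misses;
  }
}
```

With CloudFront in front and output caching in the middle, most requests never even reach this tier.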
I can assure you it is not an ad for Vercel.
Will do nothing to mitigate the problem. As is well known, these bots don't respect it.
I've addressed this topic in another comment above and will copy it here.
I'd encourage you to read up on how the podcast ecosystem works.
Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.
All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.
This is how podcasts have been distributed since Apple introduced them in the early 2000s, and it's why podcasting remains an open, decentralized ecosystem.
[disclaimer: I run https://pure.md, which helps websites shield from this traffic]
Why?
There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.
The LLM SEO game has only just begun. Things will only go downhill from here.
The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.
What for? Why would I serve anything to these leeches?
And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.
That would be so awesome. The future is amazing.
Only script kiddies run into problems at such low numbers. I'm sure security is your next "misconfiguration". Better to look for an offline job in the entertainment industry.
Some context here.
I was actually a PM at Google & AWS, not an engineer. Even though I have a CS degree, I had not been professionally coding for almost 20 years. This is my comeback to software development and I've got a lot to catch up on. Hope this sets the stage appropriately.
I mentioned in an earlier comment that we didn't actually build the site, and it was on the back burner until we added episode pages and got hit by the costs. It's a lesson learned indeed, and we're now treating the website as a first-class citizen in our stack.