¹: https://vercel.com/templates/vercel-firewall/block-ai-bots-f...
²: https://vercel.com/changelog/faster-transformations-and-redu...
1) Our biggest issue right now is unidentified crawlers with user agents resembling regular users. We get hundreds of thousands of requests from those daily and I'm not sure how to block them on Vercel.
I'd love them to be challenged. If a bot doesn't identify itself, we don't want to let it in.
2) While we fixed the Image Optimization part and optimized caching, we're now struggling with ISR Write costs. We deploy often and the ISR cache is reset on each deploy.
We are about to put Cloudflare in front of the site, so that we can set Cache-Control headers and cache SSR pages (rather than using ISR) independently.
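For anyone curious, a minimal sketch of what that looks like on an SSR route. The header values below are illustrative assumptions, not recommendations; tune s-maxage to your deploy cadence:

```typescript
// Sketch: Cache-Control for SSR pages served through a CDN like Cloudflare.
// s-maxage: how long the CDN may cache the response.
// stale-while-revalidate: how long it may serve a stale copy while
// refetching in the background.
export const CDN_CACHE_CONTROL =
  "public, s-maxage=3600, stale-while-revalidate=86400";

// Works with anything exposing setHeader, e.g. the `res` object available
// in a Next.js getServerSideProps context.
export function setCdnCacheHeaders(res: {
  setHeader(name: string, value: string): void;
}): void {
  res.setHeader("Cache-Control", CDN_CACHE_CONTROL);
}
```

With this, the CDN (not ISR) controls revalidation, so a redeploy doesn't wipe the cached pages.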
I would recommend using Thumbor instead: https://thumbor.readthedocs.io/en/latest/. You could have ChatGPT write up a React image wrapper pretty quickly for this.
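For context, Thumbor encodes transformations into the URL path. The core of such a wrapper is just a URL builder; a sketch below, using Thumbor's unsigned "unsafe" mode for illustration (production setups should HMAC-sign the path instead, and the server URL is hypothetical):

```typescript
// Build a Thumbor URL in "unsafe" (unsigned) mode: /unsafe/{W}x{H}/{source}.
// In production you'd sign the path with your Thumbor security key
// rather than enabling /unsafe/.
export function thumborUrl(
  server: string, // e.g. "https://thumbor.example.com" (hypothetical)
  width: number,
  height: number,
  src: string
): string {
  return `${server}/unsafe/${width}x${height}/${encodeURIComponent(src)}`;
}

// A React <img> wrapper would then just compute its src via thumborUrl.
```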
> On Feb 18, 2025, just a few days after we published this blog post, Vercel changed their image optimization pricing. With the new pricing we'd not have faced a huge bill.
Works great for us
The lengths to which many devs will go to avoid learning server management (or SQL).
At my last job we resized a very large number of images every day, and did so for significantly cheaper (a fraction of a cent per thousand images).
Am I missing something here?
Millions of episodes; of course they will be visited and the optimization will run.
The site was originally secondary to our business and was built by a contractor. We didn't pay much attention to it until we added the episode pages and the bots discovered them.
I saw a lot of disparaging comments here. It's definitely our fault for not understanding the implications of what the code was doing. We didn't mention the contractor in the post, because we didn't want to throw them under the bus. The accountability is all ours.
Can't the `convert` CLI tool resize images? Can that not be used here instead?
You don't need one. You can fetch RSS feeds directly on mobile devices; it's faster, less work to maintain, and has a smaller attack surface for rogue bots.
I am curious: what do you do with the feeds that can't be done in a client-side app? An aggregation across all users or a recommendation system is one thing, but even that could be done via the clients sending analytics data back to the servers.
One of them was pretending to be a very specific version of Microsoft Edge, coming from an Alibaba datacenter. Suuuuuuuuuuuuuuuuuure. Blocked its IP range and about ten minutes later a different subnet was hammering away again. I ended up just blocking based off the first two octets; the client didn't care, none of their visitors are from China.
All of this was sailing right through Cloudflare.
This can be done in two lines in nginx, which is not just a common web server but is also widely used as an API gateway or reverse proxy.
You can rate limit by IP pretty aggressively without affecting human traffic.
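Roughly like this, assuming nginx's limit_req module (the zone size, rate, and burst values below are illustrative, not tuned):

```nginx
# Track request rate per client IP; a 10 MB zone holds on the order of 160k IPs.
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    location / {
        # Allow short bursts from real users; reject the excess immediately.
        limit_req zone=perip burst=20 nodelay;
        limit_req_status 429;
    }
}
```

Real users rarely sustain 5 requests/second, so a limit like this mostly hits the hammering bots.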
Interestingly, the OpenAI crawler visited over 1,000 times, many of them as "ChatGPT-User/1.0", which is supposed to be used when a user searches within ChatGPT. Not a single referred visitor, though. Makes me wonder whether it's at all beneficial for content publishers to allow bot crawls.
I ended up banning every SEO bot in robots.txt and a bunch of other bots
Some of the LLM bots will switch to user agent headers that match real browsers if blocked outright.
If they're running JS or WASM, can the JS run a few calls likely to break (e.g., something in the WebGPU API set, since they likely aren't paying for GPUs in their scraping farm)?
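Something like this, as a heuristic only: the probe assumes the standard navigator.gpu entry point, and absence of WebGPU has to be treated as a weak signal, since plenty of real browsers (older Safari, Firefox) don't expose it either.

```typescript
// Heuristic probe: headless scraping setups often lack GPU-backed APIs.
// Takes a navigator-like object so it can be exercised outside a browser.
type NavigatorLike = { gpu?: unknown };

export function lacksWebGpu(nav: NavigatorLike): boolean {
  // In a real browser you'd pass `navigator`; `gpu` is the WebGPU entry point.
  return nav.gpu === undefined;
}
```

You'd feed a signal like this into a challenge, not a hard block.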
For the crawl problem, I want to wait and see whether robots.txt proves enough to stop GenAI bots from crawling, since I confidently believe these GenAI companies are far too "well-behaved" to respect robots.txt.
User-agent: *
Crawl-Delay: 20
Clear enough. Google, Bing, and others respect the limits, and while about half my traffic is bots, they never DoS the site.
When a very well-known AI bot crawled my site in August, they fired up everything: fail2ban put them temporarily in jail multiple times, the nginx per-IP request limit was serving 426 and 444 to more than half of their requests (but they kept hammering the same URLs), and some human users contacted me to complain that the site was returning 503s. I had to block the bot's IPs at the firewall. They ignore robots.txt (if they even read it).
So Metacast generates bot traffic on other websites, presumably to "borrow" their content and serve it to their own users, but they don't like it when others do the same to them.
I'd encourage you to read up on how the podcast ecosystem works.
Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.
All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.
This is how podcasts have been distributed since Apple introduced them in the early 2000s, and it's why podcasting remains an open, decentralized ecosystem.
It's crazy how these companies fleece customers who don't know any better. Is there even a way to tell Vercel, "I only want to spend $10 a month max on this project; CUT ME OFF if I go past it"?
I spend $12 a month on BunnyCDN. And $9 a month on BunnyCDN's image optimizer that allows me to add HTTP params to the url to modify images.
1.33TB of CDN traffic. (PS: can't say enough good things about BunnyCDN; such a cool company, it does exactly what you pay for, nothing more, nothing less.)
This is nuts dude
Yes actually, there's a lot to complain about with Vercel but to their credit they do offer both soft and hard spending limits, unlike most other newfangled clouds.
OTOH god help you if you're on Netlify, there you're looking at $0.55/GB with unbounded billing...
(I work at Vercel). Yes, there are soft and hard spend limits. OP was using this feature, it's called "spend management": https://vercel.com/docs/spend-management
We sure do, though we were initially confused by the wording. We thought "stop deployment" meant that we wouldn't be able to deploy. So, we had it turned off initially.
@leerob helped us figure it out on the Vercel subreddit, then we turned it on.
And that's not much for simple thumbnails, either. So sad that the trend of "fullstack" engineers who are really just frontend JS/TS devs took off, leaving thousands of companies with no clue at all about how to serve websites or do backend and server engineering...
History repeats itself...
Your point is they shouldn't?
Source:
We're going to put Cloudflare in front of our Vercel site and control cache for SSR pages with Cache-Control headers.
At my last startup, we ran Astro.js behind CloudFront and were able to serve a pretty large volume of public-facing traffic from just two server nodes with three-tiered caching (Redis for data caching, application-level output caching, and CloudFront, with CloudFront doing a lot of the heavy lifting).
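The data-caching tier in a stack like that usually boils down to a read-through pattern. A minimal sketch, with an in-memory Map standing in for Redis:

```typescript
// Read-through cache: return a hit if present, otherwise fetch from the
// origin (database, upstream API) and populate the cache on the way out.
export class ReadThroughCache<T> {
  private store = new Map<string, T>();
  private misses = 0;

  constructor(private fetchOrigin: (key: string) => T) {}

  get(key: string): T {
    const hit = this.store.get(key);
    if (hit !== undefined) return hit;
    this.misses++; // origin fetch: the expensive path the cache avoids
    const value = this.fetchOrigin(key);
    this.store.set(key, value);
    return value;
  }

  get missCount(): number {
    return this.misses;
  }
}
```

With CloudFront in front and output caching in the middle, most requests never even reach this tier.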
I can assure you it is not an ad for Vercel.
Will do nothing to mitigate the problem. As is well known, these bots don't respect it.
I've addressed this topic in another comment above and will copy it here.
I'd encourage you to read up on how the podcast ecosystem works.
Podcasts are distributed via RSS feeds hosted all over the internet, but mostly on specialized hosting providers like Transistor, Megaphone, Omny Studio, etc. that are designed to handle huge amounts of traffic.
All podcast apps (literally, all of them) like Apple Podcasts, Spotify, YouTube Music, Overcast, Pocket Casts, etc. constantly crawl and download RSS feeds, artwork images and mp3s from podcast hosts.
This is how podcasts have been distributed since Apple introduced them in the early 2000s, and it's why podcasting remains an open, decentralized ecosystem.
[disclaimer: I run https://pure.md, which helps websites shield from this traffic]
Why?
There's no value to the website for a bot scraping all of their content and then reselling it with no credit or payment to the original author.
The LLM SEO game has only just begun. Things will only go downhill from here.
The ball is in their court. You don’t get to demand civility AFTER being a dick. You apologize and HOPE you’re forgiven.
What for? Why would I serve anything to these leeches?
And what if there was some standard sort of way for robots to tell your site what they're trying to do with some sort of verb like GET, PUT, POST, DELETE etc. They could even use a standard way to name the resource they're trying to interact with. Like a universal resource finder of some kind. You could even use identifiers to be specific! Like /items/ gives you a list of items and /items/1.json gives you data about a specific item.
That would be so awesome. The future is amazing.
Only script kiddies run into problems at such low numbers. I'm sure security is your next "misconfiguration". Better to look for an offline job in the entertainment industry.
Some context here.
I was actually a PM at Google & AWS, not an engineer. Even though I have a CS degree, I had not been professionally coding for almost 20 years. This is my comeback to software development and I've got a lot to catch up on. Hope this sets the stage appropriately.
I mentioned in an earlier comment that we didn't actually build the site, and it was on the back burner until we added episode pages and got hit by the costs. It's a lesson learned indeed, and we're now treating the website as a first-class citizen in our stack.