And it’s always the same - clouds refuse to provide anything more than alerts (that are delayed) and your only option is prayer and begging for mercy.
Followed by people claiming with absolute certainty that it’s literally technically impossible to provide hard capped accounts to tinkerers despite there being accounts like that in existence already (some azure accounts are hardcapped by amount but ofc that’s not loudly advertised).
I've used AWS for about 10 years and am by no means an expert, but I've seen all kinds of ugly cracks and discontinuities in design and operation among the services. AWS has felt like a handful of very good ideas, designed, built, and maintained by completely separate teams, littered by a whole ton of "I need my promotion to VP" bad ideas that build on top of the good ones in increasingly hacky ways.
And in any sufficiently large tech organization, there won't be anyone at a level of power who can rattle cages about a problem like this, who will want to be the one to actually do it. No "VP of Such and Such" will spend their political capital stressing how critical it is that they fix the thing that will make a whole bunch of KPIs go in the wrong direction. They're probably spending it on shipping another hacked-together service with Web2.0-- er. IOT-- er. Blockchai-- er. Crypto-- er. AI before promotion season.
It wasn't when the service was first created. What's intentionally malicious is not fixing it for years.
Somehow AI companies got this right from the get-go. Money up front, no money, no tokens.
It's easy to guess why. Unlike hosting infra bs, inference is a hard cost for them. If they don't get paid, they lose (more) money. And sending stuff to collections is expensive and bad press.
That’s not a completely accurate characterization of what’s been happening. AI coding agent startups like Cursor and Windsurf started by attracting developers with free or deeply discounted tokens, then adjusted the pricing as they figure out how to be profitable. This happened with Kiro too[1] and is happening now with Google’s Antigravity. There’s been plenty of ink spilled on HN about this practice.
[1] disclaimer: I work for AWS, opinions are my own
I dunno, Aurora’s pricing structure feels an awful lot like that. “What if we made people pay for storage and I/O? And we made estimating I/O practically impossible?”
It's someone in a Patagonia vest trying to avoid getting PIP'd.
Unfortunately, that's not correct. A multi-trillion dollar company most absolutely has not just such a person, but many departments with hundreds of people tasked with precisely that, maximizing revenue by exploiting every dark pattern they can possibly think of.
It would be good to provide a factual basis for such a confident contradiction of the GP. This reads as “no, your opinion is wrong because my opinion is right”.
I have budgets set up and alerts through a separate alerting service that pings me if my estimates go above what I've set for a month. But it wouldn't fix a short term mistake; I don't need it to.
The lack of business case is the most likely culprit. "You want to put engineering resources into something that only the $100/mo guys are going to use?"
You might be tempted to think "but my big org will use that", but I can guarantee compliance will shut it down -- you will never be permitted to enable a feature that intentionally causes hard downtime when (some external factor) happens.
It solves the problem of unexpected requests or data transfer increasing your bill across several services.
https://aws.amazon.com/blogs/networking-and-content-delivery...
Does "data transfer" not mean CDN bandwidth here? Otherwise, that price seems two orders of magnitude less than I would expect
[edit: looks like there's no overages but they may force you to flip to the next tier and seems like they will throttle you https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope....]
https://news.ycombinator.com/item?id=45975411
I agree that it’s likely very technically difficult to find the right balance between capping costs and not breaking things, but this shows that it’s definitely possible, and hopefully this signals that AWS is interested in doing this in other services too.
Still sounds kind of ugly.
You can transfer from S3 on a single instance usually as fast as the instance's NIC--100Gbps+
You'd need a synchronous system that checks quotas before each request and for a lot of systems you'd also need request cancellation (imagine transferring a 5TiB file from S3 and your cap triggers at 100GiB--the server needs to be able to receive a billing violation alert in real time and cancel the request)
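A rough sketch of what that per-request check might look like, using the 5TiB transfer and 100GiB cap from above. Everything here is hypothetical (the `Quota` class, `stream`, the 1GiB chunk size), and the quota lives in memory, whereas a real provider would need a shared store that every serving node can hit at line rate, which is exactly the hard part:

```python
from dataclasses import dataclass


class CapExceeded(Exception):
    """Raised when a chunk would push usage past the hard cap."""


@dataclass
class Quota:
    # Hypothetical in-memory quota; stands in for a shared billing store.
    cap_bytes: int
    used_bytes: int = 0

    def reserve(self, n: int) -> None:
        # Synchronous check performed before each chunk is sent.
        if self.used_bytes + n > self.cap_bytes:
            raise CapExceeded(f"cap of {self.cap_bytes} bytes reached")
        self.used_bytes += n


def stream(total_bytes: int, chunk_bytes: int, quota: Quota) -> int:
    """Send total_bytes in chunks; return bytes sent, stopping at the cap."""
    sent = 0
    while sent < total_bytes:
        n = min(chunk_bytes, total_bytes - sent)
        try:
            quota.reserve(n)
        except CapExceeded:
            break  # cancel the request mid-transfer
        sent += n  # stand-in for writing the chunk to the socket
    return sent


GIB = 1024 ** 3
TIB = 1024 ** 4
quota = Quota(cap_bytes=100 * GIB)
sent = stream(total_bytes=5 * TIB, chunk_bytes=1 * GIB, quota=quota)
print(sent // GIB)  # GiB delivered before the cap cancelled the transfer
```

The chunk size is the tradeoff knob: smaller chunks overshoot the cap less but mean more quota lookups per byte served.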
I imagine that for anything capped that's already provided to customers, AWS just estimates and eats the loss
Obviously such a system is possible since IAM/STS mostly do this but I suspect it's a tradeoff providers are reluctant to make
It's easier to waive cost overages than deal with any of that.
AWS is less like your garage door and more like the components to build an industrial-grade blast-furnace - which has access doors as part of its design. You are expected to put the interlocks in.
Without the analogy, the way you do this on AWS is:
1. Set up an SNS topic
2. Set up AWS budget notifications to post to it
3. Set up a lambda that subscribes to the SNS topic
And then in the lambda you can write your own logic which is smart: shut down all instances except for RDS, allow current S3 data to remain there but set the public bucket to now be private, and so on.
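For the lambda part, a minimal sketch of the "stop the EC2 instances, leave RDS alone" piece might look like this. It assumes the function is subscribed to the SNS topic and has IAM permission for `ec2:DescribeInstances` and `ec2:StopInstances`. The `ec2` parameter is a hypothetical injection point I added so the logic can be exercised without AWS credentials, and pagination is left out:

```python
def running_instance_ids(ec2):
    """Collect IDs of running EC2 instances (pagination omitted for brevity)."""
    resp = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def handler(event, context, ec2=None):
    """Lambda entry point for budget notifications delivered via SNS."""
    if ec2 is None:
        import boto3  # bundled with the Lambda Python runtime
        ec2 = boto3.client("ec2")

    # SNS delivers a list of records; the alert text is in Sns.Message.
    messages = [r["Sns"]["Message"] for r in event.get("Records", [])]
    if not messages:
        return {"stopped": []}

    # RDS instances are a separate service, so this leaves them running.
    ids = running_instance_ids(ec2)
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

Flipping a public bucket to private would be a further call in the same handler. Keep in mind the budget notification itself can lag actual usage, so this limits damage rather than enforcing a hard cap.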
The obvious reason why "stop all spending" is not a good idea is that it would require things like "delete all my S3 data and my RDS snapshots" and so on which perhaps some hobbyist might be happy with but is more likely a footgun for the majority of AWS users.
In the alternative world where the customer's post is "I set up the AWS budget with the stop-all-spending option and it deleted all my data!" you can't really give them back the data. But in this world, you can give them back the money. So this is the safer one than that.
Data transfer can be pulled into the same model by having an alternate internet gateway model where you pay for some amount of unmetered bandwidth instead of per byte transfer, as other providers already do.
And why is that a problem? And how different is that from "forgetting" to pay your bill and having your production environment brought down?
AWS will remind you for months before they actually stop it.
When my computer runs out of hard drive it crashes, not goes out on the internet and purchases storage with my credit card.
It is technically impossible. In that no tech can fix the greed of the people taking these decisions.
> No cloud provider wants to give their customers that much rope to hang themselves with.
They are so benevolent to us...
Since there are in fact two ropes, maybe cloud providers should make it easy for customers to avoid the one they most want to avoid?
1) you hit the cap 2) AWS sends an alert but your stuff still runs at no cost to you for 24h 3) if there's no response, AWS shuts it down forcefully 4) AWS eats the “cost” because, let's face it, it basically costs them a 1000th of what they bill you for 5) you get this buffer 3 times a year. After that, they still do the 24h forced shutdown but you get billed. Everybody wins.
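To show the policy is simple to state, here is a toy version of the decision for the no-response case. The names (`Decision`, `decide`, `graces_used`) are made up for illustration; the 24h window and the three free buffers per year come straight from the proposal:

```python
from dataclasses import dataclass

GRACE_HOURS = 24          # workloads keep running this long after the alert
FREE_GRACES_PER_YEAR = 3  # how many times per year the overage is waived


@dataclass
class Decision:
    keep_running: bool
    bill_overage: bool


def decide(hours_since_alert: float, graces_used: int) -> Decision:
    """Policy after a cap is hit and the customer has not responded."""
    # The overage is free only while the yearly buffer is not used up.
    free = graces_used < FREE_GRACES_PER_YEAR
    # Forced shutdown happens at 24h regardless of who pays.
    within_window = hours_since_alert < GRACE_HOURS
    return Decision(keep_running=within_window, bill_overage=not free)


print(decide(hours_since_alert=1, graces_used=0))
print(decide(hours_since_alert=25, graces_used=3))
```

The hard part is not this logic but metering accurately enough to know when the cap was hit.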
Conversely the first time someone hits an edge case in billing limits and their site goes down, losing 10k worth of possible customer transactions there's no way to unring that bell.
The second constituency are also, you know, the customers with real cloud budgets. I don't blame AWS for not building a feature that could (a) negatively impact real, paying customers (b) is primarily targeted at people who by definition don't want to pay a lot of money.
But an opt-in “I'd rather you delete data/disable than send me a 100k bill” toggle with suitable disclaimers would mean people can safely learn.
That way everyone gets what they want. (Well, except cloud providers, who presumably don’t like limits on their open-ended bills)
But hey, let's say you have different priorities than me. Then why not both? Why not let me set the hard cap? Why does Amazon insist on being able to bill me more than my business is worth if I make a mistake?
But over the last few years, people have convinced themselves that the cost of ignorance is low. Companies hand out unlimited self-paced learning portals, tick the “training provided” box, and quietly stop validating whether anyone actually learned anything.
I remember when you had to spend weeks in structured training before you were allowed to touch real systems. But starting around five or six years ago, something changed: Practitioners began deciding for themselves what they felt like learning. They dismantled standard instruction paths and, in doing so, never discovered their own unknown unknowns.
In the end, it created a generation of supposedly “trained” professionals who skipped the fundamentals and now can’t understand why their skills have giant gaps.
The expectation that it just works is mostly a good thing.
Not if it's an Airbus A220 or similar. They made it easy to take off, but it is still a large commercial aircraft...easy to fly...for pilots...
Also, consider using fck-nat (https://fck-nat.dev/v1.3.0/) instead of NAT gateways unless you have a compelling reason to do otherwise, because you will save on per-GB traffic charges.
(Or, just run your own Debian nano instance that does the masquerading for you, which every old-school Linuxer should be able to do in their sleep.)
It's annoying because this is by far the more uncommon case for a VPC, but I think it's the right way to structure permissions and access in general. S3, the actual service, went the other way on this and has desperately been trying to reel it back for years.
A parallel to this is how SES handles permission to send emails. There are checks and hoops to jump through to ensure you can't send out spam. But somehow, letting DevOps folk shoot themselves in the foot (credit card) is ok.
What has been done is the monetary equivalent of "fail unsafe" => "succeed expensively"
Then it is still blocked unless you add a NAT gateway or Internet gateway to the VPC and add a route to them.
If you are doing all of this via IAC, you have to take a lot of steps to make this happen. On the other hand, if I’m using an EC2 instance to run an ETL job from data stored on S3, I’m not putting that EC2 instance in a subnet with internet access in the first place. Why would I?
And no, you don’t need internet access to access the EC2 instance from your computer even without a VPN. You use Systems Manager Session Manager.
I do the same with lambda - attach them to a VPC without internet access with the appropriate endpoints. Even if they are serving an API, they are still using an API gateway
My point is that, architecturally, is there ever in the history of AWS an example where a customer wants to pay for the transit of same-region traffic when a check box exists to say "do this for free"? Authorization and transit/path are separate concepts.
There has to be a better experience.
https://docs.aws.amazon.com/vpc/latest/userguide/egress-only...
And if there are any interoperability concerns, you offer an ability to opt-out with that (instead of opting in).
There is precedent for all of this at AWS.
This is breaking existing IAAC configurations because they rely on the default. You will never see the change you're describing except in security-related scenarios
> There is precedent for all of this at AWS.
Any non-security IAAC default changes you can point to?
Why it should not be done:
1. It mutates routing. Gateway Endpoints inject prefix-list routes into selected route tables. Many VPCs have dozens of RTs for segmentation, TGW attachments, inspection subnets, EKS-managed RTs, shared services, etc. Auto-editing them risks breaking zero-trust boundaries and traffic-inspection paths.
2. It breaks IAM / S3 policies. Enterprises commonly rely on aws:sourceVpce, aws:SourceIp, Private Access Points, SCP conditions, and restrictive bucket policies. Auto-creating a VPCE would silently bypass or invalidate these controls.
3. It bypasses security boundaries. A Gateway Endpoint forces S3 traffic to bypass NAT, firewalls, IDS/IPS, egress proxies, VPC Lattice policies, and other mandatory inspection layers. This is a hard violation for regulated workloads.
4. Many VPCs must not access S3 at all. Air-gapped, regulated, OEM, partner-isolated, and inspection-only VPCs intentionally block S3. Auto-adding an endpoint would break designed isolation.
5. Private DNS changes behavior. With Private DNS enabled, S3 hostname resolution is overridden to use the VPCE instead of the public S3 endpoint. This can break debugging assumptions, routing analysis, and certain cross-account access patterns.
6. AWS does not assume intent. The VPC model is intentionally minimal. AWS does not auto-create IGWs, NATs, Interface Endpoints, or egress paths. Defaults must never rewrite user security boundaries.
“We have no idea what your intent is, so we’ll default to routing AWS-AWS traffic expensively” is way, way worse than forcing users to be explicit about their intent.
Minimal is a laudable goal - but if a footgun is the result then you violate the principle of least surprise.
I rather suspect the problem with issues like this is that they mainly catch the less experienced, who aren’t an AWS priority because they aren’t where the Big Money is.
How are you inspecting zero-trust traffic? Not at the gateway/VPC level, I hope, as naive DPI there will break zero-trust.
If it breaks closed as it should, then it is working as intended.
If it breaks open, guess it was just useless pretend-zero-trust security theatre then?
And ok, this is a mistake you will probably only make once - I know, because I too have made it on a much smaller scale, and thankfully in a cost-insensitive customer's account - but surely if you're an infrastructure provider you want to try to ensure that you are vigilantly removing footguns.
It's a free service after all.
Oof, this hit home, hah.
Only half-joking. When something grossly underperforms, I do often legitimately just pull up calc.exe and compare the throughput to the number of employees we have × 8 kbit/min [0], see who would win. It is uniquely depressing yet entertaining to see this outperform some applications.
[0] spherical cow type back of the envelope estimate, don't take it too seriously; assumes a very fast 200 wpm speech, 5 bytes per word, and everyone being able to independently progress
I uploaded a small xls with uid and prodid columns and then kind of forgot about it.
A few months later I get a note from bank saying your account is overdrawn. The account is only used for freelancing work which I wasn't doing at the time, so I never checked that account.
Looks like AWS was charging me over 1K / month while the algo continuously worked on that bit of data that was uploaded one time. They charged until there was no money left.
That was about 5K in weekend earnings gone. Several months worth of salary in my main job. That was a lot of money for me.
Few times I've felt so horrible.
And of course I give every online service a separate virtual credit card (via privacy dot com, but your bank may issue them directly) with a spend limit set pretty close to the expected usage.
I have never understood why the S3 endpoint isn't deployed by default, except to catch people making this exact mistake.
"I'd like to spend the next sprint on S3 endpoints by default"
"What will that cost"
"A bunch of unnecessary resources when it's not used"
"Will there be extra revenue?"
"Nah, in fact it'll reduce our revenue from people who meant to use it and forgot before"
"Let's circle back on this in a few years"
BTW you can of course self-host k8s, or dokku, or whatnot, and have as easy a deployment story as with the cloud. (But not necessarily as easy a maintenance story for the whole thing.)
Cloud cult was successfully promoted by all major players, and people have completely forgotten about the possibilities of traditional hosting.
But when I see a setup form for an AWS service or the never-ending list of AWS offerings, I get stuck almost immediately.
I get pulled into a fair number of "why did my AWS bill explode?" situations, and this exact pattern (NAT + S3 + "I thought same-region EC2→S3 was free") comes up more often than you’d expect.
The mental model that seems to stick is: S3 transfer pricing and "how you reach S3" pricing are two different things. You can be right that EC2→S3 is free and still pay a lot because all your traffic goes through a NAT Gateway.
The small checklist I give people:
1. If a private subnet talks a lot to S3 or DynamoDB, start by assuming you want a Gateway Endpoint, not the NAT, unless you have a strong security requirement that says otherwise.
2. Put NAT on its own Cost Explorer view / dashboard. If that line moves in a way you didn’t expect, treat it as a bug and go find the job or service that changed.
3. Before you turn on a new sync or batch job that moves a lot of data, sketch (I tend to do this with Mermaid) "from where to where, through what, and who charges me for each leg?" It takes a few minutes and usually catches this kind of trap.
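If a diagram feels like overkill, the "who charges me for each leg" question from item 3 also fits in a few lines of Python. The rates below are assumptions for illustration only (roughly what I'd expect for NAT data processing in a US region, with same-region S3 transfer and Gateway Endpoints at zero), and `Leg`, `path_cost`, and the hop labels are made-up names. Check the pricing pages for your region before trusting any number:

```python
from dataclasses import dataclass

# Assumed per-GB rates, for illustration only.
RATE_PER_GB = {
    "s3_same_region": 0.0,    # EC2 <-> S3 transfer within one region
    "nat_gateway": 0.045,     # NAT Gateway data processing (assumed)
    "gateway_endpoint": 0.0,  # S3/DynamoDB Gateway Endpoint
}


@dataclass
class Leg:
    src: str
    dst: str
    via: str  # key into RATE_PER_GB: who charges for this leg


def path_cost(legs, gigabytes):
    """Total cost of pushing `gigabytes` through every leg of a path."""
    return sum(RATE_PER_GB[leg.via] * gigabytes for leg in legs)


through_nat = [
    Leg("ec2 (private subnet)", "nat gateway", "nat_gateway"),
    Leg("nat gateway", "s3", "s3_same_region"),
]
through_endpoint = [
    Leg("ec2 (private subnet)", "gateway endpoint", "gateway_endpoint"),
    Leg("gateway endpoint", "s3", "s3_same_region"),
]

job_gb = 20_000  # a 20 TB batch job
print(f"via NAT:      ${path_cost(through_nat, job_gb):,.2f}")
print(f"via endpoint: ${path_cost(through_endpoint, job_gb):,.2f}")
```

Same data, same free S3 transfer, and the only difference is the hop in the middle.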
Cost Anomaly Detection doing its job here is also the underrated part of the story. A $1k lesson is painful, but finding it at $20k is much worse.
Crucial for the approval was that we had cost alerts already enabled before it happened and were able to show that this didn't help at all, because they triggered way too late. We also had to explain in detail what measures we implemented to ensure that such a situation doesn't happen again.
Do you just delete when the limit is hit?
s/everyone has/a bunch of very small customers have/
A bunch of data went down the "wrong" pipe, but in reality most likely all the data never left their networks.
Hard no. Had to pay I think 100$ for premium support to find that out.
How does this actually work? So you upload your data to AWS S3 and then if you wish to get it back, you pay per GB of what you stored there?
Though it's important to note that this specific case was a misconfiguration that is easy to make/not understand, in that the data was not intended to leave AWS services (and thus should be free) but, due to using the NAT gateway, the data did leave the AWS nest and was charged at a higher rate per GB than if just pulling everything straight out of S3/EC2, by about an order of magnitude (generally speaking, YMMV depending on region, requests, total size, whether it's an expedited archival retrieval, etc.)
So this is an atypical case, doesn't usually cost $1000 to pull 20TB out of AWS. Still this is an easy mistake to make.
And people wonder why Cloudflare is so popular, when a random DDoS can decide to start inflicting costs like that on you.
But “security” people might say. Well, you can be secure and keep the behavior opt out, but you should be able to have an interface that is upfront and informs people of the implications
You can see why, from a sales perspective: AWS' customers generally charge their customers for data they download - so they are extracting a % off that. And moreover, it makes migrating away from AWS quite expensive in a lot of circumstances.
Please get some training...and stop spreading disinformation. And to think on this thread only my posts are getting downvoted....
"Free data transfer out to internet when moving out of AWS" - https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
In the link you posted, it even says Amazon can't actually tell if you're leaving AWS or not so they're going to charge you the regular rate. You need explicit approval from them to get this 'free' data transfer.
People are trying to tell you something with the downvotes. They're right.
Egress bandwidth costs money. Consumer cloud services bake it into a monthly price, and if you’re downloading too much, they throttle you. You can’t download unlimited terabytes from Google Drive. You’ll get a message that reads something like: “Quota exceeded, try again later.” — which also sucks if you happen to need your data from Drive.
AWS is not a consumer service so they make you think about the cost directly.
Sure you can just ram more connections through the lossy links from budget providers or use obscure protocols, but there's a real difference.
Whether it's fairly priced, I suspect not.
We are programmed to receive. You can check out any time you like, but you can never leave
I was lucky to have experienced all of the same mistakes for free (ex-Amazon employee). My manager just got an email saying the costs had gone through the roof and asked me to look into it.
Feel bad for anyone that actually needs to cough up money for these dark patterns.
Sure, it decreases the time necessary to get something up running, but the promises of cheaper/easier to manage/more reliable have turned out to be false. Instead of paying x on sysadmin salaries, you pay 5x to mega corps and you lose ownership of all your data and infrastructure.
I think it's bad for the environment, bad for industry practices and bad for wealth accumulation & inequality.
Assuming he got it working, he could have opened the service without directly going further into debt, with the caveat that if he messed up the pricing model, and it took off, it could have annihilated his already dead finances.
AWS just yesterday launched flat rate pricing for their CDN (including a flat rate allowance for bandwidth and S3 storage), including a guaranteed $0 tier. It’s just the CDN for now, but hopefully it gets expanded to other services as well.
- raw
- click-ops
Because, when you build your infra from scratch on AWS, you absolutely don't want the service gateways to exist by default. You want to have full control on everything, and that's how it works now. You don't want AWS to insert routes in your route tables on your behalf. Or worse, having hidden routes that are used by default.
But I fully understand that some people don't want to be bothered by those technicalities and want something that works and is optimized following the Well-Architected Framework pillars.
IIRC they already provide some CloudFormation Stacks that can do some of this for you, but it's still too technical and obscure.
Currently they probably rely on their partner network to help onboard new customers, but for small customers it doesn't make sense.
Why? My work life is in terraform and cloudformation and I can't think of a reason you wouldn't want to have those by default. I mean I can come up with some crazy excuses, but not any realistic scenario. Have you got any? (I'm assuming here that they'd make the performance impact ~0 for the vpc setup since everyone would depend on it)
If I declare two aws_route resources for my route table, I don't want a third route existing and being invisible.
I agree that there is no logical reason to not want a service gateway, but it doesn't mean that it should be here by default.
The same way you need to provision an Internet Gateway, you should create your services gateways by yourself. TF modules are here to make it easier.
Everything that comes by default won't appear in your TF, so it becomes invisible and the only way to know that it exists is to remember that it's here by default.
If you don't have a specific need for a specific service they are offering stay away, it's a giant ripoff.
If you need generic stuff like VMs, data storage, etc., you are much better off using Hetzner, OVH, etc., and some standalone CDN if you need one.
We are primarily using Hetzner for the self-serve version of Geocodio and have been a very happy customer for decades.
The key is, do not make decisions lightly in the cloud, just because something is easy to enable in the UI does not mean it's recommended. Sit down with the pricing page or calculator and /really/ think over your use case. Get used to thinking about your infrastructure in terms of batch jobs instead of real time and understand the implementation and import of techniques like "circuit breakers."
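For anyone who hasn't met the term, a circuit breaker in its plainest form is small. This is a generic sketch of the pattern, not anything AWS-specific: after a run of failures it stops calling the downstream thing for a cooldown period, so a broken job can't keep retrying (and billing) forever. The class name, thresholds, and injectable `clock` are my own choices for illustration:

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cooldown in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("skipping call while the breaker is open")
            # Cooldown elapsed: let one trial call through (half-open).
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success closes the breaker again
        return result
```

Wrap the expensive call as `breaker.call(fetch_batch, key)` (with `fetch_batch` being whatever your job does) and treat `CircuitOpen` as the signal to stop and page someone rather than retry.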
Once you get the hang of it it's actually very easy and somewhat liberating. It's really easy to test solutions out in a limited form and then completely tear them down. Personally I'm very happy that I put the effort in.
Actually I am not willing to spend that much time reading "cloud" provider docs and best practices.
What I care about is hosting services and getting the mission done. I value predictable and reasonable costs much more than "flexibility".
In general whatever host I have ever used typically have "cloud" offerings too. So real services go to dedicated hosts, experiments go into 5 buck a month vms.
It doesn't take that much time just intentional reading to learn something new. Do you not read any docs?
> I value predictable and reasonable costs much more than "flexibility".
The point is that these aren't mutually exclusive.
> So real services go to dedicated hosts
And when it becomes overloaded, needs upgrading, or requires other maintenance? There's a problem I've completely left behind.
I learned AWS the same way most "bootstrapped" people do, with the free tier. Maybe it's more of a minefield than it was a decade ago.
This should be illegal. If you can't inform me about the bill on my request you shouldn't be legally able to charge me that bill. Although I can already imagine plenty of ways somebody could do malicious compliance with that rule.
But of course, the incentive to optimize this is not there.
It's all in the docs: https://docs.aws.amazon.com/vpc/latest/privatelink/concepts....
>There is another type of VPC endpoint, Gateway, which creates a gateway endpoint to send traffic to Amazon S3 or DynamoDB. Gateway endpoints do not use AWS PrivateLink, unlike the other types of VPC endpoints. For more information, see Gateway endpoints.
Even the first page of VPC docs: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-ama...
>Use a VPC endpoint to connect to AWS services privately, without the use of an internet gateway or NAT device.
The author of the blog writes:
> When you're using VPCs with a NAT Gateway (which most production AWS setups do), S3 transfers still go through the NAT Gateway by default.
Yes, you are using a virtual private network. Where is it supposed to go? It's like being surprised that data in your home network goes through a router.
I think it's okay if someone missed something in the docs and wanted to share from their experience. In fact, if you look at the S3 pricing page [0], under Data Transfer, VPC endpoints aren't mentioned at all. It simply says data transfer is free between AWS services in the same region. I think that much detail would be enough to reasonably assume you didn't have to set up additional items to accomplish that.
The solution is to move your processing infrastructure to Hetzner.
Make sure they go to a list with multiple people on it. Make sure someone pays attention to that email list.
It's free and will save your bacon.
I've also had good luck asking for forgiveness. One time I scaled up some servers for an event and left them running for an extra week. I think the damage was in the 4 figures, so not horrendous, but not nothing.
An email to AWS support led to them forgiving a chunk of that bill. Doesn't hurt to ask.
Personally I miss ephemeral storage - having the knowledge that if you start the server from a known good state, going back to that state is just a reboot away. Way back when I was in college, a lot of our big-box servers worked like this.
You can replicate this on AWS with snapshots or formatting the EBS volume into 2 partitions and just clearing the ephemeral part on reboot, but I've found it surprisingly hard to get it working with OverlayFS
And then writing “I regret it” posts that end up on HN.
Why are people not getting the message to not use AWS?
There’s SO MANY other faster cheaper less complex more reliable options but people continue to use AWS. It makes no sense.
- DevOops
Unexpected, large AWS charges have been happening for so long, and so egregiously, to so many people, including myself, that we must assume it's by design of Amazon.
Also as a shameless plug: Vantage covers this exact type of cost hiccup. If you aren't already using it, we have a very generous free tier: https://www.vantage.sh/
It really shows the Silicon Valley disconnect with the real world, where money matters.
And I can see how, in very big accounts, small mistakes on your data source when you're doing data crunching, or wrong routing, can put thousands and thousands of dollars on your bill in less than an hour.
--
0: https://blog.cloudflare.com/aws-egregious-egress/
By default a NGW is limited to 5Gbps https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway...
A GB transferred through a NGW is billed 0.05 USD
So, at continuous max transfer speed, it would take almost 9 hours to reach $1000
Assuming a setup in multi-AZ with three AZs, it's still 3 hours if you have messed so much that you can manage to max your three NGWs
I get your point but the scale is a bit more nuanced than "thousands and thousands of dollars on your bill in less than an hour"
The default limitations won't allow this.
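The back-of-the-envelope math above, spelled out so the inputs can be swapped. It uses the 5Gbps limit and 0.05 USD/GB figures quoted in this comment and assumes the link stays saturated the whole time. `hours_to_spend` is just a name I picked:

```python
def hours_to_spend(target_usd, gbps, usd_per_gb, gateways=1):
    """Hours of saturated transfer needed to run up target_usd in NGW charges."""
    gb_per_second = gbps / 8 * gateways       # gigabits -> gigabytes
    usd_per_second = gb_per_second * usd_per_gb
    return target_usd / usd_per_second / 3600


print(round(hours_to_spend(1000, 5, 0.05), 2))              # 8.89
print(round(hours_to_spend(1000, 5, 0.05, gateways=3), 2))  # 2.96
```

At 5Gbps that is 0.625 GB/s, or about 0.03 USD per second per gateway.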
Let's say they decide to recalculate or test an algorithm: they do parallel data loading from the bucket(s), and they're pulling from the wrong endpoint or region, and off they go.
And maybe they're sending data back, so they double the transfer price. RDS Egress. EC2 Egress. Better keep good track of your cross region data!