CloudSQL Postgres is running with a misconfigured OS OOM killer, crashing the Postmaster randomly even when memory use is below the instance spec. GCP closes this bug report as "Won't fix".
This is a priority 1 issue. Seeing a wontfix for this has completely destroyed my trust in their judgement. The bug report states that they have been in contact with support since February.
Unbelievable attitude towards fixing production-critical problems on their platform that affect all customers.
My current workplace uses GCP, my last workplace used AWS, and personally I’ve found AWS to have much higher average quality. At my current workplace we’ve stopped using Cloud SQL, and moved our Postgres usage to Aiven (with VPC peering). Aiven seem to do a much better job operating Postgres than GCP do.
Basically, their Cloud Tracing product is broken for modern Node/Postgres (in terms of showing PG queries and whatnot in traces), users have found the issue (and a seemingly super simple fix), but it’s been over a year and Google still haven’t fixed it. Google’s response is “yeah, we know pretty core functionality of this product is broken, but we’re not fixing it in the near future.” Or maybe ever? Many of their products feel semi-abandoned like this, especially in their observability stack - major bugs and/or performance issues that they never fix, and extremely limited features.
Cloud SQL isn’t terrible, but at least the Postgres version is one of the weaker managed Postgres offerings out there. And their whole observability stack (Logging/Monitoring/Tracing/Error Reporting) is legit terrible compared to competing products. Compared to other products I’ve used in the space: Cloud Logging is unbelievably worse than Sumo Logic, Cloud Metrics is so much worse than Grafana+Prometheus, Cloud Tracing is way worse than offerings from Datadog or New Relic, Cloud Error Reporting is ridiculously far behind Sentry, etc.
The GCP options are often quite cheap, but it shows in their extremely limited features, poor performance and plentiful bugs. Go with GCP for the things they do well, but don’t bother adopting their solution for everything simply to stick with one platform, as so many of their products are just so poor compared to competitors.
Google isn't in the business of selling things to end users, they're in the business of selling ads. The only thing GCP gives them (outside of getting wall streeters off their backs a few years ago when everyone and their brother was starting a cloud service) is a credit to their own infrastructure cost by selling excess to random joes.
Therefore I'm not surprised that AWS continues to be the de facto choice, they do sell things to end users. I'm not surprised that Azure is growing quickly, either, since MS also sells things to end users and they needed a way to transition their on-premise stuff to the wires.
For the most part it works okay and is fine, but there have definitely been a fair number of quirks.
https://issuetracker.google.com/u/2/savedsearches/559773?pli...
Now the issue is just in limbo and the only one who feels the pain is the customer.
I've observed this with Atlassian, where I wanted to report a Jira bug but found that it had already been opened some years before; more than a hundred people had subscribed, yet the bug was still closed as "no activity, must not be relevant". I just found the exact same bug reported for Jira Cloud (I had observed it in the on-prem version): https://jira.atlassian.com/browse/JSWCLOUD-8865 and it was closed there for the very same reason.
I didn't leave a comment because the original report described the issue perfectly, and adding a "me too" comment is just noise in the bug tracker. Guess I'll be noise in future :-(
Seconded. Responsive support too.
I'd consider Aiven if I were still on GCP and looking for a solid managed Postgres provider. As it is, I'm now on DigitalOcean and fairly happy with their managed Postgres offering, but there are a few rough edges so I'm actually still looking at Aiven even though everything else I have is on DO...
* I had some compute servers that were up for 200 days. The customers noticed that they were half as fast as identical hardware just booted. Dropping the file system cache ("echo 3 | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to the newly deployed servers. WTF? File system caches are supposed to be zero-cost discards as soon as processes ask for RAM - but something else is going on. I suspect the kernel is behaving badly with overpopulated RAM management data (TLB entries?), but I don't know how to measure that.
* If that is actually the problem, then a solution might be to shrink the page-management data by reserving explicit hugepages (setting /proc/sys/vm/nr_hugepages to a non-zero value). I'd love to see recommendations on when to use that.
For other processes you'll need a hugepage-aware allocator such as tcmalloc (the new one, not the old one) and transparent hugepages enabled. Again, the benefits of this may be enormous, if page table management is expensive on your services.
You will find a great many blog posts on the web recommending disabling transparent hugepages. These people are all misled. Hugepages are a major benefit.
For workloads that use forking and CoW sharing, like Redis or CRuby, it negates the entire benefit of CoW, since flipping a single bit copies the entire huge page.
It surprised me because I had never executed a query and caused the whole host to crash up until that point - now I'm wondering if this misconfiguration is the cause
Also, Linux's forking model can result in a lot of virtual memory being allocated if a heavy-weight program tries to fork+exec a lot of smaller programs, since fork+exec is not atomic and briefly doubles the virtual memory usage of the original program.
I think there are better ways to spawn programs that don't suffer from this problem now...
If you have programs that are written to allocate virtual memory sparingly (like postgres) then that should be fine.
However, there is a second way you can be caught out: even if you disable overcommit, your program can still be OOM killed for violating cgroup limits, since cgroup limits always behave as though overcommit is enabled (i.e. they allow you to allocate more than you are allowed, and then you get OOM killed when you try to use the allocated memory). This means you'd have to be really careful running e.g. postgres inside a kubernetes pod.
This behaviour really sucks IMO. I would like it if you could set overcommit on a per-program basis, so that e.g. postgres can say "I know what I'm doing - when I allocate virtual memory I want you to really allocate it (and tell me now if you can't...)". I think you can somewhat achieve this with memory locking, but that prevents it from being paged out at all...
Consider this scenario: a process runs a fork(), and shortly after it runs an exec(). Normally, the extra fork only uses a tiny amount of extra memory, because the memory is shared between the parent and the child, until one of them writes to it (copy-on-write).
With overcommit disabled, the kernel must reserve enough space to copy the whole writable RAM of a process when it forks.
So you have a 16GB machine, and an 8.1GB process cannot spawn any other program through the usual fork + exec routine, because the kernel would have to reserve another 8.1GB it doesn't have (workarounds exist, like forking before allocating lots of memory and using IPC to instruct the low-memory fork to fork again and launch, but that's way more complicated than a simple fork + exec).
So if you have a dedicated DB host and you know that your DB engine is very carefully engineered to work with disabled overcommit, you can disable it. On a general-purpose machine a disabled overcommit will waste lots of RAM that's sitting unused.
Even those programs that are “malloc(3) error aware” often do something stupid and counterproductive in response, like attempting to allocate more memory for an exception object / stack trace / error string.
Programs that do something useful in response to a NULL malloc(3) return value — useful for the stability of the system as a whole, better than what the OOM killer gets you — are rare, even on servers. Usually it’s only stateful, long-running, DBMS-like daemons that 1. bother, and 2. have the engineering effort put into them to do the right thing.