No, this is not obvious. If you have a fully concurrent GC then spending 25 out of 1000 CPU cycles on memory management does not "obviously" have an impact on your 99th percentile latency. It would primarily impact your throughput (by 2.5%), just like any other thing consuming CPU cycles.
> We defined a metric called GC stall percentage to measure the percentage of time a Cassandra server was doing stop-the-world GC (Young Gen GC) and could not serve client requests.
Again, this metric doesn't tell you anything if you don't know how long each of the pauses is. If they are, in the limit, infinitesimally small, then you are again only measuring the impact on throughput, not latency.
Certainly, GCs with long STW pauses do impact latency, but then you need to measure histograms of absolute pause times, not averages of ratios relative to application time. That's just a silly metric.
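To illustrate what measuring absolute pause times could look like on a HotSpot JVM: a minimal sketch (hypothetical class, not from the article) that subscribes to per-collection JMX notifications and records each pause duration, so you can compute a real pause-time percentile instead of a stall ratio:

```java
import com.sun.management.GarbageCollectionNotificationInfo;

import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GcPauseRecorder {
    // Recorded pause durations, in milliseconds.
    static final List<Long> pausesMs = Collections.synchronizedList(new ArrayList<>());

    // Subscribe to per-collection notifications so we record each absolute
    // pause duration instead of an aggregate "stall percentage".
    static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            ((NotificationEmitter) gc).addNotificationListener((n, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION.equals(n.getType())) {
                    GarbageCollectionNotificationInfo info =
                            GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
                    pausesMs.add(info.getGcInfo().getDuration());
                }
            }, null, null);
        }
    }

    // Nearest-rank percentile over the recorded pauses.
    static long percentile(List<Long> xs, double p) {
        List<Long> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(0, rank - 1));
    }

    public static void main(String[] args) {
        install();
        // Churn some garbage to trigger a few collections, then report p99.
        for (int i = 0; i < 200_000; i++) {
            byte[] b = new byte[4096];
            if (b[0] != 0) System.out.print(""); // keep the allocation live
        }
        if (!pausesMs.isEmpty()) {
            System.out.println("p99 GC pause: " + percentile(pausesMs, 99) + " ms");
        }
    }
}
```

`GarbageCollectionNotificationInfo` is `com.sun.management`, so this assumes HotSpot; the point is just that a histogram of these durations tells you something about tail latency that "X% of time in GC" does not.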
Nor does the article mention which JVM or GC they're using. Absent further information, they might have gotten their 10x improvement relative to some especially poor choice of JVM and GC.
It's pretty much accepted everywhere that GCs perform terribly for databases. Modern GCs are great at handling small, very short-lived memory allocations, and that's about it. Just about any other workload and manual memory management ends up being a much better use of your time than GC tuning.
That is not a given. And even distribution is only part of the equation: if the pauses are sufficiently short, then even being somewhat unevenly distributed should not have much of an impact on latency. For example, if the max length of a pause were 1ms, and p99 latency were 15ms, you'd have to be fairly unlucky to see a 33% increase in p99 latency due to GC. That would entail 5 of the 25 pauses in a 1s window all happening during one 20ms period.
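A back-of-the-envelope sketch of that argument, using the same hypothetical numbers from above (25 one-millisecond pauses per second, 15ms p99), not measurements from the article:

```java
public class PauseImpact {
    public static void main(String[] args) {
        double pausesPerSec = 25;  // 25ms/s of GC, split into 1ms pauses
        double pauseMs = 1.0;
        double requestMs = 15.0;   // the p99 request latency

        // A request of length r overlaps, on average, rate * (r + pause)
        // uniformly distributed pauses.
        double expectedPauses = pausesPerSec * (requestMs + pauseMs) / 1000.0;
        double expectedAddedMs = expectedPauses * pauseMs;

        System.out.printf("expected pauses overlapping one request: %.3f%n", expectedPauses);
        System.out.printf("expected added latency: %.3f ms (%.1f%% of p99)%n",
                expectedAddedMs, 100 * expectedAddedMs / requestMs);
    }
}
```

With these numbers a typical request overlaps ~0.4 pauses, adding ~0.4ms (under 3% of the 15ms p99), which is why the +5ms / 5-pause coincidence is the unlucky case rather than the expected one.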
(This idea is not purely hypothetical. For example, Go's GC has very low STW periods.)
> It's pretty much accepted everywhere
Eh. Apparently everyone thinks C is the best language for cryptography and other secure but not particularly perf-sensitive code. Go figure. Sometimes the wisdom of the masses is not wisdom. Best not to appeal to it in an argument.
Cassandra is a constant struggle with the GC. I'd guess the cost of running it is at least an order of magnitude greater than if it had been implemented in C++ or something more sensible.
So this thing you built and open sourced has gotten you real, measurable results? Allow me to list the many ways you're probably wrong and doing it incorrectly.
I'm trying to understand the meaning. Is it saying that the latency caused by GC is applied to all requests, not just the ones that observe 99th-percentile latency?
So yes, at the `infinitesimally small` end, time would be 'stolen' evenly from all request threads and would not be a contributing factor to the 99th percentile.
A concurrent GC spends CPU cycles on different cores to do its work, which means it will not cause latency outliers in the threads processing the requests. They are still CPU cycles you don't have to serve other requests, hence they still affect throughput.
That is a simplified explanation, of course; there are a lot of caveats.
In my original post I was mostly speaking about the measurement, though: they are measuring throughput when they are concerned about latency. Those are somewhat related, but depending on circumstances only weakly so.
https://academy.datastax.com/content/distributed-data-show-e...
Is my guess here correct, or are there things I'm missing or mistaken on?
This hybrid approach gives the benefit of a managed runtime and safety of GC for most of your code, but allows the performance of raw pointers/malloc for key code paths.
Some examples of this pattern on the JVM:
- The Neo4j Page Cache, Muninn, https://github.com/neo4j/neo4j/blob/3.4/community/io/src/mai...
- The Netty project's implementation of jemalloc for the JVM: https://github.com/netty/netty/blob/4.1/buffer/src/main/java...
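A toy illustration of the off-heap side of that pattern (`OffHeapCache` is a hypothetical class, not from either project): a direct `ByteBuffer` holds the data outside the GC heap, so the collector never traces or copies its contents, only the small wrapper object:

```java
import java.nio.ByteBuffer;

public class OffHeapCache {
    // Off-heap storage: allocateDirect places the bytes outside the Java
    // heap, so GC pause time doesn't scale with the cached data.
    private final ByteBuffer buf;

    OffHeapCache(int capacityBytes) {
        buf = ByteBuffer.allocateDirect(capacityBytes);
    }

    void putLong(int slot, long value) {
        buf.putLong(slot * 8, value);
    }

    long getLong(int slot) {
        return buf.getLong(slot * 8);
    }

    public static void main(String[] args) {
        OffHeapCache cache = new OffHeapCache(1024); // 128 long slots
        cache.putLong(0, 42L);
        System.out.println(cache.getLong(0)); // prints 42
    }
}
```

Real implementations like Muninn or Netty's pooled allocator add their own allocation, pooling, and lifetime management on top, which is exactly the manual-memory-management burden you're buying back.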
Everything is relative, I guess.
I wonder how long before we see a similar result from Elasticsearch. (The only other huge JVM-based store I can think of.)
We still have "other" things on-heap. The biggest contributor to GC pain tends to be the number of objects allocated on the read path, so this patch works around that by pushing much of that logic down to RocksDB.
There are certainly other things you can do in the code itself that would also help - one of the biggest contributors to garbage is the column index. CASSANDRA-9754 fixes much of that (the JIRA is inactive, but development work on it is ongoing).
I wonder how the numbers would have looked with the new low latency GC for Hotspot (ZGC). https://wiki.openjdk.java.net/display/zgc/Main
Early results from SPECjbb2015 are impressive. https://youtu.be/tShc0dyFtgw?t=5m1s
When you already know Cassandra, and you already know RocksDB, and you already have an engineering team, it makes far more sense to combine the two things you know how to use at scale than to try to use some new thing NOBODY has run at scale.
Outbrain uses ScyllaDB in production at scale across multiple data centers. Not sure if it's Instagram scale, but still enough to prove its reliability and performance.
https://www.outbrain.com/techblog/2016/08/scylladb-poc-not-s...
Scylla is AGPL for the OSS version though so testing it out would not be an option without getting a commercial license first.
Huh? The AGPL is not a non-commercial-use-only license.
If you have proprietary software that you would like to combine with AGPL code (i.e., not just interact with as a service) and make available to the general public over the Internet, and you want to keep your code proprietary, sure, you may not want to use the AGPL. But you could say the same thing about proprietary software you want to combine with GPL code and sell to the general public.
If you're either using the software through its existing, defined public interfaces, or you're okay releasing anything you modify or link into the software, the AGPL (and the GPL) are fine. Lots of people distribute proprietary products that include GPL code, like Chromebooks, Android phones, routers, GitHub Enterprise, etc. We figured out years ago that the Linux kernel is not just a non-commercial product. Why are we having the same misconceptions about the AGPL?
The issue isn't so much Java the language, as it is being aware of the GC, and developing with it in mind.
I don't believe one would need a commercial license just to test a product? At that point they are not making it part of any product, so no concerns here.
No one claims that a product using the MySQL driver is a derivative work of the MySQL server?
Edit: Of course, IANAL...
- How many experienced ScyllaDB devops engineers are there globally that we can hire?
Those are the questions asked at BigTechCo before it adopts somebody else's tech.
FB already operates RocksDB and Cassandra, so there's way less technical, career, and financial risk in just hacking the two together with some aggressive refactoring.
Any advice?
If it were as simple as tweaking a few GC settings to get a 10x improvement, I'm pretty sure DataStax would've done it by now.
But, what? It turns out the article is really about replacing Java with C++.
They managed to generalize one application that meets their feature needs to be a front end to another existing application with fewer features but better performance as a back end. They're optimizing their hot path by decoupling it from the rest of the application and handing off to C++ code they didn't even have to write. Adding pluggable storage engines to Cassandra means that if they make the API smooth enough they can have engines in C, C++, Erlang, Go, Rust, ML, or whatever in the future without changing their front end. That's a big win even beyond this tail latency issue.
The GC problem is not limited to C*. This shit (the virtual machine) is hitting the whole Hadoop stack: HDFS, Hive, Spark, Flink, Pig...
In any fairly large cluster, an immense number of tickets are related in some way to GC and JVM behavior.
What I took away from that is that something like Cassandra is better suited to implementation in a language like C++ or Rust. And I believe others have since come along and done exactly that.
I really liked the gossip-based federation in Cassandra though.
Since coming to Google I don't get the opportunity to compare/evaluate/deploy tools like this anymore. Smarter people than me make choices like that :-)
(I have used Azul in low-latency production environments. It has pros and cons but it certainly beats re-writing the storage layer... )
That's the biggest drawback, especially when it's a Facebook company. Otherwise it works well, but it can be pricey.
The JVM is getting a new fully concurrent collector though called Shenandoah: https://www.google.com/search?q=shenandoah+gc
Minor configuration issues (we had a very complex environment, custom kernel, weird network stuff, JNIs)
https://issues.apache.org/jira/browse/CASSANDRA-13474 [2 comments from 2017 Apr]
https://issues.apache.org/jira/browse/CASSANDRA-13475 [~100 comments, but the last one is from 2017 Nov, by the InstaG engineer]
And the Rocksandra fork is already ~3500 commits behind master, so upstreaming this will be interesting.
Oh, and the Rocksandra fork is already kind of abandoned - no commits since Dec 2017 (which probably means this is not actually the code that runs under Instagram).
Umm... shouldn't the stalls go to 0, now that you have moved to C++? Or is this the time the manual memory management takes?
Any thoughts on replacing HDFS + YARN + Hive + HBase with GlusterFS + Kubernetes + Cassandra?