The phenomenon described by the author can lead to interesting social dynamics over time. The initial designer of a system understands the latency/utilization tradeoff and dimensions the system to be underutilized so as to meet latency goals. Then the system launches and is successful, so people start questioning the low utilization and apply pressure to increase it in order to reduce costs. Invariably latency goes up and customers complain. Customers escalate, and projects are started to reduce latency. People screw around at the margins, changing the number of threads etc, but the fundamental tradeoff cannot be avoided. Nobody is happy in the end. (Been through this cycle a few times already.)
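To make the tradeoff concrete, here's a quick sketch using the textbook M/M/1 queueing result, where mean time in system is service_time / (1 - utilization). The 10 ms service time is just an assumed number; the point is the shape of the curve, which is why "push utilization up, watch latency explode" keeps happening:

```python
# M/M/1 queue: mean latency = service_time / (1 - rho), rho = utilization.
service_time = 0.010  # 10 ms per request (assumed)

for rho in (0.3, 0.5, 0.7, 0.9, 0.95, 0.99):
    latency_ms = service_time / (1 - rho) * 1000
    print(f"utilization {rho:>4.0%}: mean latency {latency_ms:6.1f} ms")
```

Going from 50% to 90% utilization already quintuples mean latency, and the curve goes vertical as you approach 100%. No amount of thread-count tuning changes that.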
There's a great book called Principles of Product Development Flow. It carefully looks at the systems behind how things get built. Key to any good feedback loop is low latency. So if we want our software to get better for users over time, low latencies from idea to release are vital. But most software processes are tuned for keeping developers 100% busy (or more!), which drastically increases system latency. That latency means we get a gain in efficiency (as measured by how busy developers are) but a loss in how effective the system is (as determined by creation of user and business value).
As teams get pushed to utilize scarce, expensive developer resources to the max, they can also end up with huge latency for unanticipated requests. It's not always easy to justify keeping planned work well under a team's capacity, though, even when it leads to better overall outcomes at the end of the day.
As another abstract example that's completely disconnected from the real world: if we're running the world's capacity to make n95 masks at high utilization, it may take a while to be able to handle a sudden spike in demand.
But this:
> If you must use latency to measure efficiency, use mean (avg) latency. Yes, average latency
Not sure if I ever thought about it before, but after following the link[1] where OP talks more about it, they've convinced me. I definitely want mean latency at least in addition to the median, not the median alone.
There was an interesting article here not long ago that made the point that median is basically useless. If you load 5 resources on a page load, the odds of all of them being faster than the median (so it represents the user experience) is about 3%. You need a very high rank to get any useful information, probably with a number of 9s.
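A quick back-of-the-envelope check of that claim: if each of n independent fetches beats the p-th percentile with probability p, all of them do with probability p**n. The independence assumption is a simplification, but the numbers line up with the ~3% figure:

```python
# Probability that all n independent resource fetches beat the p-th percentile
# latency is p**n (assuming independence, which is a simplification).
n = 5
print(0.5 ** n)         # median: ~0.031, so only ~3% of page loads are "all fast"

# To have a 95% chance that every fetch beats the tracked percentile,
# each resource needs to hit p = 0.95 ** (1/n):
print(0.95 ** (1 / n))  # ~0.99, i.e. roughly a p99 per-resource target
```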
Problem is, sometimes system engineers do not know what to expect, but they still need to have a plan for this case.
The majority of computer systems do not deal with high utilization. As has been pointed out many times, computers are really fast these days, and many businesses could get through their entire lifetime on a single machine if the underlying software made efficient use of the hardware. And yet even at low utilization, we still see occasional high latency, often enough to frustrate users. Why is that? Because a lot of software these days is based on a design that intersperses low-latency operations with occasional high-latency ones. This shows up everywhere: garbage collection, disk and memory fragmentation, growable arrays, eventual consistency, soft deletions followed by actual hard deletions, etc.
What this article is advocating for is essentially an amortized analysis of throughput and latency, in which case you do have a nice and steady relationship between utilization and latency. But in a system which may never come close to full utilization of its underlying hardware resources (which is a large fraction of software running on modern hardware), this amortized analysis is not very valuable, because even at very low utilization you can still see very different latency distributions due to the aforementioned software design and whatever tweaks you make to it.
This is why many software systems don't care about the median or the mean latency, but about the p99 or p99.9 latency: there is a utilization-independent component to the statistical distribution of latency over time, and for the many software systems with low utilization of hardware resources, that component, not utilization, is the main determinant of the overall latency profile.
As an oversimplified example, suppose that your system is 10% utilized and that $BAD_THING (gc, or whatever) happens that effectively slows down the system by a factor of 10, at least temporarily. Your latency does not go up by 10x; it grows unbounded, because now your effective utilization is 100%.
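In queueing terms: slowing service by 10x multiplies utilization by 10, and the M/M/1 wait-time formula blows up the moment arrival rate reaches service rate. A minimal sketch of that arithmetic (the rates are made-up numbers matching the 10% example):

```python
def mm1_wait(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    Unbounded once arrival_rate >= service_rate."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

mu, lam = 100.0, 10.0            # 10% utilized: latency ~11 ms
print(mm1_wait(lam, mu))
print(mm1_wait(lam, mu / 10.0))  # $BAD_THING makes service 10x slower:
                                 # rho hits 1.0, latency is unbounded (inf)
```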
>> If you must use latency to measure efficiency, use mean (avg) latency. Yes, average latency
What is wrong with measuring latency at the 99.99th percentile, with a clear guideline that optimising efficiency (in this article, higher utilisation) must not trade off against latency?
Because latency is part of user experience. And UX comes first before anything else.
Or does it imply that there are a lot of people who don't know the tradeoff between latency and utilisation? Because I don't know anyone who runs utilisation at 1, or even 0.5, in production.
But p99 is just one summary statistic. Most importantly, it's a robust statistic that rejects outliers. That's a very good thing in some cases. It's also a very bad thing if you care about throughput, because throughput is proportional to 1/latency, and if you reject the outliers then you'll overestimate throughput substantially.
p99 is one tool. A great and useful one, but not for every purpose.
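To illustrate how outlier rejection distorts throughput estimates, here's a toy example for a client issuing requests one at a time (so throughput really is count / total latency). The latency numbers are invented:

```python
# Hypothetical latencies (seconds) for a serial client: 99 fast requests
# plus one 1-second outlier.
latencies = [0.01] * 99 + [1.0]

true_throughput = len(latencies) / sum(latencies)    # ~50 req/s
trimmed = sorted(latencies)[:99]                     # p99-style view: drop the outlier
trimmed_throughput = len(trimmed) / sum(trimmed)     # 100 req/s

print(true_throughput, trimmed_throughput)
```

One outlier in a hundred requests, and the trimmed view overestimates throughput by roughly 2x.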
> Because I dont know anyone who has utilisation to 1 or even 0.5 in production.
Many real systems like to run much hotter than that. High utilization reduces costs, and reduces carbon footprint. Just running at low utilization is a reasonable solution for a lot of people in a lot of cases, but as margins get tighter and businesses get bigger, pushing on utilization can be really worthwhile.
- p50, to see the baseline
- p95, to see the most common latency peaks
- p99, to see what the "normal" waiting times under load were
- max, because that's what the most unfortunate customers experienced
In a typical distributed system the spread between p99 and max can be enormous, but the mindset of ensuring a smooth customer experience, with awareness that a real person had to wait that long, is exceptionally useful. You need just one slightly slower service for the worst-case latency to skyrocket. In particular, GraphQL is exceptionally bad at this without real discipline: the minimum request latency is dictated by the SLOWEST downstream service. To be fair, it was a real-time gambling operation, and we were operating within the first Nielsen threshold.
Also: bucketing these by request route was quite useful.
Let's take the median, which is also an order statistic, and a sequence of latency measurements: 0.005 s, 0.010 s, 3600 s. The median latency is 0.010 s, and that number does not tell you how bad latency can actually be. The mean latency is 1200.005 s, which is far more indicative of how bad the worst case is.
In other words, percentiles show how often a problem happens (or doesn't). Mean values show the impact of the problem.
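Running the three measurements from the comment above through the stdlib makes the contrast obvious:

```python
import statistics

samples = [0.005, 0.010, 3600.0]  # seconds, from the example above

print(statistics.median(samples))  # 0.01 -- the hour-long request vanishes
print(statistics.mean(samples))    # 1200.005 -- the outlier dominates the mean
```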
mean and all its variants won't show you a number that someone in your system necessarily experienced, but they will incorporate the entire distribution and show when it has changed.
in the context of efficiency, the mean is beneficial because it can be used to measure concurrency in a system via Little's law, and it will signal changes in your concurrency that a percentile metric won't necessarily catch.
In my experience, if you want monitoring (or measuring for performance) to provide any value whatsoever, you must measure multiple different aspects of the system all at once: percentiles, averages, load, responses, I/O, memory, etc.
The only time you would need a single metric would possibly be for alerting, and a good alert (IMHO) is one that triggers for impending doom, which the article states percentiles are good for. But I think alerts are outside of the scope of this article.
TLDR; Review of the article: `Duh`
> I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our other serverless products. Before that, I worked on EC2 and EBS.