How percentile approximation works and why it's more useful than averages (opens in new tab)

(blog.timescale.com)

687 pointsod04y ago163 comments

163 comments

bqe4y ago

Awhile ago I wrote a Python library called LiveStats[1] that computed any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2. It uses a polynomial to find good approximations.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

cyral4y ago

There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest

djk4474y ago

NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement, seemed like that was a really interesting tradeoff, we also used uddsketch[1] which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693

2 more replies

jeremysalwen4y ago

Not an expert on this topic but I noticed that the KLL algorithm (published in 2016) was not mentioned in this thread, which provides theoretically optimal performance for streaming quantiles with guaranteed worst case performance: http://courses.csail.mit.edu/6.854/20/sample-projects/B/stre... (And is pretty fast in practice).

1 more reply

riskneutral4y ago

Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/

breuleux4y ago

That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?

2 more replies

fotta4y ago

yep I had to implement t-digest in a monitoring library. another alternative (although older) that the prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...

convolvatron4y ago

i think the new ones started wtih Greenwald-Khanna. but i definately agree - p^2 can be a little silly and misleading. in particular it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.

1 more reply

tantalor4y ago

> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.

rendaw4y ago

> without sorting it... have a fixed size sorted buffer

(that you sort yourself)

1 more reply

djk4474y ago

NB: Post author here.

Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)

Cd00d4y ago

Did the candidate get an offer? Genuinely curious.

I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.

williamdclt4y ago

Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview

bo10244y ago

> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.

eru4y ago

> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use eg QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need linear amount of storage.

Izikiel434y ago

> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjuction?

ilikebits4y ago

The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".

1 more reply

tantalor4y ago

No, for 2 reasons,

1. huge data set: heaps requires storing the whole set, but "huge" means "more than you can store"

2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules

2 more replies

hnfong4y ago

The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.

5faulker4y ago

It's funny that this is often left out from a data & algorithm class.

2 more replies

ZephyrBlu4y ago

How does this get you the median?

abnry4y ago

One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.

But I think we can twist our heads and see in a way that this is backwards. Mathematically, the mean is much easier to work with because it is linear and we can do algebra with it. That's how we got the Central Limit Theorem. Percentiles and the median, except for symmetric distributions, are not as easy to work with. They involve solving for the inverse of the cumulative function.

But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!

a-dub4y ago

i think of it as: if the data is gaussian, use a mean, otherwise go non-parametric (medians/percentiles).

or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.

this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.

lanstin4y ago

Assume up front none of your measured latencies from a software networked system will be Gaussian, or <exaggereation> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is you can combine them easily, but since they are probably a mathematical fiction, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.

2 more replies

feoren4y ago

> the median and percentiles are a more relevant and intuitive number to think about

A good example of this is when people say "more than half of drivers say they are above average drivers! Idiots!" Of course that's perfectly possible if most drivers are basically fine and some are really really bad. For example, 99.99% of people have greater than the average number of legs.

The correct impossible statement would be if more than half of drivers are better than the median driver.

nancarrow4y ago

It's more related to the law of large numbers than the CLT

rahimnathwani4y ago

For some things, you can't even sensibly measure the mean. For example, if you're measuring the mean response time for a service, a single failure/timeout makes the mean response time infinite (because 100 years from now the response still hasn't been received).

"Why Averages Suck and Percentiles are Great": https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...

contravariant4y ago

It's indeed always worth pointing that a mean may or may not exist. Same with variances/standard-deviation.

The central limit theorem seems to have given people the slightly wrong idea that things will always average out eventually.

maxnoe4y ago

Coughs Cauchy

djk4474y ago

totally. that blog was also one of the sources I mentioned in the post! Good stuff

NB: Post author here.

hnuser1234564y ago

Gamers have an intuitive sense of this. Your average framerate can be arbitrarily high, but if you have a big stutter every second between the smooth moments, then a lower but more consistent framerate may be preferable, typically expressed as the 1% and 0.1% slowest frames, which at a relatively typical 100fps, represents the slowest frame every second and every 10 seconds.

tzs4y ago

I have no hope of finding a cite for this, but a long time ago I read some command line UI research that found if you had a system where commands ranged from instant to taking a small but noticeable time and you introduced delays in the faster commands to make it so all commands took the same small but noticeable time people would think that the system was now faster overall.

wruza4y ago

I guess that’s because our minds (and animal minds as well) are always aware of the pace of repetitive events. If something is off, the anxiety alarm rings. One old book on the brain machinery described an example of a cat that was relaxing near the metronome and became alert when it was suddenly stopped. Unpredictable delays are disturbing, because a mispredicted event means you may be in a dangerous situation and have to recalibrate now.

1 more reply

barrenko4y ago

Presumably, also the reason we have those fake queue lines at airports (not sure the correct word is for it).

1 more reply

djk4474y ago

NB: Post author here.

Love this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...

elmalto4y ago

Similarly, I would rather have no wifi than bad/spotty wifi for this reason

account424y ago

For games, one compounding factor is that you need to estimate how long the next frame will take in order to know how much time to simulate. If the variance is greater, the prediction will be worse leading the perceived game "speed" to vary from frame to frame.

eric-hu4y ago

There was a “Latency” multiplayer setting in Starcraft 1. I think it was there precisely to give players control over smoothness of network latency instead of graphics and processing latency. I think this just adds yet another example of your observation.

slaymaker19074y ago

Yep, I often find it useful to limit framerate to a number I know my GPU can handle so that I don't get as much stuttering. It's better to run at 60 FPS all the time than 80 FPS most of the time but with stuttering.

exolymph4y ago

"Slow is smooth and smooth is fast"

kazinator4y ago

Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions.

I would rather have a rock-steady consistent 500 ms reponse time when typing text, than to have a 100 ms average response time which randomly spikes to outliers that go past one second.

A rock-steady, though poor, event rate in a paint application is better for drawing a freehand curve (especially if the program interpolates well) than a really fast rate that suddenly has a glitch in it, spoiling your work.

spiderice4y ago

> Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions

I would be shocked if gamers didn't on average have a more intuitive sense of this than any of the groups you mentioned.

1 more reply

hnuser1234564y ago

Maybe not all, but ones who create and consume analysis like this

https://www.techpowerup.com/review/msi-geforce-rtx-3090-gami...

michaelmdresser4y ago

Looking particularly at latency measurements, I found the "How NOT to Measure Latency" [1] talk very illuminating. It goes quite deep into discussing how percentiles can be used and abused for measurement.

[1]: https://www.infoq.com/presentations/latency-response-time/

lanstin4y ago

I watch this video once a year and send it to my co-workers whenever averages or medians shows up in a graph for public consumption.

peheje4y ago

Are the points written in a readable format anywhere?

1 more reply

rattray4y ago

What are some of the ways percentiles can be abused?

alexanderdmitri4y ago

# Power Statement

!*Plowser*!* is in the 99th percentile of browsers by global user count!

# Fact Sheet

- !*Plowser*! is used by 2,050 people

- sample size of browsers is 10,000 (this includes toy apps on GitHub and almost all browsers in the sample have no record of active users)

- those using !*Plowser*! have no choice as the developing company $*DendralRot Inc*$ forces all its employees, contractors and users of their enterprise '?_shuiteware_?' (a name derived by mashing |-->shite|software|suite<--| into one word) to use their browser

- if we place the number of global browser users at a conservative 1,000,000,000, !*Plowser*! actually has 0.00000205% of users

madars4y ago

Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.

From Wikipedia description: Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it, and the effect of outliers and other influential observations on statistical properties.

pdpi4y ago

There's a fun paper by Autodesk where they make datasets that look whatever way you want them to.

https://www.autodesk.com/research/publications/same-stats-di...

s_gourichon4y ago

Yes, a both funny and insightful lesson on how weak basic indicators (mean, standard deviation, correlation) can be, the interest and limits of box-plot with quartiles and whiskers, the benefit of the violin-plot.

Definitely worth a quick look.

djk4474y ago

NB: Post author here.

This is great! So fun...will have to use in the future...

1 more reply

doctorsher4y ago

This is excellent information, thank you for posting this! I was not familiar with this example previously, but it is a perfect example of summary statistics not capturing certain distributions well. It's very approachable, even if you had to limit the discussion to mean and variance alone. Bookmarked, and much appreciated.

djk4474y ago

NB: Post author here.

That's really nifty, wish I'd heard about it earlier. Might go back and add a link to it in the post at some point too! Very useful. Definitely know I wasn't breaking new ground or anything, but fun to see it represented so succinctly.

hoseja4y ago

>importance of graphing data before analyzing it

Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".

potatoman224y ago

Interesting, thanks for the concept. What's one to do for high dimensional data then?

tunesmith4y ago

I just recently tried giving a presentation to my department (they're developers, I'm architect) about this stuff and they all just sort of blinked at me. It brought in Little's Law and Kingman's Formula, in an attempt to underscore why we need to limit variation in the response times of our requests.

There are a bunch of queuing theory formulas that are really cool but don't exactly apply if your responses vary a lot like the article describes. I think the assumption is that response time distributions are exponential distributions, which I don't think is a good assumption (is an Erlang distribution an exponential distribution?) Nevertheless, hooking the equations up to some models is a good way to get directional intuition. I didn't realize how steep the performance drop-off is for server utilization until I started moving sliders around.

Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?

djk4474y ago

NB: Post author here.

I found it surprisingly difficult to explain well. Took a lot of passes and a lot more words than I was expecting. It seems like such a simple concept. I thought the post was gonna be the shortest of my recent ones, and then after really explaining it and getting lots of edits and rewriting, it was 7000 words and ...whoops! But I guess it's what I needed to explain it well (hope you thought so anyway).

It's somewhat exponential, but yeah, not necessarily, it's definitely long-tailed in some way and it sort of doesn't matter what the theoretical description is (at least in my mind) the point is these types of distributions really don't get described well by a lot of typical statistics.

Can't talk too much about SREs at the ad analytics company I mentioned, we were the backend team that wrote a lot of the backend stores / managed databases / ran the APIs and monitored this stuff a bit (and probably not all that well). It was a bit more ad hoc I guess, probably now that company is large enough they have a dedicated team for that...

tunesmith4y ago

The article hit the pocket pretty exactly for how I felt I needed to explain it. (Actually I had the same experience where I thought I would be able to go over it quickly and then I felt like I was getting mired.) The graphs look great too - I've been able to explore it pretty well with JupyterLab but I can't export that into something super-interactive that is read-only. I've thought about creating an idyll page out of it to help others explore but that's a bit overkill for the work style our team has.

I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.

1 more reply

bluesmoon4y ago

This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.

It's implemented in this Node library: https://github.com/bluesmoon/node-faststats

Side note: I wish everyone would stop using the term Average to refer to the Arithmetic mean. "Average" just means some statistic used to summarize a dataset. It could be the Arithmetic Mean, Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch of other statistics. We're stuck with AVG because that's the function used by early databases and Lotus 123.

JackFr4y ago

No we’re stuck with it because average was used colloquially for arithmetic mean for decades.

I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you’ve got only one number to do it, the arithmetic mean is for you.

jrochkind14y ago

I think it still depends on the nature of the data and your questions about it, and median is often the better choice if you have to pick only one.

1 more reply

bluesmoon4y ago

Yes, Lotus 123 came out 38 years ago :)

louisnow4y ago

```To calculate the 10th percentile, let’s say we have 10,000 values. We take all of the values, order them from largest to smallest, and identify the 1001st value (where 1000 or 10% of the values are below it), which will be our 10th percentile.```

Isn't this contradictory? If we order the values from largest to smallest and take the 1001st value, then 10 % of the values are above/larger and not below/smaller. I believe it should say order from smallest to largest.

djk4474y ago

NB: Post author here.

Oops, yep, that should probably be order from smallest to largest. Thanks for the correction!

djk4474y ago

Fixed!

emgo4y ago

Yes, this looks like a typo in the article. It should be smallest to largest.

LoriP4y ago

Thanks, we corrected this quite quickly. Appreciated!

tdiff4y ago

One of the difficulties when dealing with percentiles is estimating error of the estimated percentile values, which is not necessarily trivial compared to estimating error of mean approximation (e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6294150/)

satvikpendem4y ago

I often see HN articles crop up soon after a related post, in this case this Ask HN poster [0] being driven crazy by people averaging percentiles and them not seeing that it's a big deal. It's pretty funny to see such tuples of posts appearing.

https://news.ycombinator.com/item?id=28518795

ChuckMcM4y ago

This is a really important topic if you're doing web services. Especially if you're doing parallel processing services. When I was at Blekko the "interesting" queries were the ones above the 95th percentile because they always indicated "something" that hadn't worked according to plan. Sometimes it was a disk going bad on one of the bucket servers, sometimes it was a network port dropping packets, and sometimes it was a corrupted index file. But it was always something that needed to be looked at and then (usually) fixed.

It also always separated the 'good' Ad networks from the 'bad' ones as the bad ones would take to long to respond.

aidenn04y ago

Be careful translating percentiles of requests to percentiles of users; if less than 10% of your requests take over 1 second, but a typical user makes 10 requests, it's possible that the majority your users are seeing a request take over 1 second.

djk4474y ago

NB: Post author here.

Yep! Briefly noted that in the post, but deserves re-stating! it's definitely a more complex analysis to figure out the percentage of users affected (though often more important) could be majority could also be one user who has some data scientist programmatically making hundreds of long API calls for some task...(can you tell that I ran into that? Even worse it was one of our own data scientists ;) ).

axpy9064y ago

There’s something called a five number summary in statistics: mean, median, standard deviation, 25th percentile and 75th percentile.

The bonus is that the 75th - 50th gives you the interquartile range.

Mean is not a robust measure and as such you need to look at variety to truly understand the spread of your data.

10000truths4y ago

If you're going to use multiple quantities to summarize a distribution, wouldn't using percentiles for all of them give you the most information? The mean and standard deviation could then be estimated from that data.

bluesmoon4y ago

IQR is 75th - 25th, aka, the middle-50%

monkeydust4y ago

Ok this, box plots are a good way to visualize and show distribution esp to a not so stat heavy audience.

axpy9064y ago

You’re right I mistyped.

jonnydubowsky4y ago

I really enjoyed this post! The author also wrote an interactive demonstration of the concepts (using Desmos). It's super helpful.

https://www.desmos.com/calculator/ty3jt8ftgs

djk4474y ago

NB: Post author here.

Glad you liked it! I was so excited to actually be able to get to use Desmos for something work wise, I've been wanting to do it for years!

Lightbody4y ago

Whenever this topic comes up, I always encourage folks to watch this 2011 classic 15m talk at the Velocity conference by John Rauser:

https://www.youtube.com/watch?v=coNDCIMH8bk

varelaz4y ago

Percentiles for sure are better than average if you want to explore distribution: there are several percentiles comparing to single average. However median is harder to use if you want to do calculations based on this metric. For example distribution of sample median could be a problem, if you want to understand confidence interval for it for example.

djk4474y ago

NB: Post author here.

Totally can be true. In our case, we use these approximation methods that allow you to get multiple percentiles "for free" definitely need to choose the right ones for the job. (We talk a bit more about the whole approach where we do the aggregation then the accessor thing in the previous post on two-step aggregation [1]). But there are definitely times when averages/stddev and potentially the 3rd and 4th moments are more useful etc.

[1]: https://blog.timescale.com/blog/how-postgresql-aggregation-w...

achenatx4y ago

Ive been trying to get the marketing team to always include a std deviation with averages. Average alone is simply not useful, standard deviation is a simple way to essentially include percentiles.

They regularly compare experiments to the mean but dont use a T test to ensure the results are actually different from the mean.

doctorsher4y ago

I heavily caution against the feeling that "standard deviation is a simple way to essentially include percentiles." The usefulness of the standard deviation depends on the distributions that you are working with. Heavy tailed distributions appear a fair amount in practice, and the combo of summary statistics mentioned would not do well on those. Also, Madars' comment in this thread is a beautiful example of this: 4 completely different distributions, with identical mean and standard deviation (among other things). Histograms and percentiles, and if necessary their approximations, are more desirable for the above reasons.

slaymaker19074y ago

I assume most of the distributions a marketing department would be dealing with are generally normal in which case stddev is a great way to analyze the data. This can be easily verified by just plotting said data and making sure the tails don't look weird.

1 more reply

LoriP4y ago

One thing I like about this post is that it explains things in an accessible way before getting into a deep dive. Might be worth sharing with the marketing team as they'll "get" long tail in the context of web search, so the concept is fairly transferable to stuff that they would know about.

djk4474y ago

NB: Post author here.

Std deviation definitely helps a lot, still often not as good as percentiles, was actually thinking about adding some of that in the post but it was already getting so long. It's funny how things you think are simple sometimes take the most effort to explain, definitely found that on this one.

waynecochran4y ago

Yeah -- std deviation has a similar problem to the mean in that it doesn't give you a full picture unless the distribution is close to normal / gaussian.

1 more reply

fbinky4y ago

Thanks for the refresher! I am using the timescaledb-toolkit with much success. LTTB, for example. Excellent.

motohagiography4y ago

Recovering product manager in me sees 90th percentile queries with outlier high latency and starts asking instead of how to reduce it, how we can to spin it out into a dedicated premium query feature, as if they're willing to wait, they're probably also willing to pay.

Highly recommend modelling your solution using queueing theory with this: https://queueing-tool.readthedocs.io/en/latest/

As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.

zeteo4y ago

If you want to calculate exact percentiles there's a simple in-place algorithm that runs in expected O(n) time. You basically do quicksort but ignore the "wrong" partition in each recursive step. For instance if your pivot is at the 25% percentile and you're looking for the 10% percentile you only recurse on the "left" partition at that point. It's pretty easy to implement. (And rather straightforward to change to a loop, if necessary.)

solumos4y ago

looking at a histogram is probably the best thing you can do to understand how data is distributed

123pie1234y ago

I had to explain the data usage of an interface that looked extremely busy from the standard graphs

I did a percentile graph of the usage - the data was only typically using 5% of the maximum throughput, no-one could really understand the graph though

so I did a zoomed-in version of the normal data usage graph and it looked like a blip lasting 1/20 of the time - everyone got that - eg it was peaking every few seconds and then doing nothing for ages

datavirtue4y ago

Can someone bake this down to a sentence? I think I understand what they are saying since I have been faced with using these metrics and in having been tempted to use average response times I recognized that average is not a good baseline since it moves in relation to the outliers (which are usually the problem requests and there can be many or few).

How do you calculate these percentiles and use them as a trigger for alerts?

lewispb4y ago

Side note, but I love the animations, code snippet design and typography in this blog post.

Will think about how I can improve my own blog with these ideas.

djk4474y ago

Thank you! Huge shout out to Shane, Jacob and others on our team who helped with the graphics / design elements!

ableal4y ago

Just as a note, same topic: https://news.ycombinator.com/item?id=19552112

""" Averages Can Be Misleading: Try a Percentile (2014) (elastic.co) 199 points by donbox on April 2, 2019 | | 55 comments """

jpgvm4y ago

If you are doing this sort of work I highly recommend the Datasketches library. It's used in Druid, which is a database specifically designed for these sorts of aggregations over ridiculously large datasets.

jaygreco4y ago

FYI, typo:

> “In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.”

But the half on the right is actually shaded yellow.

LoriP4y ago

Thanks for the feedback we've made the change. Appreciated!

djk4474y ago

Thanks! Will get that fixed!

airstrike4y ago

Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...

Bostonian4y ago

If you think the data is normal-ish but want to account for skew and kurtosis, you can try fitting a distribution such as the skewed Student's t -- there are R packages for this.

camel_gopher4y ago

Toss the percentiles to the curb and just use a histogram.

k__4y ago

Shouldn't averages&variance be enough for start?

pachico4y ago

Surprisingly, many software engineers I know never used percentiles and keep using mean average. True story.

yongjik4y ago

Bah, I'll be happy if I could even get correct averages. I see pipelines getting value X1 from a server that served 100 requests, another value X2 from a server that served one request, and then it returns (X1+X2)/2.

mschuetz4y ago

The mean is something you can easily compute progressively and with trivial resources. Median and percentiles, on the other hand, can be super expensive and potentially unsuitable for some real-time applications, since you need to maintain a sorted list of all relevant samples.

groaner4y ago

Not surprising, because computing mean is O(n) and median is O(n log n).

Lack of resources or pure laziness doesn't make it the right measure to use though.

gpderetta4y ago

Introselect is O(n), right?

ekianjo4y ago

Is there anyone who still uses averages ?

tiffanyh4y ago

Statistics 101. Mean, median and mode.

deft4y ago

Looks like timescale did a big marketing push this morning only for their whole service to go down minutes later. lol.

mherdeg4y ago

I've skimmed some of the literature here when I've spent time trying to help people with their bucket boundaries for Prometheus-style instrumentation of things denominated in "seconds", such as processing time and freshness.

My use case is a little different from what's described here or in a lot of the literature. Some of the differences:

(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.

(2) Your maximum number of buckets is pretty small -- like, no more than 10 or 15 histogram buckets probably. This is because my metrics are very high cardinality (my times get recorded alongside other dimensions that may have 5-100 distinct values, things like server instance number, method name, client name, or response status).

(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.

(4) I think I know what values I care about knowing precisely! Sometimes people call my metrics "SLIs" and sometimes they even set an "SLO" which says, say, I want no more than 0.1% of interactions to take more than 500ms. (Yes, those people say, we have accepted that this means that 0.1% of people may have an unbounded bad experience.) So, okay, fine, let's force a bucket boundary at 500ms and then we'll always be measuring that SLO with no error.

(5) I know that the test data I use as input don't always reflect how the system will behave over time. For example I might feed my bucket-designing algorithm yesterday's freshness data and that might have been a day when our async data processing pipeline was never more than 10 minutes backlogged. But in fact in the real world every few months we get a >8 hour backlog and it turns out we'd like to be able to accurately measure the p99 age of processed messages even if they are very old... So despite our very limited bucket budget we probably do want some buckets at 1, 2, 4, 8, 16 hours, even if at design time they seem useless.

I have always ended up hand-writing my own error approximation function which takes as input like

(1) sample data - a representative subset of the actual times observed in my system yesterday

(2) proposed buckets - a bundle of, say, 15 bucket boundaries

(3) percentiles I care about

then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.

Last time I looked at this I tried using libraries that purport to compute very good bucket boundaries but they give me, like, 1500 buckets with very nice tiny error, but no clear way to make real-world choice about collapsing this into a much smaller set of buckets with comparatively huge, but manageable, error.

I ended up just advising people to

* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does

* actually look at your data and understand the data's shape

* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seems to work well

* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)

slaymaker19074y ago

One thing you could try to use are variance optimal histograms. These are histograms which set bucket boundaries such that the weighted average variance in the buckets is minimized. Unfortunately, this algorithm is quadratic with the number of data points, but you could try approximating the optimum by taking random subsets of the data, computing the histogram, then seeing how well the histogram does on the whole dataset.

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...

djk4474y ago

NB: Post author here.

This is interesting and I totally get at least some of the problems you're facing. I wonder if you could take some of the strategies from t-digest and modify a bit to accomplish...I'd be interested in seeing some sort of implementation of this and would love to see if we can get it into our toolkit if you do...or you can also open up a ticket for us and we'll see if we can prioritize to work on something like it ourselves.

I do think there are some interesting ways you can cut corners if you know things about the "SLO" or other sort of cutoff values, but I'd have to think more deeply about it to say more. Essentially we want a variable error rate based on the distance away from a known value, where you have little error in the values relatively close to the known value, care little if at all for fine distinctions on the low end and, once you get past some high end value you could probably bucket everything above it into a "large outliers" bucket or something too...meaning you p999 could get out of hand if you start getting lots of stuff above a threshold, but, if that starts happening, probably everything's already burning down, so might not be that useful to know it's burning down in a very specific way...

wikibob4y ago

Are you following the work on sparse high resolution histograms in both Prometheus and OpenTelemetry?

It's coming together and will be available soon.

What this means is that you won't have to compromise and pick histogram bucket boundaries anymore. And each bucket will be much narrower.

convolvatron4y ago

look at Greenwald-Khanna (and the followon work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log-epsilon I think).

with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.

ruchin_k4y ago

Spent several years in venture capital investing and averages were always misleading - as Nassim Taleb says "Never cross a river that is on average 4 feet deep"

robbrown4514y ago

Median/50 percentile isn't a whole lot better in that case.

j / k navigate · click thread line to collapse

163 comments

bqe4y ago

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

cyral4y ago

There are some newer data structures that take this to the next level such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%)

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest

djk4474y ago

NB: Post author here.

[1]: https://arxiv.org/abs/1908.10693

2 more replies

jeremysalwen4y ago

1 more reply

riskneutral4y ago

Also relevant: Single-Pass Online Statistics Algorithms

[1] http://www.numericalexpert.com/articles/single_pass_stat/

breuleux4y ago

That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?

2 more replies

fotta4y ago

yep I had to implement t-digest in a monitoring library. another alternative (although older) that the prometheus libraries use is CKMS quantiles [0].

[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...

convolvatron4y ago

1 more reply

tantalor4y ago

> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.

rendaw4y ago

> without sorting it... have a fixed size sorted buffer

(that you sort yourself)

1 more reply

djk4474y ago

NB: Post author here.

Cd00d4y ago

Did the candidate get an offer? Genuinely curious.

I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.

williamdclt4y ago

bo10244y ago

> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.

eru4y ago

> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use eg QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need linear amount of storage.

Izikiel434y ago

> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjuction?

ilikebits4y ago

The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".

1 more reply

tantalor4y ago

No, for 2 reasons,

1. huge data set: heaps requires storing the whole set, but "huge" means "more than you can store"

2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules

2 more replies

hnfong4y ago

The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.

5faulker4y ago

It's funny that this is often left out from a data & algorithm class.

2 more replies

ZephyrBlu4y ago

How does this get you the median?

abnry4y ago

One way to think about why we tend to use averages instead of medians is that it is related to a really deep theorem in probability: The Central Limit Theorem.

But in many ways, the median and percentiles are a more relevant and intuitive number to think about. Especially in contexts where linearity is inappropriate!

a-dub4y ago

i think of it as: if the data is gaussian, use a mean, otherwise go non-parametric (medians/percentiles).

or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.

this shows up in things from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.

lanstin4y ago

2 more replies

feoren4y ago

> the median and percentiles are a more relevant and intuitive number to think about

The correct impossible statement would be if more than half of drivers are better than the median driver.

nancarrow4y ago

It's more related to the law of large numbers than the CLT

rahimnathwani4y ago

"Why Averages Suck and Percentiles are Great": https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...

contravariant4y ago

It's indeed always worth pointing that a mean may or may not exist. Same with variances/standard-deviation.

The central limit theorem seems to have given people the slightly wrong idea that things will always average out eventually.

maxnoe4y ago

Coughs Cauchy

djk4474y ago

totally. that blog was also one of the sources I mentioned in the post! Good stuff

NB: Post author here.

hnuser1234564y ago

tzs4y ago

wruza4y ago

1 more reply

barrenko4y ago

Presumably, also the reason we have those fake queue lines at airports (not sure the correct word is for it).

1 more reply

djk4474y ago

NB: Post author here.

Love this example. Might have to use that in a future post. Feel like a lot of us are running into a similar thing with remote work and video calls these days...

elmalto4y ago

Similarly, I would rather have no wifi than bad/spotty wifi for this reason

account424y ago

eric-hu4y ago

slaymaker19074y ago

exolymph4y ago

"Slow is smooth and smooth is fast"

kazinator4y ago

Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions.

I would rather have a rock-steady consistent 500 ms reponse time when typing text, than to have a 100 ms average response time which randomly spikes to outliers that go past one second.

spiderice4y ago

> Gamers do not have a more intuitive sense for this than movie watchers, video/voice talkers, not to mention users who type text into bloated web browsers or lagged remote login sessions

I would be shocked if gamers didn't on average have a more intuitive sense of this than any of the groups you mentioned.

1 more reply

hnuser1234564y ago

Maybe not all, but ones who create and consume analysis like this

https://www.techpowerup.com/review/msi-geforce-rtx-3090-gami...

michaelmdresser4y ago

[1]: https://www.infoq.com/presentations/latency-response-time/

lanstin4y ago

I watch this video once a year and send it to my co-workers whenever averages or medians shows up in a graph for public consumption.

peheje4y ago

Are the points written in a readable format anywhere?

1 more reply

rattray4y ago

What are some of the ways percentiles can be abused?

alexanderdmitri4y ago

# Power Statement

!*Plowser*!* is in the 99th percentile of browsers by global user count!

# Fact Sheet

- !*Plowser*! is used by 2,050 people

- sample size of browsers is 10,000 (this includes toy apps on GitHub and almost all browsers in the sample have no record of active users)

- if we place the number of global browser users at a conservative 1,000,000,000, !*Plowser*! actually has 0.00000205% of users

madars4y ago

Good opportunity to plug https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't know much about the underlying distribution, simple statistics don't describe it well.

pdpi4y ago

There's a fun paper by Autodesk where they make datasets that look whatever way you want them to.

https://www.autodesk.com/research/publications/same-stats-di...

s_gourichon4y ago

Definitely worth a quick look.

djk4474y ago

NB: Post author here.

This is great! So fun...will have to use in the future...

1 more reply

doctorsher4y ago

djk4474y ago

NB: Post author here.

hoseja4y ago

>importance of graphing data before analyzing it

Very discouraging if one is trying to analyze data algorithmically. Often when faced with a problem in statistics, the answer is: "Look at the graph and use intuition!".

potatoman224y ago

Interesting, thanks for the concept. What's one to do for high dimensional data then?

tunesmith4y ago

Our ops team doesn't really follow this either. We're not a huge department though - is this the kind of stuff that an SRE team usually pays attention to?

djk4474y ago

NB: Post author here.

tunesmith4y ago

I think the weirdly-shaped long-tail graphs we come across are just sums of more naturally-distributed response times, for different types of responses. Another reason to limit variation I think.

1 more reply

bluesmoon4y ago

This is exactly the algorithm we developed at LogNormal (now part of Akamai) 10 years ago for doing fast, low-memory percentiles on large datasets.

It's implemented in this Node library: https://github.com/bluesmoon/node-faststats

JackFr4y ago

No we’re stuck with it because average was used colloquially for arithmetic mean for decades.

I wish people would stop bad-mouthing the arithmetic mean. If you have to convey information about a distribution and you’ve got only one number to do it, the arithmetic mean is for you.

jrochkind14y ago

I think it still depends on the nature of the data and your questions about it, and median is often the better choice if you have to pick only one.

1 more reply

bluesmoon4y ago

Yes, Lotus 123 came out 38 years ago :)

louisnow4y ago

djk4474y ago

NB: Post author here.

Oops, yep, that should probably be order from smallest to largest. Thanks for the correction!

djk4474y ago

Fixed!

emgo4y ago

Yes, this looks like a typo in the article. It should be smallest to largest.

LoriP4y ago

Thanks, we corrected this quite quickly. Appreciated!

tdiff4y ago

satvikpendem4y ago

https://news.ycombinator.com/item?id=28518795

ChuckMcM4y ago

It also always separated the 'good' Ad networks from the 'bad' ones as the bad ones would take to long to respond.

aidenn04y ago

djk4474y ago

NB: Post author here.

axpy9064y ago

There’s something called a five number summary in statistics: mean, median, standard deviation, 25th percentile and 75th percentile.

The bonus is that the 75th - 50th gives you the interquartile range.

Mean is not a robust measure and as such you need to look at variety to truly understand the spread of your data.

10000truths4y ago

bluesmoon4y ago

IQR is 75th - 25th, aka, the middle-50%

monkeydust4y ago

Ok this, box plots are a good way to visualize and show distribution esp to a not so stat heavy audience.

axpy9064y ago

You’re right I mistyped.

jonnydubowsky4y ago

I really enjoyed this post! The author also wrote an interactive demonstration of the concepts (using Desmos). It's super helpful.

https://www.desmos.com/calculator/ty3jt8ftgs

djk4474y ago

NB: Post author here.

Glad you liked it! I was so excited to actually be able to get to use Desmos for something work wise, I've been wanting to do it for years!

Lightbody4y ago

Whenever this topic comes up, I always encourage folks to watch this 2011 classic 15m talk at the Velocity conference by John Rauser:

https://www.youtube.com/watch?v=coNDCIMH8bk

varelaz4y ago

djk4474y ago

NB: Post author here.

[1]: https://blog.timescale.com/blog/how-postgresql-aggregation-w...

achenatx4y ago

Ive been trying to get the marketing team to always include a std deviation with averages. Average alone is simply not useful, standard deviation is a simple way to essentially include percentiles.

They regularly compare experiments to the mean but dont use a T test to ensure the results are actually different from the mean.

doctorsher4y ago

slaymaker19074y ago

1 more reply

LoriP4y ago

djk4474y ago

NB: Post author here.

waynecochran4y ago

Yeah -- std deviation has a similar problem to the mean in that it doesn't give you a full picture unless the distribution is close to normal / gaussian.

1 more reply

fbinky4y ago

Thanks for the refresher! I am using the timescaledb-toolkit with much success. LTTB, for example. Excellent.

motohagiography4y ago

Highly recommend modelling your solution using queueing theory with this: https://queueing-tool.readthedocs.io/en/latest/

As an exercise in discovering the economics of how your platform works, even just thinking about it in these terms can save a great deal of time.

zeteo4y ago

solumos4y ago

looking at a histogram is probably the best thing you can do to understand how data is distributed

123pie1234y ago

I had to explain the data usage of an interface that looked extremely busy from the standard graphs

I did a percentile graph of the usage - the data was only typically using 5% of the maximum throughput, no-one could really understand the graph though

datavirtue4y ago

How do you calculate these percentiles and use them as a trigger for alerts?

lewispb4y ago

Side note, but I love the animations, code snippet design and typography in this blog post.

Will think about how I can improve my own blog with these ideas.

djk4474y ago

Thank you! Huge shout out to Shane, Jacob and others on our team who helped with the graphics / design elements!

ableal4y ago

Just as a note, same topic: https://news.ycombinator.com/item?id=19552112

""" Averages Can Be Misleading: Try a Percentile (2014) (elastic.co) 199 points by donbox on April 2, 2019 | | 55 comments """

jpgvm4y ago

jaygreco4y ago

FYI, typo:

> “In the below graph, half of the data is to the left (shaded in blue), and a half is to the right (shaded in purple), with the 50th percentile directly in the center.”

But the half on the right is actually shaded yellow.

LoriP4y ago

Thanks for the feedback we've made the change. Appreciated!

djk4474y ago

Thanks! Will get that fixed!

airstrike4y ago

Sorry for the minor nitpick, but I find it a bit unusual (disappointing?) that there's an image of a candlestick chart at the very top, but the article only uses API response times as examples...

Bostonian4y ago

If you think the data is normal-ish but want to account for skew and kurtosis, you can try fitting a distribution such as the skewed Student's t -- there are R packages for this.

camel_gopher4y ago

Toss the percentiles to the curb and just use a histogram.

k__4y ago

Shouldn't averages&variance be enough for start?

pachico4y ago

Surprisingly, many software engineers I know never used percentiles and keep using mean average. True story.

yongjik4y ago

mschuetz4y ago

groaner4y ago

Not surprising, because computing mean is O(n) and median is O(n log n).

Lack of resources or pure laziness doesn't make it the right measure to use though.

gpderetta4y ago

Introselect is O(n), right?

ekianjo4y ago

Is there anyone who still uses averages ?

tiffanyh4y ago

Statistics 101. Mean, median and mode.

deft4y ago

Looks like timescale did a big marketing push this morning only for their whole service to go down minutes later. lol.

mherdeg4y ago

My use case is a little different from what's described here or in a lot of the literature. Some of the differences:

(1) You have to pre-decide on bucket values, often hardcoded or stored in code-like places, and realistically won't bother to update them often unless the data look unusably noisy.

(3) I think I know what percentiles I care about -- I'm particularly interested in minimizing error for, say, p50, p95, p99, p999 values and don't care too much about others.

I have always ended up hand-writing my own error approximation function which takes as input like

(1) sample data - a representative subset of the actual times observed in my system yesterday

(2) proposed buckets - a bundle of, say, 15 bucket boundaries

(3) percentiles I care about

then returns as output info about how far off (%age error) each estimated percentile is from the actual value for my sample data.

I ended up just advising people to

* set bucket boundaries at SLO boundaries, and be sure to update when the SLO does

* actually look at your data and understand the data's shape

* minimize error for the data set you have now; logarithmic bucket sizes with extra buckets near the distribution's current median value seems to work well

* minimize worst-case error if the things you're measuring grow very small or very large and you care about being able to observe that (add extra buckets)

slaymaker19074y ago

http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...

djk4474y ago

NB: Post author here.

wikibob4y ago

Are you following the work on sparse high resolution histograms in both Prometheus and OpenTelemetry?

It's coming together and will be available soon.

What this means is that you won't have to compromise and pick histogram bucket boundaries anymore. And each bucket will be much narrower.

convolvatron4y ago

look at Greenwald-Khanna (and the followon work). It adapts the bucket size to minimize the total error (and the number of buckets is proportional to log-epsilon I think).

with all the competition and 'innovation', you would think the data dogs of this world would grow up beyond time series.

ruchin_k4y ago

Spent several years in venture capital investing and averages were always misleading - as Nassim Taleb says "Never cross a river that is on average 4 feet deep"

robbrown4514y ago

Median/50 percentile isn't a whole lot better in that case.

j / k navigate · click thread line to collapse