My Favorite Algorithm: Linear Time Median Finding (2018) (opens in new tab)

(rcoh.me)

371 pointsskanderbm1y ago181 comments

181 comments

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

thanatropism1y ago

Is any of those easily modifiable to return the arg-median (the index which has the median).

cfors1y ago

Just wanted to say thank you for this article - I've read and shared this a few times over the years!

rented_mule1y ago

10-15 years ago, I found myself needing to regularly find the median of many billions of values, each parsed out of a multi-kilobyte log entry. MapReduce was what we were using for processing large amounts of data at the time. With MapReduce over that much data, you don't just want linear time, but ideally single pass, distributed across machines. Subsequent passes over much smaller amounts of data are fine.

It was a struggle until I figured out that knowledge of the precision and range of our data helped. These were timings, expressed in integer milliseconds. So they were non-negative, and I knew the 90th percentile was well under a second.

As the article mentions, finding a median typically involves something akin to sorting. With the above knowledge, bucket sort becomes available, with a slight tweak in my case. Even if the samples were floating point, the same approach could be used as long as an integer (or even fixed point) approximation that is very close to the true median is good enough, again assuming a known, relatively small range.

The idea is to build a dictionary where the keys are the timings in integer milliseconds and the values are a count of the keys' appearance in the data, i.e., a histogram of timings. The maximum timing isn't known, so to ensure the size of the dictionary doesn't get out of control, use the knowledge that the 90th percentile is well under a second and count everything over, say, 999ms in the 999ms bin. Then the dictionary will be limited to 2000 integers (keys in the range 0-999 and corresponding values) - this is the part that is different from an ordinary bucket sort. All of that is trivial to do in a single pass, even when distributed with MapReduce. Then it's easy to get the median from that dictionary / histogram.

justinpombrio1y ago

Did you actually need to find the true median of billions of values? Or would finding a value between 49.9% and 50.1% suffice? Because the latter is much easier: sample 10,000 elements uniformly at random and take their median.

(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)

rented_mule1y ago

The kind of margin you indicate would have been plenty for our use cases. But, we were already processing all these log entries for multiple other purposes in a single pass (not one pass per thing computed). With this single pass approach, the median calculation could happen with the same single-pass parsing of the logs (they were JSON and that parsing was most of our cost), roughly for free.

Uniform sampling also wasn't obviously simple, at least to me. There were thousands of log files involved, coming from hundreds of computers. Any single log file only had timings from a single computer. What kind of bias would be introduced by different approaches to distributing those log files to a cluster for the median calculation? Once the solution outlined in the previous comment was identified, that seemed simpler that trying to understand if we were talking about 49-51% or 40-50%. And if it was too big a margin, restructuring our infra to allow different log file distribution algorithms would have been far more complicated.

jiggawatts1y ago

Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"

3 more replies

sroussey1y ago

Actually, seeking the bias numbers can be quite illuminating.

1 more reply

enriquto1y ago

> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

I don't see how could that possibly be true... For example, if your original points are sampled from two gaussians of centers -100 and 100, of small but slightly different variance, then the true median can be anywhere between the two centers, and you may need a humungous number of samples to get anywhere close to it.

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.

B1FF_PSUVM1y ago

> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.

rhymer1y ago

Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.

[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.

1 more reply

maronato1y ago

the key word is “uniformly”. If your data is not uniformly distributed, then you just have to pick random values non-uniformly. There are many ways to do that, and once you have your algo you’ll be able to reliably find an approximation of the median much faster than you would find the actual median.

https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...

justinpombrio1y ago

I wasn't saying that you could get within 1% of the true median, I was saying you could find an element in the 49th to 51st percentile. In your example, the 49th percentile would be -90 and the 51st percentile would be 90, so the guarantee is that you would (with very high probability) find a number between -90 and 90.

That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.

andruby1y ago

I was thinking the same thing.

In all use-cases I've seen a close estimate of the median was enough.

hhmc1y ago

You also can use the fact that for any distribution, the median is never further than 1SD away from the mean.

digaozao1y ago

I am not sure. But from the outside, it looks like what Prometheus does behind the scenes. It seems to me that Prometheus works like that because it has a limit on latency time around 10s on some systems I worked. So when we had requests above that limit it got all on 10s, even though it could be higher than that. Interesting.

Filligree1y ago

Was this by any chance for generating availability metrics, and were you an intern at the time? The system sounds, ah, very familiar.

rented_mule1y ago

The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)

ant6n1y ago

I’m not sure why you use a dictionary with keys 0…999, instead of an array indexed 0…999.

rented_mule1y ago

I was using the term dictionary for illustration purposes. Remember, this was all in the context of MapReduce. Computation within MapReduce is built around grouping values by keys, which makes dictionaries a natural way to think about many MapReduce oriented algorithms, at least for me. The key/value pairs appear as streams of two-tuples, not as dictionaries or arrays.

tomrod1y ago

That's just a dict/map with less flexibility on the keys :D

ashton3141y ago

Where were you working? Sounds like you got lucky to work on some fun problems!

rented_mule1y ago

Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.

But, you're right, I was lucky to work on a bunch of fun problems. That period, in particular, was pretty amazing. I was part of a fun, collaborative team working on hard problems. And management showed a lot of trust in us. We came up with some very interesting solutions, some by skill and some by luck, that set the foundation for years of growth that came after that (both revenue growth and technical platform growth).

ashton3141y ago

> I'm trying to keep this account relatively anonymous

I totally respect that—don't let me pressure you any further.

Yeah, sounds like you had the ideal work setup: cool problem requiring cool technical solutions with good management. That's wonderful!

xinok1y ago

> P.S: In 2017 a new paper came out that actually makes the median-of-medians approach competitive with other selection algorithms. Thanks to the paper’s author, Andrei Alexandrescu for bringing it to my attention!

He also gave a talk about his algorithm in 2016. He's an entertaining presenter, I highly recommended!

There's Treasure Everywhere - Andrei Alexandrescu

https://www.youtube.com/watch?v=fd1_Miy1Clg

_yid91y ago

Andrei Alexandrescu is awesome; around 2000 he gave on talk on lock-free wait-free algorithms that I immediately applied to a huge C++ industrial control networking project at the time.

I'd recommend anyone who writes software listening and reading anything of Andrei's you can find; this one is indeed a Treasure!

fasa991y ago

that's wild, a bit of a polymath by computer science standards. I know him from template metaprogramming fame and here he is shifting from programming languages to algorithms

mabbo1y ago

I learned about the median-of-medians quickselect algorithm when I was an undergrad and was really impressed by it. I implemented it, and it was terribly slow. It's runtime grew linearly, but that only really mattered if you had at least a few billion items in your list.

I was chatting about this with a grad student friend who casually said something like "Sure, it's slow, but what really matters is that it proves that it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible. Now that we do, we know there might an even faster linear algorithm." Really got into the philosophy of what Computer Science is about in the first place.

The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.

zelphirkalt1y ago

Does the fact, that any linear time algorithm exist, indicate, that a faster linear time algorithm exists? Otherwise, what is the gain from that bit of knowledge? You could also think: "We already know, that some <arbitrary O(...)> algorithm exists, there might be an even faster <other O(...)> algorithm!" What makes the existence of an O(n) algo give more indication, than the existence of an O(n log(n)) algorithm?

blt1y ago

I am not the original commenter, but I (and probably many CS students) have had similar moments of clarity. The key part for me isn't

> there might be an even faster linear algorithm,

but

> it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible.

For me, the moment of clarity was understanding that theoretical CS mainly cares about problems, not algorithms. Algorithms are tools to prove upper bounds on the complexity of problems. Lower bounds are equally important and cannot be proved by designing algorithms. We even see theorems of the form "there exists an O(whatever) algorithm for <problem>": the algorithm's existence can sometimes be proven non-constructively.

So if the median problem sat for a long time with a linear lower bound and superlinear upper bound, we might start to wonder if the problem has a superlinear lower bound, and spend our effort working on that instead. The existence of a linear-time algorithm immediately closes that path. The only remaining work is to tighten the constant factor. The community's effort can be focused.

A famous example is the linear programming problem. Klee and Minty proved an exponential worst case for the simplex algorithm, but not for linear programming itself. Later, Khachiyan proved that the ellipsoid algorithm was polynomial-time, but it had huge constant factors and was useless in practice. However, a few years later, Karmarkar gave an efficient polynomial-time algorithm. One can imagine how Khachiyan's work, although inefficient, could motivate a more intense focus on polynomial-time LP algorithms leading to Karmarkar's breakthrough.

anonymoushn1y ago

If you had two problems, and a linear time solution was known to exist for only one of them, I think it would be reasonable to say that it's more likely that a practical linear time solution exists for that one than for the other one.

mrguyorama1y ago

We studied (I believe) this algorithm in my senior year of Computer Science. We talked about the theory side of it that you mention, but this algorithm was also used to demonstrate that "slow linear algorithm" is not faster than "Fast nlogn algorithm" in most real life cases.

I think we got a constant factor of 22 for this algorithm so maybe it was a related one or something.

kwantam1y ago

One of the fun things about the median-of-medians algorithm is its completely star-studded author list.

Manuel Blum - Turing award winner in 1995

Robert Floyd - Turing award winner in 1978

Ron Rivest - Turing award winner in 2002

Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)

Vaughan Pratt - oh no, the only non-Turing award winner in the list. Oh right but he's emeritus faculty at Stanford, directed the SUN project before it became Sun Microsystems, was instrumental in Sun's early days (director of research and designer of the Sun logo!), and is responsible for all kinds of other awesome stuff (near and dear to me: Pratt certificates of primality).

Four independent Turing awards! SPARCstations! This paper has it all.

jiggawatts1y ago

Job interview question for an entry-level front end developer: "Reproduce the work of four Turing award winners in the next thirty minutes. You have a dirty whiteboard and a dry pen. Your time begins... now."

ted_dunning1y ago

And if you really want to impress, you reach into your pack and pull out the pens you carry just in case you run into dry pens at a critical moment.

jiggawatts1y ago

I have actually done this. I bought myself a pack of premium pens at an art store and kept them in my laptop bag for customer site visits. They were designed for writing on glass but also worked great on whiteboards. It’s fun when your diagrams just look better than anyone else’s. It’s like a report document with good typography. It can be filled with made up BS but it still impresses subconsciously.

Munksgaard1y ago

Here's a direct link for anyone who, like me, would be interested in reading the original article: https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf

That's an impressive list of authors, for sure.

praptak1y ago

Some other awesome stuff by Pratt:

Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.

someplaceguy1y ago

    return l[len(l) / 2]

I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?

I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.

Perhaps this could be forgiven if you're a Python novice and hadn't realized that the two different operators exist, but this is not the case here, as the article contains this even more baffling code which uses integer division in one branch but float division in the other:

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
                           quickselect(l, len(l) / 2, pivot_fn))

That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.

jononor1y ago

Well spotted! In Python 2 there was only one operator, but in Python 3 they are distinct. Indexing an array with a float raises an exception, I believe.

runeblaze1y ago

I do agree that it is a code smell. However given that this is an algorithms article I don't think it is exactly that fair to judge it based on code quality. I think of it as: instead of writing it in pseudocode the author chose a real pseudocode-like programming language, and it (presumably) runs well for illustrative purposes.

TacticalCoder1y ago

I really enjoyed TFA but this:

> Technically, you could get extremely unlucky: at each step, you could pick the largest element as your pivot. Each step would only remove one element from the list and you’d actually have O(n2) performance instead of O(n)

If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen. If the data is really too big to shuffle, then only shuffle once a bucket is small enough to be shuffled.

If you do shuffle, probabilities are here to guarantee that that worst case cannot happen. If anyone says that "technically" it can happen, I'll answer that then "technically" an attacker could also guess correctly every bit of your 256 bits private key.

Our world is build on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.

From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.

bo10241y ago

You're already using your own randomness to pick the pivot at random, so I don't see why the shuffle helps more. But yes, if your randomness is trustworthy, the probability of more than O(n) runtime is very low.

Reubend1y ago

> If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen.

It doesn't guarantee that you avoid the worst case, it just removes the possibility of forcing the worst case.

furstenheim1y ago

Floyd Ryvest also does the job . A bit more efficient IIRC.

However I never managed to understand how it works.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...

throwaway2945311y ago

If you're selecting the n:th element, where n is very small (or large), using median-of-medians may not be the best choice.

Instead, you can use a biased pivot as in [1] or something I call "j:th of k:th". Floyd-Rivest can also speed things up. I have a hobby project that gets 1.2-2.0x throughput when compared to a well implemented quickselect, see: https://github.com/koskinev/turboselect

If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.

[1] https://doi.org/10.4230/LIPIcs.SEA.2017.24

mgaunard1y ago

You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.

cosmic_quanta1y ago

That is cool if you can tolerate approximations. But the uncomfortable questions soon arise: Can I tolerate an approximate calculation? What assumptions about my data do I to determine an error bound? How to verify the validity of my assumptions on an ongoing basis?

Personally I would gravitate towards the quickselect algorithm described in the OP until I was forced to consider a streaming median approximation method.

skrtskrt1y ago

A use case for something like this, and not just for medians, is where you have a querying/investigation UI like Grafana or Datadog or whatever.

You write the query and the UI knows you're querying metric xyz_inflight_requests, it runs a preflight check to get the cardinality of that metric, and gives you a prompt: "xyz_inflight_requests is a high-cardinality metric, this query may take some time - consider using estimated_median instead of median".

conradludgate1y ago

Well, I believe you could use the streaming algorithms to pick the likely median, so help choose the pivot for the real quickselect. quickselect can be done inplace too which is O(1) memory if you can afford to rearrange the data.

SkiFire131y ago

Then you don't get a guaranteed O(n) complexity if the approximated algorithm happen to make bad choices

mgaunard1y ago

say you want the median value since the beginning of the day for a time series that has 1000 entries per second, that's potentially hundreds of gigabytes in RAM, or just a few variables if you use a p-square estimator.

mgaunard1y ago

any approximation gives a bound on the error.

sevensor1y ago

I’ve definitely had situations where a streaming quantile algorithm would have been useful, do you have any references?

fanf21y ago

There are two kinds:

- quantile sketches, such as t-digest, which aim to control the quantile error or rank error. Apache DataSketches has several examples, https://datasketches.apache.org/docs/Quantiles/QuantilesOver...

- histograms, such as my hg64, or hdr histograms, or ddsketch. These control the value error, and are generally easier to understand and faster than quantile sketches. https://dotat.at/@/2022-10-12-histogram.html

ssfrr1y ago

Do these both assume the quantile is stationary, or are they also applicable in tracking a rolling quantile (aka quantile filtering)? Below I gave an algorithm I’ve used for quantile filtering, but that’s a somewhat different problem than streaming single-pass estimation of a stationary quantile.

1 more reply

throwaway_24941y ago

See also the Greenwald Khanna quantile estimator, an online algorithm which can compute any quantile within a given ϵ.

https://aakinshin.net/posts/greenwald-khanna-quantile-estima...

1 more reply

sevensor1y ago

Awesome, you did one! I’ll give it a read.

ssfrr1y ago

Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].

You keep an estimate for the current quantile value, and then for each element in your stream, you either increment (if the element is greater than your estimate) or decrement (if the element is less than your estimate) by fixed "up -step" and "down-step" amounts. If your increment and decrement steps are equal, you should converge to the median. If you shift the ratio of increment and decrement steps you can estimate any quantile.

For example, say that your increment step is 0.05 and your decrement step is 0.95. When your estimate converges to a steady state, then you must be hitting greater values 95% of the time and lesser values 5% of the time, hence you've estimated the 95th percentile.

The tricky bit is choosing the overall scale of your steps. If your steps are very small relative to the scale of your values, it will converge very slowly and not track changes in the stream. You don't want your steps to be too large because they determine the precision. The FAME algorithm has a step where you shrink your step size when your data value is near your estimate (causing the step size to auto-scale down).

[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...

[2]: https://stats.stackexchange.com/a/445714

ted_dunning1y ago

The state of the art has moved well beyond these algorithms. See these

https://github.com/tdunning/t-digest https://www.sciencedirect.com/science/article/pii/S266596382... https://arxiv.org/pdf/2102.09299

And, as I mentioned else-thread, exponential histograms are the best choice in almost all practical situations.

sevensor1y ago

I like how straightforward this one is! It’s fast, it’s obvious, and it’s good enough, if you know something about the data ahead of time. I would have reached for this about four years ago, if I’d known about it.

fzy951y ago

"Further analysis of the remedian algorithm" https://www.sciencedirect.com/science/article/pii/S030439751...

This one has a streaming variant.

sevensor1y ago

Thanks!!

anonymoushn1y ago

One commonly sees the implication that radix sort cannot be used for data types other than integers, or for composite data types, or for large data types. For example, TFA says you could use radix sort if your input is 32-bit integers. But you can use it on anything. You can use radix sort to sort strings in O(n) time.

contravariant1y ago

It should also be noted that radix sort is ridiculously fast because it just scans linearly through the list each time.

It's actually hard to come up with something that cannot be sorted lexicographically. The best example I was able to find was big fractions. Though even then you could write them as continued fractions and sort those lexicographically (would be a bit trickier than strings).

anonymoushn1y ago

Sorting fractions by numerical value is a good example. Previously I've heard that there are some standard collation schemes for some human languages that resist radix sort, but when I asked about which ones in specific I didn't hear back :(

contravariant1y ago

The Unicode Collation algorithm doesn't look fun to implement in radix sort, but not entirely impossible either. They do note that some characters are contextual, an example they give is that CH can be treated as a single character that sorts after C (so also after CZ). Technically that is still lexicographical, but not byte-for-byte.

exDM691y ago

Problem with radix sorting strings is that it is O(k*N) where k is length of key, in this case it's the second longest string's length. Additional problems arise if you are dealing with null terminated strings and do not have the length stored.

Radix sort is awesome if k is small, N is huge and/or you are using a GPU. On a CPU, comparison based sorting is faster in most cases.

anonymoushn1y ago

No, it's O(N+M) where N is the number of strings and M is the sum of the lengths of the strings. Maybe your radix sort has some problems?

I evaluated various sorts for strings as part of my winning submission to https://easyperf.net/blog/2022/05/28/Performance-analysis-an... and found https://github.com/bingmann/parallel-string-sorting to be helpful. For a single core, the fastest implementation among those in parallel-string-sorting was a radix sort, so my submission included a radix sort based on that one.

The only other contender was multi-key quicksort, which is notably not a comparison sort (i.e. a general-purpose string comparison function is not used as a subroutine of multi-key quicksort). In either case, you end up operating on something like an array of structs containing a pointer to the string, an integer offset into the string, and a few cached bytes from the string, and in either case I don't really know what problems you expect to have if you're dealing with null-terminated strings.

A very similar similar radix sort is included in https://github.com/alichraghi/zort which includes some benchmarks, but I haven't done the work to have it work on strings or arbitrary structs.

Aardwolf1y ago

> No, it's O(N+M) where N is the number of strings and M is the sum of the lengths of the strings.

That would mean it's possible to sort N random 64-bit integers in O(N+M) which is just O(N) with a constant factor of 9 (if taking the length in bytes) or 65 (if taking the length in bits), so sort billions of random integers in linear time, is that truly right?

EDIT: I think it does make sense, M is length*N, and in scenarios where this matters this length will be in the order of log(N) anyway so it's still NlogN-sh.

2 more replies

dataflow1y ago

If you have a million 1-character strings and one string of length 1 million, how many steps would your LSD radix sort take? And (if it's indeed linear in the total input size like you say) how do you make it jump over the empty slots without losing real-world efficiency in other cases?

1 more reply

ncruces1y ago

An implementation in Go, that's (hopefully) simple enough to be understandable, yet minimally practical:

https://github.com/ncruces/sort/blob/main/quick/quick.go

Xcelerate1y ago

I received a variant of this problem as an interview question a few months ago. Except the linear time approach would not have worked here, since the list contains trillions of numbers, you only have sequential read access, and the list cannot be loaded into memory. 30 minutes — go.

First I asked if anything could be assumed about the statistics on the distribution of the numbers. Nope, could be anything, except the numbers are 32-bit ints. After fiddling around for a bit I finally decided on a scheme that creates a bounding interval for the unknown median value (one variable contains the upper bound and one contains the lower bound based on 2^32 possible values) and then adjusts this interval on each successive pass through the data. The last step is to average the upper and lower bound in case there are an odd number of integers. Worst case, this approach requires O(log n) passes through the data, so even for trillions of numbers it’s fairly quick.

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

Well, the interview feedback arrived, and it turns out my solution was rejected for being suboptimal. Apparently there is a more efficient approach that utilizes priority heaps. After looking up and reading about the priority heap approach, all I can say is that I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

I had never used leetcode before because I never had difficulty with prior coding interviews (my last job search was many years before the 2022 layoffs), but after this interview, I immediately signed up for a subscription. And of course the “median file integer” question I received is one of the most asked questions on the list of “hard” problems.

geraldwhen1y ago

These are horrible interview questions. Data issues like this do pop up in the real world; heck I deal with them.

But when you run into <algorithm> or <feels like algorithm>, the correct solution is to slow down, Google, then document your work as you go. In a real application log n may be insufficient. But coding interview exercises need tight constraints to fit the nature of the interview.

pfdietz1y ago

You don't need to load trillions of numbers into memory, you just need to count how many of each number there are. This requires 2^32 words of memory, not trillions of words. After doing that just scan down the array of counts, summing, until you find the midpoint.

Xcelerate1y ago

Yeah, I thought of that actually, but the interviewer said “very little memory” at one point which gave me the impression that perhaps I only had some registers available to work with. Was this an algorithm for an embedded system?

The whole problem was kind of miscommunicated, because the interviewer showed up 10 minutes late, picked a problem from a list, and the requirements for the problem were only revealed when I started going a direction the interviewer wasn’t looking for (“Oh, the file is actually read-only.” “Oh, each number in the file is an integer, not a float.”)

jagged-chisel1y ago

That “miscommunication” you mention has been used against me in several interviews, because I was expected to ask questions (and sometimes a specific question they had in mind) before making assumptions. Well, then the 30min becomes an exercise in requirements gathering and not algorithmic implementation.

1 more reply

creata1y ago

With 256 counters, you could use the same approach with four passes: pass i bins the numbers by byte i (0 = most sig., 3 = least sig.) and then identifies the bin that contains the median.

I really want to know what a one-pass, low-memory solution looks like, lol.

1 more reply

jiggawatts1y ago

I've heard of senior people applying for jobs like this simply turning the interview question around and demanding that the person asking it solve it in the allotted time. A surprisingly high percentage of the time they can't.

Xcelerate1y ago

This company receives so many candidates that the interviewer would have just ended the call and moved on to the next candidate.

I get the notion of making the point out of principle, but it’s sort of like arguing on the phone with someone at a call center—it’s better to just cut your losses quickly and move on to the next option in the current market.

jagged-chisel1y ago

And we, as software engineers, should also take that advice: it's better to just cut your losses quickly and move on to the next option in the current market.

anonymoushn1y ago

Do you mean topK rather than median, for K small? You certainly cannot build a heap with trillions of items in it.

Xcelerate1y ago

No, I mean median. Here is an article describing a very similar problem since I can’t link to the leetcode version: https://www.geeksforgeeks.org/median-of-stream-of-running-in...

raincole1y ago

> Auxiliary Space : O(n).

> The Space required to store the elements in Heap is O(n).

I don't think this algorithm is suitable for trillions of items.

osti1y ago

I'm wondering what heap approach can solve that problem, as I can't think of any. Hopefully OP got a link to the thesis.

The n log n approach definitely works though.

Tarean1y ago

But that stores all elements into memory?

ted_dunning1y ago

That's a bullshit question.

My own response would have been a variant on radix-sort. Keep an array of 256 counters, and make a pass counting all of the high bytes. Now you know the high byte of the median. Make another pass keeping a histogram of the second byte of all values that match the high byte. And so on.

This takes four passes and requires 256 x 8 byte counters plus incidentals.

In a single pass you can't get the exact answer.

jagged-chisel1y ago

> … I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

What a bullshit task. I’m beginning to think this kind of interviewing should be banned. Seems to me it’s just an easy escape hatch for the interviewer/hiring manager when they want to discriminate based on prejudice.

ted_dunning1y ago

Banning stupid interview questions is a bad idea for job applicants since they are such a good indication of bullshit job culture.

jagged-chisel1y ago

Excellent point.

jagged-chisel1y ago

It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.

mnw21cam1y ago

It's quicksort, but neglecting a load of the work that quicksort would normally have to do. Instead of recursing twice, leading to O(nlogn) behaviour, it's only recursing once.

KMag1y ago

I used to ask how to find the 10th percentile value from an arbitrarily ordered list as an interview question. Most candidates suggested sorting, and then I'd ask if they could do better. If they got stuck, I'd ask them which sorting algorithm they'd suggest. If they suggested quicksort, then I could gently guide them down optimizing quicksort to quickselect. Most candidates made the mistake of believing getting rid of half the work at every division results in half the work overall. They realized it was significantly faster, but usually didn't realize it was O(N) expected time.

If we had time, I'd ask about the worst-case scenario, and see if they could optimize heapsort to heapselect. Good candidates could suggest starting out with selectsort optimistically and switching to heapselect if the number of recursions exceeded some constant times the number of expected recursions.

If they knew about median-of-medians, they could probably just suggest introselect at the start, and move on to another question.

jagged-chisel1y ago

ETA: You said "used to" and I didn't acknowledge that. This is targeted at that kind of interview, not you directly.

---

Had some "lucky" candidate stumbled upon an optimization you had never seen before, would you recognize it? If so, would you be honest and let them keep their discovery? After all, this isn't work for hire ...

Moving on. Did these interviews reflect the day-to-day work these software engineers would be performing if they were accepted? I guarantee your business isn't going to recoup millions of dollars because someone, in their day-to-day, hit on an optimation of an existing algorithm. Nor is it likely they'll be discovering wonderful new money-saving algorithms for your business.

If you're a pure research lab, employing PhD candidates and PhDs, maybe this kind of interview does indicate the required skills.

1 more reply

someplaceguy1y ago

I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)

Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

raincole1y ago

Big-O notation and "real-world computer" don't belong to the same sentence. The whole point of big-O notation is to abstract the algorithm out of real-world limitations so we can talk about arbitrarily large input.

Any halting program that runs on a real world computer is O(1), by definition.

someplaceguy1y ago

> The whole point of big-O notation is to abstract the algorithm out of real-world limitations so we can talk about arbitrarily large input.

Except that there is no such thing as "arbitrarily large storage", as my link in the parent comment explained: https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

So why would you want to talk about arbitrarily large input (where the input is an array that is stored in memory)?

As I understood, this big-O notation is intended to have some real-world usefulness, is it not? Care to elaborate what that usefulness is, exactly? Or is it just a purely fictional notion in the realm of ideas with no real-world application?

And if so, why bother studying it at all, except as a mathematical curiosity written in some mathematical pseudo-code rather than a programming or engineering challenge written in a real-world programming language?

Edit: s/pretending/intended/

fenomas1y ago

Big-O analysis is about scaling behavior - its real-world implications lie in what it tells you about relative sizes, not absolute sizes.

E.g., if you need to run a task on 10M inputs, then knowing that your algorithm is O(N) doesn't tell you anything at all about how long your task will take. It also doesn't tell you whether that algorithm will be faster than some other algorithm that's O(N^2).

But it does tell you that if your task size doubles to 20M inputs, you can expect the time required for the first algorithm to double, and the second to quadruple. And that knowledge isn't arcane or theoretical, it works on real-world hardware and is really useful for modeling how your code will run as inputs scale up.

1 more reply

raincole1y ago

> (where the input is an array that is stored in memory)?

If the input is an array that is stored in a piece of real-world memory, then the only possible complexity is O(1). It's just how big-O works. Big-O notation is an abstraction that is much much closer to mathematics than to engineering.

> this big-O notation is pretending to have some real-world usefulness...

Big-O notation is not a person so I'm not sure whether it can pretend something. CS professors might exaggerate big-O notation's real-world usefulness so their students don't fall asleep too fast.

> fictional

Theoretical. Just like the other theoretical ideas, at best big-O notation gives some vague insights that help people solve real problems. It gives a very rough feeling about whether an algorithm is fast or not.

By the way, Turing machine is in this category as well.

SkiFire131y ago

Ultimately when an algorithm has worse complexity than another it might still be faster up to a certain point. In this case 5 is likely under that point, though I doubt 2^256 will.

In practice you might also want to use a O(n^2) algorithm like insertion sort under some threshold.

someplaceguy1y ago

> Ultimately when an algorithm has worse complexity than another it might still be faster up to a certain point.

Sure, but the author didn't argue that the simpler algorithm would be faster for 5 items, which would indeed make sense.

Instead, the author argued that it's OK to use the simpler algorithm for less than 5 items because 5 is a constant and therefore the simpler algorithm runs in constant time, hence my point that you could use the same argument to say that 2^140 (or 2^256) could just as well be used as the cut-off point and similarly argue that the simpler algorithm runs in constant time for all arrays than can be represented on a real-world computer, therefore obviating the need for the more complex algorithm (which obviously makes no sense).

thrw24867761y ago

If you set n=2^140, then sure, it’s constant. If instead you only have n<=2^140, then n varies across a large set and is practically indistinguishable from n<=infinity (since we get into the territory of the number of atoms in the universe), therefore you can perform limit calculations on it, in particular big O stuff.

In the article n was set to 5. All of those arrays (except maybe 1) have exactly 5 elements. There is no variance (and even if there was, it would be tiny, there is no point in talking about limits of 5-element sequences).

1 more reply

ignoramous1y ago

If an approximation is enough, the p2 quantile estimator (O(1) memory) is pretty neat: https://news.ycombinator.com/item?id=25201093

saagarjha1y ago

This is hinted at in the post but if you're using C++ you will typically have access to quickselect via std::nth_element. I've replaced many a sort with that in code review :) (Well, not many. But at least a handful.)

conradludgate1y ago

Same with rust, there's the `select_nth_unstable` family on slices that will do this for you. It uses a more fancy pivot choosing algorithm but will fall back to median-of-medians if it detects it's taking too long

chpatrick1y ago

Another nice one is O(1) weighted sampling (after O(n) preprocessing).

https://en.wikipedia.org/wiki/Alias_method

melonmouse1y ago

The linked proof for that median of medians is O(n) feels counterintuitive to me. Here's a (simpler?) alternative.

  T(0) = 0
  T(1) = 1
  T(n) = n + T(n/5) + T(7/10*n)

We want to prove that:

  T(n) ≤ C*n

It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:

Induction base: it holds for all a+b < 1, the only case being a=0, b=0:

  T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)

Induction step: suppose it holds for all a+b < k. Let a+b = k.

  T(a+b) = T(k)
         = k + T(k/5) + T(7/10*k)
         ≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
         = [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
         = T(a) + T(b)

Because T is superadditive:

  T(n) = n + T(n/5) + T(7/10*n)
       ≤ n + T(n/5 + 7/10*n)
       = n + T(9/10*n)

Now we can apply the master theorem. Or to write out the proof (using a geometric series):

  T(n) ≤ n + T(9/10*n)
       ≤ n * ∑ᵢ₌₀ᶦⁿᶠᶦⁿᶦᵗʸ (9/10)^i
       = n * 1/(1-9/10)
       = 10*n

So, we have shown the algorithm is O(n) with C=10 (or less).

beyondCritics1y ago

I like the idea to use super additivity, but in a proof you cannot creatively extend T to the reals, this should be fixed.

Here is the slightly mopped up proof i had in mind, when i posted my hints below:

  Let be r>=1 and 0<a(i) for all 1<=i<=r and 1/a(1) + ... + 1/a(n) =: s < 1.
  Then a(i) > 1 for all 1 <= i <= r. 

  Let be c > 0 and
  T(0) := 0
  T(n) := c \* n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))

  Then T(n) <= b * n for all n with b := c/(1-s) > 0 !
  Proof by induction: 
  "n=0" : 
   The statement holds trivially.

  "k->n": 
   Let n>=1 and assume the statement holds for all 0<=k<n. 
   Now since a(i)>1 we have floor(n/a(i)) <= n/a(i) < n. By the induction hypothesis therefore
   T(floor(n/a(i))) <= b * floor(n/a(i)) <= b * n/a(i). 
   Apply this to get:
   T(n) =  c * n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))
        <= c * n + b * n/a(1) + ... +  b * n/a(r)
        = (c + b*s) * n
        = b * n.
   Hence T(n) <= b * n.

hammeiam1y ago

The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me

marcosdumay1y ago

O(n log 5) is O(n). There's no cheating, sorting small arrays in a list is a completely different problem from sorting a large array.

tptacek1y ago

They're not sorting all the arrays?

Later

(i was going to delete this comment, but for posterity, i misread --- sorting the lists, not the contents of the list, sure)

1 more reply

IncreasePosts1y ago

It would only be cheating if you could merge the arrays in O(1), which you can't.

hammeiam1y ago

ahh this is the insight I was missing, thank you!

Sharlin1y ago

It’s unambiguously O(n), there’s no lg n anywhere to be seen. It may be O(n) with a bit larger constant factor, but the whole point of big-O analysis is that those don’t matter.

pfortuny1y ago

Actually lots of algorithms "feel" like cheating until you understand what you were not looking at (fast matrix multiplication, fast fourier transforms...).

Someone1y ago

FTA:

“Proof of Average O(n)

On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”

That “therefore” doesn’t follow, so this is more an intuition than a proof. The problem with it is that the medium is more likely to end up in the larger of the two pieces, so you more likely have to recurse on the larger part than on the smaller part.

What saves you is that O(n) doesn’t say anything about constants.

Also, I would think you can improve things a bit for real world data by, on subsequent iterations, using the average of the set as pivot (You can compute that for both pieces on the fly while doing the splitting. The average may not be in the set of items, but that doesn’t matter for this algorithm). Is that true?

meatmanek1y ago

If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning. That means on average you'd only discard 25% of the array after each partition. Your selected pivot is either below the median, above the median, or exactly the median. If it's below the median, it could be anywhere in the range [p0, p50) for an average of around p25; if it's above the median, it could be anywhere in the range (p50, p100] for an average of around p75.

Since these remaining fractions combine multiplicatively, we actually care about the geometric mean of the remaining fraction of the array, which is e^[(integral of ln(x) dx from x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.

Regardless, it forms a geometric series, which should converge to 1/(1-0.735) or about 3.77.

Regarding using the average as the pivot: the question is really what quantile would be equal to the mean for your distribution. Heavily skewed distributions would perform pretty badly. It would perform particularly badly on 0.01*np.arange(1, 100) -- for each partitioning step, the mean would land between the first element and the second element.

Someone1y ago

> If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning.

Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example.

So, your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.

> Heavily skewed distributions would perform pretty badly

That’s why I used the weasel worlds “real world data” ;-)

I also thought about mentioning that skew can be computed streaming (see for example https://www.boost.org/doc/libs/1_53_0/doc/html/accumulators/...), but even if you have that, there still are distributions that will perform badly.

sfpotter1y ago

A nice way to approximate the median: https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf

RcouF1uZ4gsC1y ago

> The C++ standard library uses an algorithm called introselect which utilizes a combination of heapselect and quickselect and has an O(nlogn) bound.

Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.

Tarean1y ago

Love this algorithm. It feels like magic, and then it feels obvious and basically like binary search.

Similar to the algorithm to parallelize the merge step of merge sort. Split the two sorted sequences into four sequences so that `merge(left[0:leftSplit], right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])` is sorted. leftSplit+rightSplit should be halve the total length, and the elements in the left partition must be <= the elements in the right partition.

Seems impossible, and then you think about it and it's just binary search.

teo_zero1y ago

> On average, the pivot will split the list into 2 approximately equal-sized pieces.

Where does this come from?

Even assuming a perfect random function, this would be true only for distributions that show some symmetry. But if the input is all 10s and one 5, each step will generate quite different-sized pieces!

paldepind21y ago

I think you answered your own question. It's the standard average-time analysis of Quicksort and the (unmentioned) assumption is that the numbers are from some uniform distribution.

Why would the distribution have to be symmetric? My intuition is that if you sample n numbers from some distribution (even if it's skewed) and pick a random number among the n numbers, then on average that number would be separate the number into two equal-sized sets. Are you saying that is wrong?

teo_zero1y ago

With real numbers, I have the same intuition. But with integers, where 2 or more elements can be exactly the same, and with the two sets defined as they are defined in TFA, that is one "less than" and one "greater or equal", then I'd argue that the second set will be bigger than the former.

In the pathological case where all the elements are the same value, one set will always be empty and the algorithm will not even terminate.

In a less extreme case where nearly all the items are the same except a few ones, then the algorithm will slowly advance, but not with the progression n, n/2, n/4, etc. that is needed to prove it's O(n).

Please note that the "less extreme case" I depicted above is quite common in significant real-world statistics. For example, how many times a site is visited by unique users per day: a long sequence of 1s with some sparse numbers>1. Or how many children/cars/pets per family: many repeated small numbers with a few sparse outliers. Etc.

runiq1y ago

Why is it okay to drop not-full chunks? The article doesn't explain that and I'm stupid.

Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.

ValleZ1y ago

I was asked to invent this algorithm on a whiteboard in 30 minutes. Loved it.

beyondCritics1y ago

<It’s not straightforward to prove why this is O(n).

Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.

kccqzy1y ago

> Quickselect gets us linear performance, but only in the average case. What if we aren’t happy to be average, but instead want to guarantee that our algorithm is linear time, no matter what?

I don't agree with the need for this guarantee. Note that the article already says the selection of the pivot is by random. You can simply choose a very good random function to avoid an attacker crafting an input that needs quadratic time. You'll never be unlucky enough for this to be a problem. This is basically the same kind of mindset that leads people into thinking, what if I use SHA256 to hash these two different strings to get the same hash?

mitthrowaway21y ago

It's a very important guarantee for use in real-time signal processing applications.

forrestthewoods1y ago

> I don't agree with the need for this guarantee.

You don’t get to agree with it or not. It depends on the project! Clearly there exist some projects in the world where it’s important.

But honestly it doesn’t matter. Because as the article shows with random data that median-of-medians is strictly better than random pivot. So even if you don’t need the requirement there is zero loss to achieve it.

kccqzy1y ago

The median-of-median comes at a cost for execution time. Chances are, sorting each five-element chunk is a lot slower than even running a sophisticated random number generator.

Quekid51y ago

Slowness (lower throughput) is often the tradeoff for more predictable run time.

forrestthewoods1y ago

Did you read the article? Median-of-median results in fewer comparisons than random.

zelphirkalt1y ago

You can simply pass once over the data, and while you do that, count occurrences of the elements, memorizing the last maximum. Whenever an element is counted, you check, if that count is now higher than the previous maximum. If it is, you memorize the element and its count as the maximum, of course. Very simple approach and linear in time, with minimal book keeping on the way (only the median element and the count (previous max)).

I don't find it surprising or special at all, that finding the median works in linear time, since even this ad-hoc thought of way is in linear time.

EDIT: Ah right, I mixed up mode and median. My bad.

gcr1y ago

This finds the mode (most common element), not the median.

Wouldn't you also need to keep track of all element counts with your approach? You can't keep the count of only the second-most-common element because you don't know what that is yet.

zelphirkalt1y ago

Yes, you are right. I mixed up mode and median.

And yes, one would need to keep track of at least a key for each element (not a huge element, if they are somehow huge). But that would be about space complexity.

gcr1y ago

pardon! it's fun to think about though!

vismit20001y ago

This is covered in section 9.3 in CLRS book - Medians and Order Statistics

SkiFire131y ago

I wonder what's the reason of picking groups of 5 elements instead of 2 or 8.

danlark1y ago

3 and 4 elements will fail to prove the complexity is linear

You still can do 3 or 4 but with slight modifications

https://arxiv.org/abs/1409.3600

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

lalaland11251y ago

1. You want an odd number so the median is the middle element of the sublist.

2. One and three are probably too small

nilslindemann1y ago

"ns" instead of "l" and "n" instead of "el" would have been my choice (seen in Haskell code).

robinhouston1y ago

The trouble with using this convention (which I also like) in Python code is that sooner or later one wants to name a pair of lists 'as' and 'bs', which then causes a syntax error because 'as' is a keyword in Python. There is a similar problem with 'is' and 'js'.

nilslindemann1y ago

Sure, naming is hard, but avoid "l", "I", "O", "o".

Very short variable names (including "ns" and "n") are always some kind of disturbance when reading code, especially when the variable lasts longer than one screen of code – one has to memorize the meaning. They sometimes have a point, e.g. in mathematical code like this one. But variables like "l" and "O" are bad for a further reason, as they can not easily be distinguished from the numbers. See also the Python style guide: https://peps.python.org/pep-0008/#names-to-avoid

j / k navigate · click thread line to collapse

181 comments

danlark1y ago

Around 4 years ago I compared lots of different median algorithms and the article turned out to be much longer than I anticipated :)

https://danlark.org/2020/11/11/miniselect-practical-and-gene...

thanatropism1y ago

Is any of those easily modifiable to return the arg-median (the index which has the median).

cfors1y ago

Just wanted to say thank you for this article - I've read and shared this a few times over the years!

rented_mule1y ago

justinpombrio1y ago

(I made the number 10,000 up, but you could do some statistics to figure out how many samples would be needed for a given level of confidence, and I don't think it would be prohibitively large.)

rented_mule1y ago

jiggawatts1y ago

Speaking of "single pass", one of the criticisms I have of the "enumerator" patterns in modern programming languages is that they encourage multiple passes.

As an example: computing the .min() and .max() of an enumerable is two passes even though it could be done with one pass.

I'd love to see a language embrace a more efficient style similar to how a SQL does it, where you can elegantly request this as a single pass over the data: "SELECT min(x), max(x) FROM y"

3 more replies

sroussey1y ago

Actually, seeking the bias numbers can be quite illuminating.

1 more reply

enriquto1y ago

> the latter is much easier: sample 10,000 elements uniformly at random and take their median

Do you have a source for that claim?

True, in that case any point between say -90 and 90 would be equally good as a median in most applications. But this does not mean that the median can be found accurately by your method.

B1FF_PSUVM1y ago

> this does not mean that the median can be found accurately by your method.

You can do dynamic sampling: e.g. take double the samples, see what decimal in your result budges. Adjust.

rhymer1y ago

Asymptotic properties of quantile estimators are widely studied [1]. The key is to have a sufficiently large sample size.

[1] Bahadur, R. R. (1966). A note on quantiles in large samples. Annals of Mathematical Statistics, 37, 577–580.

1 more reply

maronato1y ago

https://en.m.wikipedia.org/wiki/Non-uniform_random_variate_g...

justinpombrio1y ago

That's a good point, though, that the 49th percentile and 51st percentile can be arbitrarily far from the median.

andruby1y ago

I was thinking the same thing.

In all use-cases I've seen a close estimate of the median was enough.

hhmc1y ago

You also can use the fact that for any distribution, the median is never further than 1SD away from the mean.

digaozao1y ago

Filligree1y ago

Was this by any chance for generating availability metrics, and were you an intern at the time? The system sounds, ah, very familiar.

rented_mule1y ago

The metrics were about speed. And I was decades past my last internship at the time in question. But, as is so often the case, more than one of us may have been reinventing pretty similar wheels. :)

ant6n1y ago

I’m not sure why you use a dictionary with keys 0…999, instead of an array indexed 0…999.

rented_mule1y ago

tomrod1y ago

That's just a dict/map with less flexibility on the keys :D

ashton3141y ago

Where were you working? Sounds like you got lucky to work on some fun problems!

rented_mule1y ago

Sorry, but I'm trying to keep this account relatively anonymous to sidestep some of my issues with being shy.

ashton3141y ago

> I'm trying to keep this account relatively anonymous

I totally respect that—don't let me pressure you any further.

Yeah, sounds like you had the ideal work setup: cool problem requiring cool technical solutions with good management. That's wonderful!

xinok1y ago

He also gave a talk about his algorithm in 2016. He's an entertaining presenter, I highly recommended!

There's Treasure Everywhere - Andrei Alexandrescu

https://www.youtube.com/watch?v=fd1_Miy1Clg

_yid91y ago

Andrei Alexandrescu is awesome; around 2000 he gave on talk on lock-free wait-free algorithms that I immediately applied to a huge C++ industrial control networking project at the time.

I'd recommend anyone who writes software listening and reading anything of Andrei's you can find; this one is indeed a Treasure!

fasa991y ago

that's wild, a bit of a polymath by computer science standards. I know him from template metaprogramming fame and here he is shifting from programming languages to algorithms

mabbo1y ago

The lesson was so simple yet so profound that I nearly applied to grad school because of it. I have no idea if they even recall the conversation, but it was a pivotal moment of my education.

zelphirkalt1y ago

blt1y ago

I am not the original commenter, but I (and probably many CS students) have had similar moments of clarity. The key part for me isn't

> there might be an even faster linear algorithm,

but

> it's possible to do selection of an unsorted list in O(n) time. At one point, we didn't know whether that was even possible.

anonymoushn1y ago

mrguyorama1y ago

I think we got a constant factor of 22 for this algorithm so maybe it was a related one or something.

kwantam1y ago

One of the fun things about the median-of-medians algorithm is its completely star-studded author list.

Manuel Blum - Turing award winner in 1995

Robert Floyd - Turing award winner in 1978

Ron Rivest - Turing award winner in 2002

Bob Tarjan - Turing award winner in 1986 (oh and also the inaugural Nevanlinna prizewinner in 1982)

Four independent Turing awards! SPARCstations! This paper has it all.

jiggawatts1y ago

ted_dunning1y ago

And if you really want to impress, you reach into your pack and pull out the pens you carry just in case you run into dry pens at a critical moment.

jiggawatts1y ago

Munksgaard1y ago

Here's a direct link for anyone who, like me, would be interested in reading the original article: https://people.csail.mit.edu/rivest/pubs/BFPRT73.pdf

That's an impressive list of authors, for sure.

praptak1y ago

Some other awesome stuff by Pratt:

Pratt parsing (HN discussion: https://news.ycombinator.com/item?id=39066465), the "P" in the KMP algorithm.

someplaceguy1y ago

    return l[len(l) / 2]

I'm not a Python expert, but doesn't the `/` operator return a float in Python? Why would you use a float as an array index instead of doing integer division (with `//`)?

I know this probably won't matter until you have extremely large arrays, but this is still quite a code smell.

    def quickselect_median(l, pivot_fn=random.choice):
        if len(l) % 2 == 1:
            return quickselect(l, len(l) // 2, pivot_fn)
        else:
            return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
                           quickselect(l, len(l) / 2, pivot_fn))

That we're 50 comments in and nobody seems to have noticed this only serves to reinforce my existing prejudice against the average Python code quality.

jononor1y ago

Well spotted! In Python 2 there was only one operator, but in Python 3 they are distinct. Indexing an array with a float raises an exception, I believe.

runeblaze1y ago

TacticalCoder1y ago

I really enjoyed TFA but this:

Our world is build on probabilities: all our private keys are protected by the mathematical improbability that someone shall guess them correctly.

From what I read, a shuffle followed by quickselect is O(n) for all practical purposes.

bo10241y ago

Reubend1y ago

> If adversarial input is a concern, doing a O(n) shuffle of the data first guarantees this cannot happen.

It doesn't guarantee that you avoid the worst case, it just removes the possibility of forcing the worst case.

furstenheim1y ago

Floyd Ryvest also does the job . A bit more efficient IIRC.

However I never managed to understand how it works.

https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...

throwaway2945311y ago

If you're selecting the n:th element, where n is very small (or large), using median-of-medians may not be the best choice.

If anyone has pointers to fast generic & in-place selection algorithms, I'm interested.

[1] https://doi.org/10.4230/LIPIcs.SEA.2017.24

mgaunard1y ago

You could also use one of the streaming algorithms which allow you to compute approximations for arbitrary quantiles without ever needing to store the whole data in memory.

cosmic_quanta1y ago

Personally I would gravitate towards the quickselect algorithm described in the OP until I was forced to consider a streaming median approximation method.

skrtskrt1y ago

A use case for something like this, and not just for medians, is where you have a querying/investigation UI like Grafana or Datadog or whatever.

conradludgate1y ago

SkiFire131y ago

Then you don't get a guaranteed O(n) complexity if the approximated algorithm happen to make bad choices

mgaunard1y ago

any approximation gives a bound on the error.

sevensor1y ago

I’ve definitely had situations where a streaming quantile algorithm would have been useful, do you have any references?

fanf21y ago

There are two kinds:

- quantile sketches, such as t-digest, which aim to control the quantile error or rank error. Apache DataSketches has several examples, https://datasketches.apache.org/docs/Quantiles/QuantilesOver...

ssfrr1y ago

1 more reply

throwaway_24941y ago

See also the Greenwald Khanna quantile estimator, an online algorithm which can compute any quantile within a given ϵ.

https://aakinshin.net/posts/greenwald-khanna-quantile-estima...

1 more reply

sevensor1y ago

Awesome, you did one! I’ll give it a read.

ssfrr1y ago

Here's a simple one I've used before. It's a variation on FAME (Fast Algorithm for Median Estimation) [1].

[1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streaming-m...

[2]: https://stats.stackexchange.com/a/445714

ted_dunning1y ago

The state of the art has moved well beyond these algorithms. See these

https://github.com/tdunning/t-digest https://www.sciencedirect.com/science/article/pii/S266596382... https://arxiv.org/pdf/2102.09299

And, as I mentioned else-thread, exponential histograms are the best choice in almost all practical situations.

sevensor1y ago

fzy951y ago

"Further analysis of the remedian algorithm" https://www.sciencedirect.com/science/article/pii/S030439751...

This one has a streaming variant.

sevensor1y ago

Thanks!!

anonymoushn1y ago

contravariant1y ago

It should also be noted that radix sort is ridiculously fast because it just scans linearly through the list each time.

anonymoushn1y ago

contravariant1y ago

exDM691y ago

Radix sort is awesome if k is small, N is huge and/or you are using a GPU. On a CPU, comparison based sorting is faster in most cases.

anonymoushn1y ago

No, it's O(N+M) where N is the number of strings and M is the sum of the lengths of the strings. Maybe your radix sort has some problems?

A very similar similar radix sort is included in https://github.com/alichraghi/zort which includes some benchmarks, but I haven't done the work to have it work on strings or arbitrary structs.

Aardwolf1y ago

> No, it's O(N+M) where N is the number of strings and M is the sum of the lengths of the strings.

EDIT: I think it does make sense, M is length*N, and in scenarios where this matters this length will be in the order of log(N) anyway so it's still NlogN-sh.

2 more replies

dataflow1y ago

1 more reply

ncruces1y ago

An implementation in Go, that's (hopefully) simple enough to be understandable, yet minimally practical:

https://github.com/ncruces/sort/blob/main/quick/quick.go

Xcelerate1y ago

I wrapped up the solution right at the time limit, and my code ran fine on the test cases. Was decently proud of myself for getting a solution in the allotted time.

geraldwhen1y ago

These are horrible interview questions. Data issues like this do pop up in the real world; heck I deal with them.

pfdietz1y ago

Xcelerate1y ago

jagged-chisel1y ago

1 more reply

creata1y ago

With 256 counters, you could use the same approach with four passes: pass i bins the numbers by byte i (0 = most sig., 3 = least sig.) and then identifies the bin that contains the median.

I really want to know what a one-pass, low-memory solution looks like, lol.

1 more reply

jiggawatts1y ago

Xcelerate1y ago

This company receives so many candidates that the interviewer would have just ended the call and moved on to the next candidate.

jagged-chisel1y ago

And we, as software engineers, should also take that advice: it's better to just cut your losses quickly and move on to the next option in the current market.

anonymoushn1y ago

Do you mean topK rather than median, for K small? You certainly cannot build a heap with trillions of items in it.

Xcelerate1y ago

No, I mean median. Here is an article describing a very similar problem since I can’t link to the leetcode version: https://www.geeksforgeeks.org/median-of-stream-of-running-in...

raincole1y ago

> Auxiliary Space : O(n).

> The Space required to store the elements in Heap is O(n).

I don't think this algorithm is suitable for trillions of items.

osti1y ago

I'm wondering what heap approach can solve that problem, as I can't think of any. Hopefully OP got a link to the thesis.

The n log n approach definitely works though.

Tarean1y ago

But that stores all elements into memory?

ted_dunning1y ago

That's a bullshit question.

This takes four passes and requires 256 x 8 byte counters plus incidentals.

In a single pass you can't get the exact answer.

jagged-chisel1y ago

> … I didn’t realize the interview task was to re-implement someone’s PhD thesis in 30 minutes...

ted_dunning1y ago

Banning stupid interview questions is a bad idea for job applicants since they are such a good indication of bullshit job culture.

jagged-chisel1y ago

Excellent point.

jagged-chisel1y ago

It's quicksort with a modification to select the median during the process. I feel like this is a good way to approach lots of "find $THING in list" questions.

mnw21cam1y ago

It's quicksort, but neglecting a load of the work that quicksort would normally have to do. Instead of recursing twice, leading to O(nlogn) behaviour, it's only recursing once.

KMag1y ago

If they knew about median-of-medians, they could probably just suggest introselect at the start, and move on to another question.

jagged-chisel1y ago

ETA: You said "used to" and I didn't acknowledge that. This is targeted at that kind of interview, not you directly.

---

If you're a pure research lab, employing PhD candidates and PhDs, maybe this kind of interview does indicate the required skills.

1 more reply

someplaceguy1y ago

I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)

Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

raincole1y ago

Any halting program that runs on a real world computer is O(1), by definition.

someplaceguy1y ago

> The whole point of big-O notation is to abstract the algorithm out of real-world limitations so we can talk about arbitrarily large input.

Except that there is no such thing as "arbitrarily large storage", as my link in the parent comment explained: https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

So why would you want to talk about arbitrarily large input (where the input is an array that is stored in memory)?

Edit: s/pretending/intended/

fenomas1y ago

Big-O analysis is about scaling behavior - its real-world implications lie in what it tells you about relative sizes, not absolute sizes.

1 more reply

raincole1y ago

> (where the input is an array that is stored in memory)?

> this big-O notation is pretending to have some real-world usefulness...

Big-O notation is not a person so I'm not sure whether it can pretend something. CS professors might exaggerate big-O notation's real-world usefulness so their students don't fall asleep too fast.

> fictional

By the way, Turing machine is in this category as well.

SkiFire131y ago

Ultimately when an algorithm has worse complexity than another it might still be faster up to a certain point. In this case 5 is likely under that point, though I doubt 2^256 will.

In practice you might also want to use a O(n^2) algorithm like insertion sort under some threshold.

someplaceguy1y ago

> Ultimately when an algorithm has worse complexity than another it might still be faster up to a certain point.

Sure, but the author didn't argue that the simpler algorithm would be faster for 5 items, which would indeed make sense.

thrw24867761y ago

1 more reply

ignoramous1y ago

If an approximation is enough, the p2 quantile estimator (O(1) memory) is pretty neat: https://news.ycombinator.com/item?id=25201093

saagarjha1y ago

conradludgate1y ago

chpatrick1y ago

Another nice one is O(1) weighted sampling (after O(n) preprocessing).

https://en.wikipedia.org/wiki/Alias_method

melonmouse1y ago

The linked proof for that median of medians is O(n) feels counterintuitive to me. Here's a (simpler?) alternative.

  T(0) = 0
  T(1) = 1
  T(n) = n + T(n/5) + T(7/10*n)

We want to prove that:

  T(n) ≤ C*n

It is intuitive that T(a+b) ≥ T(a) + T(b), or in other words, T is superadditive. That can be shown by induction:

Induction base: it holds for all a+b < 1, the only case being a=0, b=0:

  T(0+0) = 0 + T(0) + T(0) ≥ T(0) + T(0)

Induction step: suppose it holds for all a+b < k. Let a+b = k.

  T(a+b) = T(k)
         = k + T(k/5) + T(7/10*k)
         ≥ k + T(a/5) + T(b/5) + T(7/10*a) + T(7/10*b)
         = [a + T(a/5) + T(7/10*a)] + [b + T(b/5) + T(7/10*b)]
         = T(a) + T(b)

Because T is superadditive:

  T(n) = n + T(n/5) + T(7/10*n)
       ≤ n + T(n/5 + 7/10*n)
       = n + T(9/10*n)

Now we can apply the master theorem. Or to write out the proof (using a geometric series):

  T(n) ≤ n + T(9/10*n)
       ≤ n * ∑ᵢ₌₀ᶦⁿᶠᶦⁿᶦᵗʸ (9/10)^i
       = n * 1/(1-9/10)
       = 10*n

So, we have shown the algorithm is O(n) with C=10 (or less).

beyondCritics1y ago

I like the idea to use super additivity, but in a proof you cannot creatively extend T to the reals, this should be fixed.

Here is the slightly mopped up proof i had in mind, when i posted my hints below:

  Let be r>=1 and 0<a(i) for all 1<=i<=r and 1/a(1) + ... + 1/a(n) =: s < 1.
  Then a(i) > 1 for all 1 <= i <= r. 

  Let be c > 0 and
  T(0) := 0
  T(n) := c \* n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))

  Then T(n) <= b * n for all n with b := c/(1-s) > 0 !
  Proof by induction: 
  "n=0" : 
   The statement holds trivially.

  "k->n": 
   Let n>=1 and assume the statement holds for all 0<=k<n. 
   Now since a(i)>1 we have floor(n/a(i)) <= n/a(i) < n. By the induction hypothesis therefore
   T(floor(n/a(i))) <= b * floor(n/a(i)) <= b * n/a(i). 
   Apply this to get:
   T(n) =  c * n + T(floor(n/a(1))) + ... + T(floor(n/a(r)))
        <= c * n + b * n/a(1) + ... +  b * n/a(r)
        = (c + b*s) * n
        = b * n.
   Hence T(n) <= b * n.

hammeiam1y ago

The "Split the array into subarrays of length 5, now sorting all of the arrays is O(n) instead of O(n log n)" feels like cheating to me

marcosdumay1y ago

O(n log 5) is O(n). There's no cheating, sorting small arrays in a list is a completely different problem from sorting a large array.

tptacek1y ago

They're not sorting all the arrays?

Later

(i was going to delete this comment, but for posterity, i misread --- sorting the lists, not the contents of the list, sure)

1 more reply

IncreasePosts1y ago

It would only be cheating if you could merge the arrays in O(1), which you can't.

hammeiam1y ago

ahh this is the insight I was missing, thank you!

Sharlin1y ago

It’s unambiguously O(n), there’s no lg n anywhere to be seen. It may be O(n) with a bit larger constant factor, but the whole point of big-O analysis is that those don’t matter.

pfortuny1y ago

Actually lots of algorithms "feel" like cheating until you understand what you were not looking at (fast matrix multiplication, fast fourier transforms...).

Someone1y ago

FTA:

“Proof of Average O(n)

On average, the pivot will split the list into 2 approximately equal-sized pieces. Therefore, each subsequent recursion operates on 1⁄2 the data of the previous step.”

What saves you is that O(n) doesn’t say anything about constants.

meatmanek1y ago

Regardless, it forms a geometric series, which should converge to 1/(1-0.735) or about 3.77.

Someone1y ago

> If I'm understanding correctly, the median is actually guaranteed to be in the larger of the two pieces of the array after partitioning.

Only in the first iteration. There’s a good chance it will be in the smaller one in the second iteration, for example.

So, your analysis is a bit too harsh, but probably good enough for a proof that it’s O(n) on average.

> Heavily skewed distributions would perform pretty badly

That’s why I used the weasel worlds “real world data” ;-)

sfpotter1y ago

A nice way to approximate the median: https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf

RcouF1uZ4gsC1y ago

> The C++ standard library uses an algorithm called introselect which utilizes a combination of heapselect and quickselect and has an O(nlogn) bound.

Introselect is a combination of Quickselect and Median of Medians and is O(n) worst case.

Tarean1y ago

Love this algorithm. It feels like magic, and then it feels obvious and basically like binary search.

Seems impossible, and then you think about it and it's just binary search.

teo_zero1y ago

> On average, the pivot will split the list into 2 approximately equal-sized pieces.

Where does this come from?

paldepind21y ago

I think you answered your own question. It's the standard average-time analysis of Quicksort and the (unmentioned) assumption is that the numbers are from some uniform distribution.

teo_zero1y ago

In the pathological case where all the elements are the same value, one set will always be empty and the algorithm will not even terminate.

runiq1y ago

Why is it okay to drop not-full chunks? The article doesn't explain that and I'm stupid.

Edit: I just realized that the function where non-full chunks are dropped is just the one for finding the pivot, not the one for finding the median. I understand now.

ValleZ1y ago

I was asked to invent this algorithm on a whiteboard in 30 minutes. Loved it.

beyondCritics1y ago

<It’s not straightforward to prove why this is O(n).

Replace T(n/5) with T(floor(n/5)) and T(7n/10) with T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.

kccqzy1y ago

> Quickselect gets us linear performance, but only in the average case. What if we aren’t happy to be average, but instead want to guarantee that our algorithm is linear time, no matter what?

mitthrowaway21y ago

It's a very important guarantee for use in real-time signal processing applications.

forrestthewoods1y ago

> I don't agree with the need for this guarantee.

You don’t get to agree with it or not. It depends on the project! Clearly there exist some projects in the world where it’s important.

kccqzy1y ago

The median-of-median comes at a cost for execution time. Chances are, sorting each five-element chunk is a lot slower than even running a sophisticated random number generator.

Quekid51y ago

Slowness (lower throughput) is often the tradeoff for more predictable run time.

forrestthewoods1y ago

Did you read the article? Median-of-median results in fewer comparisons than random.

zelphirkalt1y ago

I don't find it surprising or special at all, that finding the median works in linear time, since even this ad-hoc thought of way is in linear time.

EDIT: Ah right, I mixed up mode and median. My bad.

gcr1y ago

This finds the mode (most common element), not the median.

Wouldn't you also need to keep track of all element counts with your approach? You can't keep the count of only the second-most-common element because you don't know what that is yet.

zelphirkalt1y ago

Yes, you are right. I mixed up mode and median.

And yes, one would need to keep track of at least a key for each element (not a huge element, if they are somehow huge). But that would be about space complexity.

gcr1y ago

pardon! it's fun to think about though!

vismit20001y ago

This is covered in section 9.3 in CLRS book - Medians and Order Statistics

SkiFire131y ago

I wonder what's the reason of picking groups of 5 elements instead of 2 or 8.

danlark1y ago

3 and 4 elements will fail to prove the complexity is linear

You still can do 3 or 4 but with slight modifications

https://arxiv.org/abs/1409.3600

For example, for 4 elements, it's advised to take lower median for the first half and upper median for the second half. Then the complexity will be linear

lalaland11251y ago

1. You want an odd number so the median is the middle element of the sublist.

2. One and three are probably too small

nilslindemann1y ago

"ns" instead of "l" and "n" instead of "el" would have been my choice (seen in Haskell code).

robinhouston1y ago

nilslindemann1y ago

Sure, naming is hard, but avoid "l", "I", "O", "o".

j / k navigate · click thread line to collapse