Also, for in-memory data, comparison-based sorting is much slower than shuffling, both asymptotically (O(n log n) vs. O(n)) and in practice. I did not investigate whether non-comparison sorting algorithms (e.g. radix sort) have competitive performance.
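To make the asymptotic difference concrete, here is a minimal sketch (my own illustration, not from the article) contrasting the two approaches: a Fisher-Yates shuffle, which does O(n) work, versus shuffling by sorting on random keys, which does O(n log n) comparisons.

```python
import random

def shuffle_fisher_yates(items):
    # O(n): one swap per element, each with a single random index draw.
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = random.randrange(i + 1)
        a[i], a[j] = a[j], a[i]
    return a

def shuffle_by_sorting(items):
    # O(n log n): attach a random key to each element, then sort by it.
    return [x for _, x in sorted((random.random(), x) for x in items)]

data = list(range(10))
print(sorted(shuffle_fisher_yates(data)) == data)  # True: output is a permutation
print(sorted(shuffle_by_sorting(data)) == data)    # True: output is a permutation
```

Both produce uniformly random permutations (assuming distinct random keys in the sorting variant); the difference is purely in the amount of work done per element.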
https://www.computerworld.com/article/2762287/microsoft-s-eu...
Performance probably wouldn't be as good as in the OP's article, though.
Uniform sampling is also quite easy (and principled): you can just take the first n records.
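A quick sketch of why this works (my own illustration): once the records are fully shuffled, the first n of them form a uniform sample without replacement, equivalent to `random.sample`.

```python
import random

def uniform_sample(records, n, seed=None):
    # Shuffle the records, then take a prefix: because every permutation
    # is equally likely, the first n records are a uniform sample
    # without replacement.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n]

sample = uniform_sample(range(1000), 5)
print(len(sample))  # 5
```

In the on-disk setting from the article, this means sampling is just a sequential read of the file's prefix, with no extra bookkeeping.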
One reason cited in TFA for the half-shuffled-files approach is that it makes it easy to rotate old data out of, and new data into, the half-shuffled files.
The downside is that it still needs one random read per element, so cache-friendly hierarchical algorithms, like the one described in the post, are probably still faster for on-disk data.
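To illustrate the access pattern I mean (a sketch of my own, not the post's code): each step of a Fisher-Yates shuffle pairs the current position with a uniformly random earlier position, so on-disk data would pay roughly one random seek per element.

```python
import random

def fisher_yates_access_pattern(n, seed=0):
    # Record the index pairs each swap touches. The first index moves
    # sequentially, but the second is uniformly random in [0, i], which
    # on disk translates to one random seek per element.
    rng = random.Random(seed)
    accesses = []
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)
        accesses.append((i, j))
    return accesses

pattern = fisher_yates_access_pattern(8)
print(len(pattern))  # 7: one swap (hence one random read) per element after the first
```

A hierarchical scheme instead groups elements into buckets that fit in memory, converting most of those random seeks into sequential reads.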