The problem is that the choice of compression algorithm depends heavily on the data size, which is why simply picking the algorithm by running benchmarks on a sample will give misleading results.
However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!
Obviously not for features that are simple and cheap to compute, such as the number of rows and columns, which don't get more expensive as the data grows.
But I actually do this for predicting the uniqueness of values! You basically want an idea of how many unique values you have per column. But if you just count uniques in a sample, the raw numbers will be completely wrong for the full dataset.
You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b... Basically the formula calculates the uniqueness on a sample (e.g. 10,000 rows) and then extrapolates the result to your actual data size. For example, it might find 100 unique values in a sample of 10,000 rows, when in reality you might have 500 unique values.
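The exact formula lives in the repo linked above; as a hedged illustration of the same idea (extrapolating sample uniqueness upward rather than trusting the raw sample count), here's a sketch using the classic Chao1 richness estimator. This is not what shrynk actually uses, just one standard way to correct a sample count based on how many values were seen only once or twice:

```python
import numpy as np
import pandas as pd

def extrapolated_nunique(series: pd.Series, sample_size: int = 10_000) -> float:
    """Estimate the unique-value count of the full column from a sample.

    Sketch only: uses the bias-corrected Chao1 estimator, not the exact
    formula from shrynk.
    """
    n = min(sample_size, len(series))
    sample = series.sample(n, random_state=0)
    counts = sample.value_counts()
    observed = len(counts)            # distinct values seen in the sample
    f1 = int((counts == 1).sum())     # values seen exactly once
    f2 = int((counts == 2).sum())     # values seen exactly twice
    # Chao1 correction: many singletons imply many unseen values;
    # the (f2 + 1) form avoids division by zero when no doubletons exist
    estimate = observed + f1 * (f1 - 1) / (2 * (f2 + 1))
    # can never have more unique values than rows in the full data
    return min(estimate, len(series))
```

The key property is that the estimate is always at least the raw sample count and grows when the sample is full of values seen only once (a sign that the sample barely scratched the surface of the column's diversity).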
That said, I was really more interested in practical runtimes. In practice the ML approach may have a high startup cost (e.g., loading the model), whereas for most dataset sizes a simple linear-complexity approach may be perfectly fine...