Here's a proof: consider the strings of length n or less; suppose there are M of them in total. Their average length is the sum of all their lengths divided by M, and the average length of their compressed versions is the total length of the compressed versions divided by M. Since the compression is lossless, the compressed strings must all be distinct.
Since there are M distinct outputs, if any of them had length greater than n, then by the pigeonhole principle some string of length at most n would be unused as an output, and the average could be improved by remapping that input to the unused shorter string instead. So any optimal compression method must map only to the strings of length at most n.
So the M outputs are exactly the M inputs, possibly permuted. Their total length is therefore the same, and hence so is their average length.
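The permutation argument can be checked numerically for small n. A sketch in Python, using a random permutation of the string set as a stand-in for an arbitrary optimal compressor (any injection from a finite set into itself is a permutation):

```python
import random

n = 8
# All binary strings of length 1..n (the argument is unchanged if the
# empty string is included as well).
S = [bin(i)[2:].zfill(length)
     for length in range(1, n + 1)
     for i in range(2 ** length)]

# A lossless compressor whose outputs stay within S is an injection from
# S into itself, i.e. a permutation. Model one with a random shuffle.
compress = dict(zip(S, random.sample(S, len(S))))

assert len(set(compress.values())) == len(S)   # lossless: outputs all distinct
total_in = sum(len(s) for s in S)
total_out = sum(len(compress[s]) for s in S)
assert total_in == total_out                   # a permutation preserves total length
print(total_in / len(S))                       # average length is unchanged
```

However the permutation is chosen, the totals match, so no compressor can beat the identity map on average over all strings of length at most n.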
The article you’ve linked says nothing about averages. It says that for every algorithm there are at least some input files that increase in size. It even elaborates on that:
Any lossless compression algorithm that makes some files shorter must necessarily make some files longer, but it is not necessary that those files become very much longer. Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded. In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input.
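That escape mechanism is straightforward to implement. A minimal sketch in Python, using zlib as the underlying codec and a whole flag byte rather than the single bit the article mentions, purely for simplicity:

```python
import zlib

def compress(data: bytes) -> bytes:
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return b"\x01" + packed   # flag 1: payload is zlib-compressed
    return b"\x00" + data         # flag 0: codec "turned off", payload stored raw

def decompress(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if flag == b"\x01" else payload
```

Compressible inputs shrink (at the cost of the one flag byte); incompressible inputs grow by exactly one byte, never more, which is the kind of bounded worst case the article is describing.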
I don't think this is true. If it were, lossless compression would be useless in a lot of applications. It's pretty easy to come up with a counterexample.
E.g. (a simple Huffman code off the top of my head, not optimal):
symbol -> code
"00" -> "0"
"01" -> "10"
"10" -> "110"
"11" -> "111"
If "00" appears 99.999% of the time and the other three symbols together account for the remaining 0.001%, the output will "on average" be slightly more than half the length of the input.
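Under those assumptions the expected output length is easy to compute. A quick sanity check in Python (splitting the remaining 0.001% evenly across the three rare symbols is my assumption, not something fixed by the example):

```python
code = {"00": "0", "01": "10", "10": "110", "11": "111"}
probs = {"00": 0.99999, "01": 1e-5 / 3, "10": 1e-5 / 3, "11": 1e-5 / 3}

# No codeword is a prefix of another, so the code is uniquely decodable.
assert all(not a.startswith(b)
           for a in code.values() for b in code.values() if a != b)

# Expected output bits per 2-bit input symbol.
expected_bits = sum(probs[s] * len(code[s]) for s in code)
ratio = expected_bits / 2          # output length relative to input length
print(ratio)                      # just over 0.5: about half the input length
```

This doesn't contradict the counting argument, which averages uniformly over all strings; a skewed source distribution is exactly what makes compression pay off in practice.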