There's a quite "legendary" Game Boy Advance game out there (Klonoa - Densetsu no Star Medal) that never got an English translation because it uses some sort of in-house compression, created by Namco so the game could fit on a GBA cartridge. AFAIK no one has ever been able to crack it open and release code to de/compress it.
A while ago I had a "bounty" of USD 100 for anyone who could do it (just the decompression and re-compression, not translation), but there aren't many people who want to fiddle with low-level GBA coding.
If the bounty were high enough, there are people out there who do this sort of thing professionally and would probably jump at the opportunity.
EDIT: You are missing `csv+zstd` ? It should obsolete `csv+gzip` at all speeds and compression levels.
There is a Pareto-optimality frontier here - I ran my testing back in 2016 https://code.ivysaur.me/compression-performance-test/ but the numbers are now a little obsolete (e.g. zstd and brotli have both seen a lot of improvements since).
EDIT: indeed, it's missing in `to_csv` - seems like an oversight.
There was, however, no machine learning or optimization involved. Instead, he called it "prospecting" and just generated a new one from scratch each time until he found something interesting.
I'm particularly proud of this meta approach, and I actually think it could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
Hyperparameter optimization is currently focused on minimizing cross-validation error, but using this concept you could have weights on accuracy, training time, and prediction time (very similar to compression, where the three dimensions are size, write time, and read time), and then, given a new unknown dataset, you could predict which model/hyperparameters to use.
Maybe this should be patented ;)
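To make the idea concrete, here's a hypothetical sketch (the candidate names and numbers are made up for illustration): score each model candidate with user-chosen weights on accuracy, training time, and prediction time, mirroring compression's size/write-time/read-time trade-off.

```python
# Hypothetical sketch: score model candidates with user-chosen weights on
# accuracy, training time, and prediction time -- the same three-way
# trade-off as compression's size / write time / read time.
def score(candidate: dict, w_acc: float = 1.0,
          w_train: float = 0.1, w_pred: float = 0.1) -> float:
    # Higher is better: reward accuracy, penalise time costs.
    return (w_acc * candidate["accuracy"]
            - w_train * candidate["train_seconds"]
            - w_pred * candidate["predict_seconds"])

# Made-up candidates standing in for real cross-validation results.
candidates = [
    {"name": "small_rf", "accuracy": 0.91, "train_seconds": 2.0,  "predict_seconds": 0.05},
    {"name": "big_gbm",  "accuracy": 0.94, "train_seconds": 60.0, "predict_seconds": 0.40},
    {"name": "linear",   "accuracy": 0.88, "train_seconds": 0.5,  "predict_seconds": 0.01},
]
best = max(candidates, key=score)
print(best["name"])  # with these weights the cheap linear model wins
```

Crank `w_train` and `w_pred` down to zero and you recover plain accuracy-maximizing model selection; the interesting part is the region in between.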
There is already a substantial field of Machine Learning/Meta Learning which focuses on exactly this. For example, this paper [1] from NeurIPS 2015 does exactly what you suggest.
[1]: https://papers.nips.cc/paper/5872-efficient-and-robust-autom...
Off the cuff, this seems like it competes with an alternative of simply running every considered compression algorithm and choosing the optimal one. I guess this would be advantageous if the RF classifier is meaningfully faster to run than the compression algorithms themselves. Is it?
It was really just the example that made me wonder why I have to consider which compression would be best for files with my characteristics - but not saying it was best practice to begin with haha!
(I use PostgreSQL, due to my ignorance.)
See this recent timescaledb post (since you mentioned Postgres) which goes over an array of techniques used in column-store databases that general-purpose compressors on a csv file full of data would not be able to match:
https://blog.timescale.com/blog/building-columnar-compressio... discussion: https://news.ycombinator.com/item?id=21412596
Gwern was kind enough to assist by sending over the exact version numbers of all the Haskell libraries it depends on, and answering some questions about deployment. The version numbers turned out to be crucial to getting everything running.
IMO https://www.gwern.net/ is the ideal combination of style + ease of use (for the writer) + effective ways of organizing knowledge.
The whole thing is hosted out of an S3 bucket, so there's no server to manage and zero downtime. I've wondered if it'd be possible to use github pages for this purpose, since that would make it completely free. But it only takes a couple hours of work to get everything up and running. The biggest delay is waiting for haskell to compile all the libraries.
I've got an easy-to-use library for arithmetic encoding in Java, it would be easy to port to other languages: https://github.com/comperical/MirrorEncode
I'm pretty ignorant on the topic, so I get that that may be off, but if so, why wouldn't that be a valid solution to compressing data?
The problem is that you have to encode/send the DNN itself, otherwise your receiver won't know how to decode the data. If you are not smart, the added codelength of the DNN will likely blow away your savings. If you are smart, this leads to a whole formulation of machine learning called MDL:
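The two-part code idea behind MDL can be stated concretely. In this toy sketch (function names are illustrative, not from any library; zlib stands in for a proper coder), the cost of a model is the bits to transmit the model itself plus the bits to transmit the data's residuals under that model:

```python
import pickle
import zlib

def description_length(model_params, residuals: bytes) -> int:
    """Toy two-part MDL cost: bits to send the model, plus bits to send
    the data given the model. A bigger model only wins if it shrinks the
    residuals by more than its own encoded size."""
    model_bits = len(zlib.compress(pickle.dumps(model_params)))
    data_bits = len(zlib.compress(residuals))
    return model_bits + data_bits

# A tiny model with the same residuals always beats a huge one:
cheap = description_length([1.0, 2.0], b"x" * 1000)
bloated = description_length(list(range(10000)), b"x" * 1000)
print(cheap < bloated)
```

This is exactly the trap with shipping a DNN as the decoder: its `model_bits` term is enormous, so the residual savings have to be enormous too.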
More seriously, the generalized compression algorithm can be one of the keys or even one of the definitions of generalized artificial intelligence
And Kolmogorov complexity was present via a subtle reference in the very last episode ...
This is using machine learning to predict, for given heterogeneous tabular data, which compression algorithm will yield the highest compression ratio.
There are multiple other ways to do this approximation.
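One cheap approximation (a sketch of the general idea, not what the article does): estimate a codec's ratio by compressing only a small sample of the data instead of all of it.

```python
import zlib

def estimated_ratio(data: bytes, sample_size: int = 4096) -> float:
    """Estimate a codec's compression ratio from a small prefix sample.
    Much cheaper than compressing everything, at the cost of accuracy
    when the data's character changes past the sample."""
    sample = data[:sample_size]
    if not sample:
        return 1.0
    return len(zlib.compress(sample)) / len(sample)

data = b"timestamp,value\n" + b"2019-01-01,42\n" * 50_000
print(round(estimated_ratio(data), 3))
```

Sampling a prefix is the crudest variant; sampling random blocks, or using entropy estimates instead of an actual compressor, are the obvious refinements.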