> LLMs were not developed to be, do not function as, and are not use as data compression utilities.
Again, from an information-theoretic viewpoint, that is exactly what they are doing, how they were developed, and how they function.
I don't know of any serious ML researcher who would find this claim even remotely controversial. It really isn't just "a semantics game"; it's part of a foundational understanding of the topic. If you want to understand LLMs from this perspective, a good place to start is the autoencoder, which explicitly learns a compression scheme by minimizing reconstruction error. Then move on to more sophisticated embedding models (found in a lot of recommender systems), which learn additional objectives on top of minimizing reconstruction error. You'll see that Transformers and all other major NN architectures fall out of these basic principles.
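To make the autoencoder-as-compression point concrete, here's a minimal sketch (numpy only, no deep-learning framework): a linear autoencoder squeezes 8-dimensional inputs through a 2-number bottleneck and is trained by gradient descent purely to minimize reconstruction error. The toy data and dimensions are made up for illustration; a real autoencoder would be nonlinear and much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data that secretly lives on a 2-dim subspace of R^8,
# so a 2-dim code can reconstruct it almost perfectly.
latent = rng.normal(size=(256, 2))
mix = rng.normal(size=(2, 8))
X = latent @ mix                              # shape (256, 8)

d, k = 8, 2                                   # input dim, bottleneck dim
W_enc = rng.normal(scale=0.1, size=(d, k))    # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))    # decoder weights

def mse(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    return ((X @ W_enc @ W_dec - X) ** 2).mean()

lr = 0.01
initial = mse(X, W_enc, W_dec)
for _ in range(2000):
    Z = X @ W_enc                             # compress: 8 numbers -> 2
    X_hat = Z @ W_dec                         # decompress: 2 numbers -> 8
    G = 2 * (X_hat - X) / X.size              # d(MSE)/d(X_hat)
    W_dec -= lr * Z.T @ G                     # gradient step on decoder
    W_enc -= lr * X.T @ (G @ W_dec.T)         # gradient step on encoder
final = mse(X, W_enc, W_dec)
print(f"reconstruction MSE: {initial:.3f} -> {final:.3f}")
```

The only training signal is "reconstruct the input from fewer numbers", which is exactly a learned lossy compression objective; the linear case recovers something equivalent to PCA.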
> Please, come knocking when a service provider exists that will use LLM's to compactly store your company data.
This is literally what every vector DB company does right now, as well as all the "chat with your docs" startups.
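The store-and-retrieve pattern those products use can be sketched in a few lines: each document is mapped to a fixed-size vector, the vectors are stacked into an index, and a query is answered by nearest-neighbor lookup. Real systems use a learned neural embedding model; here a deterministic hashed bag-of-words embedding (a stand-in I'm using purely to keep the example self-contained) plays that role, and the documents are invented.

```python
import zlib
import numpy as np

DIM = 64  # size of each document vector

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a learned embedding model: hashed bag-of-words,
    L2-normalized so dot products are cosine similarities."""
    v = np.zeros(DIM)
    for word in text.lower().split():
        v[zlib.crc32(word.encode()) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "the quarterly revenue report for the sales team",
    "how to reset your corporate email password",
    "employee onboarding checklist and benefits overview",
]
index = np.stack([embed(d) for d in docs])    # the "vector DB"

def search(query: str) -> str:
    """Return the stored document nearest to the query vector."""
    scores = index @ embed(query)             # cosine similarity to each doc
    return docs[int(np.argmax(scores))]

print(search("reset password email"))
```

The compression angle is visible in the shapes: an arbitrarily long document is reduced to a fixed-size vector, and retrieval works because the embedding preserves enough information to match queries against it.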