
crystal_revenge 10mo ago

> An LLM just remembers/get inspired by what it consumes

As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.

Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records a screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't produce anything really like the original, but by most definitions it is still considered copyright infringement.


Workaccount2 10mo ago

They are in no way compression algorithms. They can be modeled like that in the same way you can model humans as lossy compression algorithms.

You would never use a human to back up your financial reports, but the human might be able to give a good overview. You would never use an LLM to back up your financial reports, but it might be able to give a good overview.

AI training data is disposable. There is nothing that could be called a compression algorithm that disposes of all of the data you put into it. AI uses training data as examples of what the next token in a token sequence is. The examples are disposable reference points, not the model itself. That's how you get image models that are 20GB in size despite training on 20PB of data. It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB - because it is not a compression algo, it's a 20GB shape formed by external data.

simion314 10mo ago

>It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB - because it is not a compression algo, it's a 20GB shape formed by external data.

You can compress 20PB of text to 20GB or even less if the input is super repetitive. The same goes for images: if 50% of the images are cats, then you learn how to represent the cat pixels with a few vectors, and then you could represent all the cats in the world doing all possible cat actions.
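A quick way to see the repetition point with ordinary lossless compression, as a rough sketch only: the repeated phrase below is a made-up stand-in for "super repetitive" input, and zlib is just a convenient standard-library compressor, not anything an LLM uses.

```python
import zlib

# Hypothetical stand-in for "super repetitive" input: one phrase repeated many times.
repetitive = b"the cat sat on the mat. " * 200_000  # roughly 4.8 MB

# Compress at the highest zlib level (9).
compressed = zlib.compress(repetitive, 9)
ratio = len(repetitive) / len(compressed)

# Lossless: decompressing returns the exact original bytes.
restored = zlib.decompress(compressed)

print(f"{len(repetitive)} bytes -> {len(compressed)} bytes (ratio {ratio:.0f}x)")
```

The ratio depends entirely on how redundant the input is, which is the point being made: the output size says little about the input size when the input repeats itself.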

But please have the courage to respond to this: when the AI is caught regurgitating the exact text from a popular book, the exact verses from a poem, or an exact function from some codebase, how can you claim that it is not memorizing things? If a human uses my poem (after reading it) and signs their name under it, would you defend them?

Workaccount2 10mo ago

The point is that it isn't compression. It's molding a plain structure iteratively into an ultra-complex one. The model starts and ends at 20GB. It might have features that are reminiscent of compression or act like it, but under the hood there is nothing like zip, rar, H.265, or JPEG going on.

And yes, LLMs can recall exact material, but it is excerpts and fragments. There is statistical significance to its ordering. Humans readily do this too (excerpts and fragments); most artists can draw a Batman symbol (but not an episode of Batman). That doesn't in any way mean that artists should not be allowed to ever see a Batman symbol. It means that artists shouldn't be allowed to get paid to draw one. And they are not. And LLMs are not exempt either.

But the fix is output filtering, just like for everything else that can violate copyright. That is already being done (albeit poorly, but way better than 2 years ago), the same way artists will not draw the Batman symbol for you despite being able to.


crystal_revenge (OP) 10mo ago

> They are in no way compression algorithms.

I'm sorry, but this is a fundamentally incorrect view of machine learning (including, but not limited to, transformers).

From an information-theoretic perspective the two are essentially identical, with the exception that standard compression algorithms do not have a proper "loss" function beyond trading off reconstruction loss against the resulting compressed size.
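A minimal sketch of that connection, assuming a toy character bigram model (the function name and sample text are hypothetical, not from any library): a model that assigns probability p to the next symbol can, with an ideal entropy coder, encode that symbol in about -log2(p) bits, so better prediction means a shorter encoding. The sketch fits the model on the same text it encodes and does not count the model's own size, so it illustrates the link rather than being a fair compressor.

```python
import math
from collections import Counter, defaultdict


def ideal_code_length_bits(text: str) -> float:
    """Bits an ideal entropy coder would need under a bigram model fit to `text`."""
    alphabet_size = len(set(text))

    # Count how often each character follows each other character.
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    totals = {prev: sum(c.values()) for prev, c in counts.items()}

    # The first character has no context, so code it uniformly over the alphabet.
    bits = math.log2(alphabet_size)

    for prev, nxt in zip(text, text[1:]):
        # Add-one smoothing keeps every probability strictly positive.
        p = (counts[prev][nxt] + 1) / (totals[prev] + alphabet_size)
        bits += -math.log2(p)  # cost of this symbol under the model
    return bits


# Hypothetical sample text; any string with predictable structure works.
sample = "the cat sat on the mat and the cat ate the rat " * 50
raw_bits = 8 * len(sample)  # naive storage at 8 bits per character
model_bits = ideal_code_length_bits(sample)

print(f"raw: {raw_bits} bits, under the bigram model: {model_bits:.0f} bits")
```

Swapping the bigram counts for a stronger predictor lowers the total further, which is the sense in which training a predictive model and building a compressor optimize the same quantity.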

Here's a link to the relevant section on Wikipedia for more information if you'd like [0]. MacKay's Information Theory, Inference and Learning Algorithms is the standard full-text treatment of this topic [1]. Ted Chiang's article "ChatGPT is a Blurry JPEG of the web" is a pretty good "pop sci" exploration of this topic if you don't want to get too into the mathematics [2].

0. https://en.wikipedia.org/wiki/Data_compression#Machine_learn...

1. https://www.inference.org.uk/itprnn/book.pdf

2. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

Workaccount2 10mo ago

>They can be modeled like that in the same way you can model humans as lossy compression algorithms

Humans are totally capable of data compression. This will just devolve into a semantics game of what a data compressor is.

LLMs were not developed to be, do not function as, and are not used as data compression utilities. Please, come knocking when a service provider exists that will use LLMs to compactly store your company data.

