
mdhb 1y ago

So built on stolen data essentially.

bigyikes 1y ago

Does that imply I just stole your comment by reading it?

No snark intended; I’m seriously asking. If the answer is “no” then where do you draw the line?

mdhb (OP) 1y ago

I don’t actually think this is complicated and reading a comment is not the same thing as scraping the internet and you obviously know that.

A few factors that come to mind would be:

- scale

- informed consent, of which there was none in this case

- how you are going to use that data. For example, using everybody else's work so the world's richest company can make more money from it while giving back nothing in return is a bullshit move.

llamaimperative 1y ago

I think it's even simpler than that: incentives. The entire premise of copyright law (and all IP law) is to protect the incentive to create new stuff, which is often a very risky and highly time or capital intensive endeavor.

So here's the question:

Does a person reading a comment destroy the incentive for the author to post it? No. In fact, it is the only thing that produces the incentive for someone to post. People post here when they want that thing to be read by someone else.

Does a model sucking up all the artistic output of the last 400 years and using that to produce an image generator model destroy the incentive of producing and sharing said artistic output? Yes. At least, that is the goal of such a model -- to become so good it is competitive with human artists.

Of course you have plenty of people positioned to benefit from this incentive-destruction claiming it does no such thing. I personally tend to put more credence in the words of people who have historically actually been incentivized by said incentives (i.e. artists) who generally seem to perceive this as destructive to their desire to create and share their work.


bigyikes 1y ago

I personally disagree but you make fair points.

Scale: Many companies (e.g. Google, Bing) have been scraping at scale for decades without issue. Why does scale become an issue when an LLM is thrown into the mix?

Informed consent: I’m not sure I fully understand this point, but I’d say most people posting content on the public internet are generally aware that people and bots might view it. I guess you think it’s different when the data is used for an LLM? But why?

Data usage: Same question as above.

I just don’t see how ingestion into an LLM is fundamentally different than the existing scraping processes that the internet is built on.


cwp 1y ago

Reading a comment is exactly the same thing as scraping the internet, you just stop sooner.

xwolfi 1y ago

But then if I write a Pulitzer Prize article called "No snark intended: How the web became such a toxic place", where your comment, and all of your other comments for good measure, figure prominently while I ridicule you and this habit of dumbing down complex problems to reduce them to little witty bites, maybe you'd feel I stole something.

Not something big, not something you can enforce, but you'd feel very annoyed I'm making good money on something you wrote while you get nothing. I think?

Spivak 1y ago

I think scale is what changes the nature of the thing. At the point where you're having a machine consume billions of documents I don't think you could reasonably call that reading anymore. But what you are doing in my eyes is indexing, and the legal basis for that is heavily dependent on what you do with it.

If a human reads it that would be a reproduction of the work, but if you serve that page as a cache to a human you're okay, usually.

If you compile all that information in a database and use it to answer search queries that's also okay, and nothing forbids you from using machine learning on that data to better answer those search queries.

Both of the above are actually being challenged right now but for the time being they're fine.

But that database is a derivative work, in that it contains copyrighted material and so how you use it matters if you want to avoid infringement — for example a Google employee SSHing to a server to read NYT articles isn't kosher.

What isn't clear is whether the model is a derivative work. Does it contain the information, or is it new information created from the training data? Sure, if you're clever you could probably encode information in the weights and use it as a fancy zip file, but that's a matter of intent. If you use Rewind or Windows Recall and it captures a screenshot of a NYT article and then displays it back to you later, is that a reproduction? Surely not. And that's an autonomous system that stores copyrighted data and regurgitates it verbatim.

So if it's impractical to actually use it for piracy and it very obviously isn't anyone's intent for it to be used as such then I think it's hard to argue it shouldn't be allowed, even on data that was acquired through back channels.

But copyright is more political than logical so who knows what the legal landscape will be in 5 years, especially when AI companies have every incentive to use their lawyers to pull the ladder up behind them.

cush 1y ago

Reading, no. Selling derivative works using it, yes.

cwp 1y ago

If I read your comment, then write a reply, is it a derivative work?

renewiltord 1y ago

Data gets either stolen or freed depending on whether the guy who copied it is someone you dislike or like. Personally, I think that Apple is giving the data more exposure which, as I've been informed many times here, is much more valuable than paying for the data.

kmeisthax 1y ago

The irony of "do it for the exposure" is that everyone who actually wants to pay you in exposure isn't actually going to do that, either because they aren't popular enough to measurably expose you, or because they're so popular that they don't want to share the limelight.

AI is a unique third case in which we have billions of creators and no idea who contributed what parts of the model or any specific outputs. So we can't pay in exposure, aside from a brutally long list of unwilling data subjects that will never be read by anyone. Some of the training data is being regurgitated unmodified and needs to be attributed in full, some of it is just informing a general understanding of grammar and is probably being used under fair use, and yet more might not even wind up having any appreciable effect on the model weights.

None of this matters because nobody actually agreed to be paid in exposure, nor was it ever in any AI company's intent - including Apple - to pay in exposure. Data is free purely because it would be extraordinarily inconvenient if anyone in this space had to pay.

And, for the record, this applies far wider than just image or text generators. Apple is almost surely not the worst offender in the space. For example: all that facial recognition tech your local law enforcement uses? That was trained on your Facebook photos.

ytdytvhxgydvhh 1y ago

What’s the problem with that? Reproducing copyrighted works in full is problematic obviously. But if I learned English by watching American movies, I didn’t steal the language from the movie studios, I learned it.

asadotzler 1y ago

You're not a machine capable of acquiring that "learning" with zero effort and selling that learning to infinite buyers.

threeseed 1y ago

Web scraping is legal.

And if you run a website and want to opt out, simply add a robots.txt.

The standard way of preventing bots for 30 years.

mdhb (OP) 1y ago

How are people supposed to block it when they stole all the data first, and only after that point decided to tell anyone what user agent to block and how they are planning to exploit your work for their profit?

threeseed 1y ago

You just have a rule that says block everything except crawlers: A, B, C.

Also the AppleBot was known about before it appeared in Siri.
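For anyone wondering what the "block everything except A, B, C" rule could look like, here is a minimal sketch, not anyone's actual config. The crawler names `CrawlerA` and `CrawlerB` and the example.com URL are placeholders I made up. It uses Python's standard `urllib.robotparser` to check how a compliant bot would read such a file. Keep in mind robots.txt is advisory: it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist robots.txt: the crawler names are placeholders.
# Named crawlers may fetch everything; every other user agent is disallowed.
ROBOTS_TXT = """\
User-agent: CrawlerA
User-agent: CrawlerB
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str = "https://example.com/post/1") -> bool:
    """Return True if a robots.txt-compliant bot with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CrawlerA", "CrawlerB", "SomeNewBot"):
        print(agent, allowed(agent))
```

The point of the allowlist form is that a crawler nobody has heard of yet falls under the `*` group and is disallowed by default, so you don't need to know its user agent in advance. That only holds for bots that actually check the file.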

