It does suit the modus operandi of a number of American companies that start out as literally illegal/criminal operations until they get big and rich enough to pay a fine for their youthful misdeeds.
By the time some of them get huge, they're in bed with the government to dominate the market.
If it's only OK to scrape, lossy-compress, and redistribute book-paragraphs when it gets blended into a huge library of other attempts, then that's only going to empower big players that can afford to operate at that scale.
Nope. The law will side with whoever pays the most. Once OpenAI solidifies its top position, only then will regulations kick in. Take YouTube, for example—it grew thanks to piracy. Now, as the leader, ContentID and DMCA rules work in its favor, blocking competition. If TikTok wasn’t a copyright-ignoring Chinese company, it would’ve been dead on arrival.
It's completely unprecedented.
We allowed scraping images and text en masse when search engines used the data to let us find stuff.
We allow copying of style, and don't allow writing styles and aesthetics to be copyrighted or trademarked.
Then AI shows up, and people change lanes because they don't like the results.
One of the things that made me tilt towards the side of fair use was a breakdown of the Stable Diffusion model. The SD2.1 base model was trained on 5.85 billion images, all normalized to 512x512 BMP. That's 1MB per image, for a total of 5.85PB of BMP files. The resulting model is only 5.2GB. That's more than 99.9999% data loss from the source data to the trained set.
For every 1MB BMP file in the training dataset, less than 1 byte makes it into the model.
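For anyone who wants to check that arithmetic, here is a rough sketch. It uses only the figures quoted in the comment (5.85 billion images, 1MB each, a 5.2GB model) and assumes decimal units (1MB = 10^6 bytes, 1GB = 10^9 bytes); the variable names are made up for illustration and none of the input figures are verified here.

```python
# Figures as quoted in the comment; treated as given, not independently verified.
NUM_IMAGES = 5.85e9        # images in the training set
BYTES_PER_IMAGE = 1e6      # "1MB" per 512x512 BMP (decimal megabyte assumed)
MODEL_BYTES = 5.2e9        # "5.2GB" model (decimal gigabyte assumed)

# Total size of the normalized dataset, in bytes (5.85e15 bytes = 5.85PB).
dataset_bytes = NUM_IMAGES * BYTES_PER_IMAGE

# Fraction of the dataset's size that the model file represents.
retained_fraction = MODEL_BYTES / dataset_bytes

# Size reduction expressed as a percentage.
loss_percent = (1 - retained_fraction) * 100

# Model bytes available per training image.
bytes_per_image = MODEL_BYTES / NUM_IMAGES

print(f"dataset size:    {dataset_bytes / 1e15:.2f} PB")
print(f"size reduction:  {loss_percent:.5f} %")
print(f"bytes per image: {bytes_per_image:.3f}")
```

Under those assumptions the reduction comes out to roughly 99.9999%, and the model holds a bit under one byte per training image.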
I find it extremely difficult to call this redistribution of copyrighted data. It falls cleanly into fair use.
Their argument against this amounts to "we're not using it like they intend it to be used, so it's fine if we obtain it illegally", and that's a bs standard, totally divorced from any legal reality.
Fair Use covers certain transformative uses, certainly, but it doesn't cover illegal obtaining of the content.
You can't pirate a book just because you want to use it transformatively (which is exactly what they've done), and that argument would never hold up for us as individuals, so we sure as hell shouldn't let tech companies get a special carve-out for it.
Burning the bridge so nobody else can legally scrape: that's the line.
The anti-AI stance is what is baffling to me. The path trodden is what got us here, and obviously nobody could have paid people upfront for the wild experimentation that was necessary. The only alternative is not having done it.
Given the path it has put us on, people are either insanely cruel or just completely detached from reality when it comes to what is necessary to do entirely new things.
Perhaps the biggest “needs citation” statement of our time.
Not in any weirdly-self-aggrandizing "our tech is so powerful that robots will take over" sense, just the depressingly regular one of "lots of people getting hurt by a short-term profitable product/process which was actually quite flawed."
P.S.: For example, imagine having applications for jobs and loans rejected because all the companies' internal LLM tooling is secretly racist against subtle grammar-traces in your writing or social-media profile. [0]
We don't have to imagine such things, really, as that's extremely common with humans. I would argue that fixing such flaws in LLMs is a lot easier than fixing it in humans.
We have a term for that: "luddite". Those were English weavers who would break into textile factories and destroy weaving machines at the beginning of the 1800s. With extremely rare exceptions, all cloth is woven by machines now. The only handmade textiles in modern society are exceptionally fancy rugs and knit scarves from grandma. All the clothing you're wearing now was woven by a machine, and nobody gives this a second thought today.
Some not-problems, presented as though they are:
"How can we prevent the untimely eradication of Polio?"
"How can we prevent bot network operators from being unfairly excluded from online political discussions?"
"How can we enable context-and-content-unaware text generation mechanisms to propagate throughout society?"
For example, MKUltra tried to solve a problem: "How can I manipulate my fellow man?" That problem still exists today, and you bet AI is being employed to try to solve it.
History is littered with problems such as these.
Yes, we are clearly talking about things that are mostly still to come here. But if you assign a 0 until it's a 1, you are just signing out of advancing anything that's remotely interesting.
If you are able to see a path to 1 on AI, at this point, then I don't know how you would justify not giving it our all. If you see a path and in the end using all of human knowledge up to this point was needed to make AI work for us, we must do that. What could possibly be more beneficial to us?
This is regardless of all the issues that will have to be solved and the enormous amount of societal responsibility this puts on AI makers, which I, as a voter, will absolutely hold them accountable for (even though I am actually fairly optimistic they all feel the responsibility and are somewhat spooked by it too).
But that does not mean I think it's responsible to try and stop them at this point — which the copyright debate absolutely does. It would simply shut down 95% of AI, tomorrow, without any other viable alternative around. I don't understand how that is a serious option for anyone who roots for us.
Firstly, *skeptics.
Secondly, being skeptical doesn't mean you have no optimism whatsoever, it's about hedging your optimism (or pessimism for that matter) based on what is understood, even about a not-fully-understood thing at the time you're being skeptical. You can be as optimistic as you want about getting data off of a hard drive that was melted in a fire, that doesn't mean you're going to do it. And a skeptic might rightfully point out that with the drive platters melted together, data recovery is pretty unlikely. Not impossible, but really unlikely.
Thirdly, OpenAI's efforts thus far are highly optimistic to call a path to true AI. What are you basing that on? Because I have not a deep but a passing understanding of the underlying technology of LLMs, and as such, I can assure you that I do not see any path from ChatGPT to Skynet. None whatsoever. Does that mean LLMs are useless or bad? Of course not, and I sleep better too knowing that LLM is not AI and is therefore not an existential threat to humanity, no matter what Sam Altman wants to blither on about.
And fourthly, "wanting" to stop them isn't the issue. If they broke the law, they should be stopped, simple as. If you can't innovate without trampling the rights of others then your innovation has to take a back seat to the functioning of our society, tough shit.
I don’t think that the consumer LLMs that OpenAI is pioneering are what needs optimism.
AlphaFold and other uses of the fundamental technology behind LLMs need hype.
Not OpenAI.
I think you raise some interesting concerns in your last paragraph.
> enormous amount of societal responsibility this puts on AI makers — which I, as a voter, will absolutely hold them accountable for
I'm unsure of what mechanism voters have to hold private companies accountable. For example, whenever YouTube uses my location without me ever consenting to it - where is the vote to hold them accountable? Or when Facebook facilitates micro-targeting of disinformation - where is the vote? Same for anything AI. I believe any legislative proposals (with input from large companies) are more likely to create a walled garden than to actually reduce harm.
I suppose no need to respond, my main point is I don't think there is any accountability thru the ballot when it comes to AI and most things high-tech.
Oh, the humanity! Who will write our third-rate erotica and Russian misinformation in a post-AI world?
OpenAI's case is especially egregious, with the whole thing starting as 'open' and reaping the benefits, then doing its best in every way to shut the door after itself by scaring people over AI apocalypses. If your argument is seriously that it is necessary to shamelessly steal and lie to do new things, I question your ethical standards, especially in the face of all the openly developed models out there.
> The anti-AI stance is what is baffling to me.
I think it’s unfair to paint any legal controls over this incredibly important, high-stakes technology as being “anti”. They’re not trying to prevent innovation because they’re cruel, they’re just trying to somewhat slow down innovation so that we can ensure it’s done with minimal harm (e.g. making sure content creators are compensated in a time of intense automation). Like we do for all sorts of other fields of research, already! And isn’t this what basically every single scholar in the field says they want, anyway - safe, intentional, controlled deployment?
As you can tell from the above, I’m as far from being “anti-AI” or technically pessimistic as one can be — I plan to dedicate my life to its safe development. So there’s at least one counterexample for you to consider :)
> The anti-AI stance is what is baffling to me
I don't see a lot of anti-AI sentiment; instead I see concern about how it's being managed and controlled by the larger companies, with resources that no startup could dream of. OpenAI was supposed to release its models and be, well, open, but fine, they're not. But how they are going about things is questionable and unnecessarily aggravating.
And who is the one calling for action?
Sorry for being dense, but I'm trying to understand if I'm the "strong" or the "weak" in your analogy.
The work of artists, authors, etc.
I know the legal situation is currently messy, but that's exactly the point: anyone who can't engage in a lengthy legal battle and defend their position in court is being sacrificed. The companies behind LLMs are spending hundreds of millions of dollars on lobbying and exploiting loopholes.
Let's be real: without the data there wouldn't be LLMs, so it's crazy that some people are downplaying its significance or value, while on the other hand they're losing sleep over finding fresh sources to scrape.
The big publishers seem to have given up and decided it's best to reach agreement with their counterparts, while independent authors are given the finger.
"Hugely beneficial" is a stretch at this point. It has the potential to be hugely beneficial, sure, but it also has the potential to be ruinous.
We're already seeing GenAI being used to create disinformation at scale. That alone makes the potential for this being a net-negative very high.
I don't think this is the "ends justify the means" argument you think it is.
Automobiles allow people to travel great distances over short periods of time, increase physical work capacity, allow for building massive structures, and allow for farming insane amounts of food.
Both the internet and automobiles have positively affected my life, and I assume the lives of many others. How are any of these aimless questions?
I do not have confidence in the Supreme Court in general, and I think there's a real risk that in deciding on AI training they upend copyright of digital materials in a way that makes it worse for everyone.
The crazy thing is that there hasn't been an injunction to make them stop.
Update: ML doesn't copy information. It can merely memorise some small portions of it.
A more fitting metaphor would be something like... If you had the ability to read all the books in the library extremely quickly, and to make useful mental connections between the information you read such that people would come to you for your vast knowledge, should you be allowed in the library?
https://www.copyright.gov/title37/201/37cfr201-14.html
§ 201.14 Warnings of copyright for use by certain libraries and archives.
....
The copyright law of the United States (title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material.
Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specific conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use,” that user may be liable for copyright infringement.
This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.
You can make a copy. If you (the person using the copied work) use it for something other than private study, scholarship, or research, or reproduce it beyond "fair use", then you, the person doing that (not the person who made the copy), are liable for infringement. It would be perfectly legal for me to go to the library and make photocopies of works. I could even take them home and use the photocopies as reference works to write an essay and publish that. If {random person} took my photocopied pages and then sold them, that would likely go beyond the limits placed on how photocopied works from the library may be used.
Recording devices permitted artists to sell more art.
Many of the uses of AI people get most excited about seem to be cutting the expensive human creators out of the equation.
So yeah it had a profound effect, but we got consent for the parts that fundamentally relied on other people.