> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.
I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.
Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, since people would no longer even attempt to visit them.
Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?
People get hostile at me online for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle of attack than "fairness".
Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.
The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].
As long as the content is not copyrighted and it's not regurgitating the exact same content, then it should be okay.
So no, it's not a false equivalence.
When a person "scrapes" a website by clicking through the link it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also if the person needs that information again odds are they'll click on a bookmark or a search link and repeat the impression process all over again.
When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?
> Because it's false equivalence? ChatGPT isn't a human being.
Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?
It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.
To be fair, so are you.
There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.
Second, try applying this logic to literally anything else and you'll see why it's absurd:
"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"
"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"
Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?
But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.
I also agree it's not the only argument and ultimate proof.
I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.
I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.
If ChatGPT were to be required to share its sources, they would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
Oh, wait, I'm not going to cite sources in a non-scientific work as this leads to madness. The following is a previous post of mine on HN
"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.
RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."
If you previously interacted with people on this issue, you must know that.
It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.
Air is zero-sum. Knowledge is not.
"Learning is unfair" is not an argument you want to win.
The difference is in scale.
A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by that one designer rounds down to zero as a percentage of the economic value created by 'video game art'.
OpenAI can consume all of the video game artists, ever, create an art design product and capture a significant percentage of the economic productivity of video game art.
The difference is scale. At scale it becomes a problem.
Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.
This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.
That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".
https://en.wikipedia.org/wiki/Luddite
Might as well ban computers since they automated and eliminated a lot of manual jobs.
The problem of humans with no money should be solved by a social safety net and things like UBI.
I don’t do it on an industrial scale.
Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]
Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).
[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)
[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.
That is trivially disproved, as is the rest of your argument that follows from it as a premise.
1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?
2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?
3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.
So which of these examples are a better metaphor for what a LLM does?
I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.
Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.
What exactly are the incentives to publish information openly in that world?
(Will they even believe you if you say you’re the one who did the niche research powering some specific ChatGPT answer, in a world everyone knows that you can just ask an LLM?)
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday
My belief is that ChatGPT is actually not quite capable of that, after seeing examples of how it manufactures non-existing references. Besides, if it were capable of that, why would it not show your name as part of the answer already now?
The cynic in me thinks it’s not capable of that primarily because it is not a priority for OpenAI and training data strips attribution, with an explicit purpose: if the public knows that ChatGPT can trace back the source, OpenAI would be on the hook for paying all the countless non-consensual content providers on which work it makes money.
We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).
> Why would someone only ask an LLM questions when they were in the market to buy a book?
Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?
Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.
Since the damage is individually small (they took some code from me without attribution, OK) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.
Huh? No. Some artists are maybe?
> as they should be, being nearly eclipsed by AI
Not even close. It's like looking at the newest brand of clip art.
Non-artists don't (maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".
(Heck, artists have been told that with regard to other humans' art for centuries, for one)
Going even further, a lot of artists already know how to build on this new tech without ripping people off.
I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.
2. people share, get creative and get some sort of credit for it
3. scrape it all, feed it into a large deep neural network, and become a worse but easily accessible version of all this content
4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)
5. get stuck in an AI world of recycled content
People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims to kickstart them, and made sure the hardware supports a deep neural network big enough to do the job. Check out the alternatives to OpenAI's solutions; it's not hard.
I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust, but it'll come" and "Things are no worse now than when TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blinded to the idea that it might not be good and that we should seriously admit there are huge issues looming.
I also found it telling that the multiple people like that also weren't fans of in-person interaction, outside their friend group. They saw Discord interactions as just as fine as going out and having serendipitous moments in person, with other real people, and just actually living. Something else I feel technology has stolen from us with everyone always glued to their screen. It's funny how I've become something of a Luddite, proudly, and think we need less internet and more real world, cause, well, life is real world, being human is through real world interactions. And not ones mediated by your phone.
I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.
You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.
It had been a stressful day, so you take up your evening hobby of painting. You like nature scenes and trees, and 30 minutes in you receive a violation: evidently Bob Ross has already done this, and his surviving estate is now asking you to destroy the picture.
The next day you go into your job at the corporate bureaucracy slinging lines of JavaScript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot's going off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.
If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.
If the product scrapes the data and presents it on its own website, like ChatGPT and Google do, then that's effectively the same as taking away ad revenue from those websites, because they aren't getting the impressions.
Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.
I'm actually not really sure I have an opinion on the ethics of it. Same argument as Adblock. You don't get to control how people consume your content if you put it out in the world for free. That goes for profiles, or articles, reddit posts, StackOverflow, etc. The only thing that's ironic is that large tech companies throw a fit whenever you want to turn the tables and scrape them.
I'd also argue that Google directing traffic to your website is a good alignment of incentives. ChatGPT spitting out answers derived from your work with nothing given back to you in return is not.
I can't seem to find anything on OpenAI's crawler agent, so I'm skeptical they're considering robots.txt at all.
For now, I have removed my existing works, both technical and creative, from the internet and won't be adding more while I try to work out what to do.
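For what it's worth, whether a crawler honors robots.txt is something a publisher can at least express; a minimal sketch using Python's stdlib, with a hypothetical crawler user-agent string since OpenAI hasn't documented one (and robots.txt is purely advisory — a scraper can simply ignore it):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks a hypothetical AI crawler user-agent
# while leaving the rest of the site open to everyone else.
rules = """
User-agent: ai-crawler
Disallow: /

User-agent: *
Disallow: /private/
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The hypothetical AI crawler is disallowed everywhere...
print(rp.can_fetch("ai-crawler", "https://example.com/articles/1"))  # False
# ...while an ordinary browser-like agent is allowed.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/articles/1"))  # True
```

The catch, as the comment notes, is that this only works if the crawler identifies itself and chooses to comply.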
On the other hand, the focus on the potential of ChatGPT's natural language processing capabilities highlights the significance of learning and using LLMs (large language models) in data handling. The use of LLMs could lead to a future where traditional databases become obsolete, replaced by advanced language models. As such, integrating LLMs into our daily lives and processes could bring many benefits and possibilities.
At some point participating in the internet means your stuff is going to be seen. I wear glasses to read web content. I don't think the glasses company should pay royalties for what I read. chatGPT is a tool that allows me to understand and use the information people put onto the internet better.
Far from a matter of fairness, this is simply another way that selfish people are trying to monetize the future, to make it more and more difficult and expensive for others to participate.
"I've always wished I could charge everyone on earth. ChatGPT looks like the future. If I can tap the money flow there I will get mo' money."
I'm against it.
I’m not arguing for either side, just pointing out that we need to carefully consider what rights AI should share with humans.
You're being selfish too. How do you think we have phones, etc... ? Capitalism applies to knowledge too.
ChatGPT took advantage of that and wants to monetize it while cutting people who spent time, money, resources, etc... Just like copilot, plain and simple.
It was maybe unfair to telephone operators when connection automation was implemented, as it made them obsolete, but the older model couldn't scale, the same way reading text from the source doesn't scale for human productivity.
Also, I agree exactly. Advertising is increasingly useless. It's a tax on knowledge and it's gross. I can't wait for it to die.
I want to only pay for the stuff I use.
E.g. Summary of How to Win Friends and Influence People: Effective Steps to Better Interpersonal Relationships by Book Lyte
ChatGPT does more of a mashup with the learned data than humans need to, that'll do me.
We can only hope. It’s unfair to someone that my browser can ask your server for a page, I see an ad for random bullshit nobody would ever care about, and money changes hands behind the scenes and that counts as an economic transaction which boosts GDP. It’s unfair (in my favour) that I can piggy back off this to get things for free.
And when I say “someone“ I suspect “everyone”. Sadly spending money advertising “Yorkshire woman finds guaranteed way to win on the horses” doesn’t seem to have caused anyone to run out of money and have the whole thing collapse yet. And it’s unfair on real small businesses with products paying for adverts which people don’t see or are clicked by bots or are misreported and all they can do is throw money at Google and Facebook and hope.
Clearly, ownership of ideas runs out, because we all use linked lists or binary trees, or paper, or turbines or the list goes on. We don't pay money to the inventors of linked lists, or the heirs or successors-in-interest to the inventor of paper. Why not? When does ownership of an idea expire? Why do we unconsciously accept copyright or patent limits of today?
There's also an issue with simultaneous invention, but that's out of scope here. Clearly ChatGPT is just regurgitating or otherwise emitting previously-ingested material.
I dream of royalties going away so that we only get original content made for the love of expression, out of a feeling that it's important. I would be happy to have a LOT less stuff to look at if I didn't have to sift through so much garbage.
Of course, I am also in favor of UBI so that those creators can eat while they are doing it.
"If you applied the same set of rules to a human, how exactly would that look"
Simply put culture is the copying of each others ideas. When one of us started banging rocks together to make them sharp they didn't sell this idea to others, at best they traded sharpened arrows for something else.
The big issue with humans is we are commonly very conservative in our ideas. "Yesterday I did X, today I did X, and tomorrow I'll do X", fine and dandy until tomorrow a machine does X for nearly free. Instead of figuring out how to adapt our economic systems to deal with new systems of cheap and plenty the fearful and the greedy are looking for ways to maximize the amount they can profit or hold it back to maintain status quo.
Discussion is pointless because everyone already has an opinion and it's very firm.
Google has been doing this in search results for years, and so does Bing. Apple also does it in their built-in dictionary.
Why rant about ChatGPT, which at least currently comes from a comparatively small company?
question: How could the people who generate the content used in an AI language model be paid for their work?
answer: There are several ways in which the people who generate content for an AI language model could be paid for their work:
Royalty-based payment: Content creators could receive a percentage of the revenue generated from the use of their content in the AI language model.
Token-based payment: If the AI language model is built on a blockchain, content creators could be paid in tokens that could be traded for cryptocurrency or fiat currency.
Partnership with content publishers: The developers of the AI language model could partner with content publishers to compensate the creators of the scraped content.

[0] The answers focus on the technicalities of how the payments could be arranged, but the much bigger problem is that it's not clear who the payments should be going to: there's no immediately obvious or unique way of attributing a given output to specific training inputs. That would require a separate model with a lot of room for judgement/modelling decisions, or a new type of LLM that has that feature baked in.
2 - Code was trained from GitHub. GitHub is Microsoft. OpenAI is Microsoft money. So Microsoft trained its AI on Microsoft code. You disagree? Then GTFO from GitHub and don't feed Microsoft your code anymore.
3 (the most important point) - Q: "Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?"
Fuck YEAH!! Please do so. I hope that shit show of an ad model crashes and burns to the ground. You can't use the internet without wearing solid armor in the form of uBlock Origin and/or NoScript (or PiHole if you want the same readable experience on the rest of your household devices).
The neural net model is condensed to 800 GB.
https://www.springboard.com/blog/data-science/machine-learni...
Note that the "compression" there also includes the "intelligence" that it presents - you might be able to get some powerful compression of English text... but you can't ask a gzip file to come up with a joke about cats and dinosaurs.
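The compression point is easy to see at toy scale with the stdlib; redundant natural-language text shrinks dramatically under a generic compressor, but the result is a frozen archive, not something you can query for content that wasn't literally in the input (exact ratios will vary with the text):

```python
import zlib

# Natural-language text is highly redundant, so a generic
# compressor shrinks it a lot -- but all it can ever give
# back is the exact bytes it was fed.
text = ("the cat sat on the mat and the dog sat on the rug " * 200).encode()
packed = zlib.compress(text, level=9)

ratio = len(packed) / len(text)
print(f"{len(text)} bytes -> {len(packed)} bytes (ratio {ratio:.3f})")
```

An LLM is closer to lossy compression plus interpolation: it gives up exact reconstruction in exchange for being able to respond to prompts it never saw.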
A typical single-spaced page is 500 words long
That’s 179,280,000 full pages of text.
I wonder if they excluded any duplicated text.
Hopefully. This would be the best outcome I can think of for the Internet.
Obviously storage is not a major factor here.
The closest to an authoritative source on it is https://twitter.com/sama/status/1599671496636780546
> average is probably single-digits cents per chat; trying to figure out more precisely and also how we can optimize it
An attempt to work through it from related resources is https://twitter.com/tomgoldsteincs/status/160019698195510069...
In particular https://twitter.com/tomgoldsteincs/status/160019699090561433...
> So what would this cost to host? On Azure cloud, each A100 card costs about $3 an hour. That's $0.0003 per word generated.
> But it generates a lot of words! The model usually responds to my queries with ~30 words, which adds up to about 1 cent per query.
---
It is much less than $1/interaction.
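Working the quoted figures through (these are the tweet's assumptions, not measured numbers):

```python
# Back-of-envelope from the figures quoted above.
cost_per_gpu_hour = 3.00   # assumed Azure A100 price, USD
cost_per_word = 0.0003     # the tweet's per-word estimate, USD
words_per_reply = 30       # a typical short reply in the tweet's testing

# Implied throughput for the per-word estimate to hold:
words_per_hour = cost_per_gpu_hour / cost_per_word   # ~10,000 words/hour
cost_per_reply = cost_per_word * words_per_reply     # ~$0.009

print(f"~${cost_per_reply:.3f} per reply, i.e. about a cent")
```

That lands squarely in the "single-digit cents per chat" range sama quoted.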
> What's the New York Times scrambled egg recipe?
GPT returns the exact recipe. If I were NYT I'd be frustrated. Their content is now showing without the ad views or paywall.
Is there something analogous to saliency maps for LLM?
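There is: gradient-based input attribution carries over directly — the gradient of a chosen output logit with respect to the input tokens scores how much each token mattered (libraries such as Captum package this up for PyTorch models). A toy sketch with a single linear layer, where the gradient can be read off directly instead of backpropagated; everything here (the vocabulary, the random weights) is illustrative:

```python
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
n = len(vocab)

# Toy "model": one linear layer from a bag-of-words vector to
# next-token logits. Real LLMs are deep, but the saliency recipe
# (gradient of one output logit w.r.t. the inputs) is the same,
# just computed by backpropagation.
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

x = [0.0] * n
for word in ["the", "cat", "sat"]:
    x[vocab.index(word)] += 1.0

logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
target = max(range(n), key=lambda i: logits[i])  # "predicted" next token

# For a linear model, d logits[target] / d x[i] is just W[target][i];
# its magnitude is the saliency of input word i.
saliency = [abs(w) for w in W[target]]
ranked = sorted(zip(saliency, vocab), reverse=True)
```

What this does not give you is attribution back to *training* documents, which is the harder problem raised elsewhere in the thread.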
1) AI is open sourced and we adapt stably. Either everybody has the opportunity to be their own business, or there is UBI.
2) AI is open sourced but it is unfairly distributed. Only some people are suited to BTOB, and/or UBI is shit.
3) AI is not open sourced, the wealthy edge out mankind and a planet scale genocide occurs.
4) none of it matters because the looming war between the US & China explodes or climate change wipes us out in any meaningful capacity that could pursue AI.
Given the track record of our species, #1 feels like wishful thinking
If you are using GPT as a research tool, as opposed to asking a friend who is an expert in the subject, are you citing your friend when you write the paper, or are you going back and finding sources that back up your friend's point?
It’s really just building a better model.
Will LLMs drive interest/activity away from wikipedia.org? Will it put its own sources of high-quality ad-supported content -- wikihow.com, for example (though I can't be totally sure it scraped from there) -- out of business? Or is there an earth-shattering copyright suit against OpenAI in the works as we speak?
> Can this start breaking the ad-based model of the internet
Is the alternative that everything is behind some kind of paywall by default, to block scraping? Is that where we're heading?
In other areas of society where a bad thing cannot be stopped, we still use legislation to reduce the amount of it and mitigate some of the harm.
So, who gets credit for the word "mat" being generated in that context? I guess any texts talking about cats and mats in close proximity may deserve some of the "credit", but it goes way deeper than that, since why did ChatGPT choose to output such a trite sentence (albeit while only selecting one word at a time), rather than something else about cats, or perhaps a more interesting thing that cats often sit in/on...
People seem to assume that ChatGPT is pulling entire "facts" from various sources, but that's just not how it works - it's just feeding all the texts into a giant meat grinder of word statistics. It knows about words, not facts.
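That "meat grinder of word statistics" intuition can be demonstrated at toy scale with a bigram model; real LLMs condition on far longer contexts through learned weights, but the output is still "statistically plausible next word", not a looked-up fact:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat sat on the sofa .".split()

# Count which word follows which: this table is the model's
# entire "knowledge" of the text.
follows = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    follows[prev][word] += 1

def next_word(prev):
    # Greedy pick: the statistically most common successor.
    return follows[prev].most_common(1)[0][0]

print(next_word("cat"))  # "sat" -- not because the model knows what cats do
print(next_word("the"))  # "cat" -- just the most frequent bigram
```

Nothing in there knows what a cat or a mat is; it only knows which words tend to co-occur.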
...yes?
OpenAI: "challenge incorrect assumptions"
To me it looks more like memorizing enough of other employees' project contributions to try passing it all off as your own achievements in performance review.
so in that case it wouldn't be unfair ... :-)
Did ChatGPT pay for the content it is using? That was the original question...
"Copyright" "ingenuity of thought" etc are concepts that need to be overhauled since a lot more people now have access to higher education.
How could training an AI on the works of someone who has already been paid for them be unfair? Possibly because it affects their future marketability and income.
Current authors, artists, and internet commenters clearly have a stake in the results of their creative endeavors being used for gain they won't benefit from. This is very similar to the extractive monopolies of YouTube and the rest of social media: their profit at our expense.