> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.
I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.
Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, since people would no longer even attempt to visit them.
Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?
People get hostile at me online for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle of attack than "fairness".
Because it's false equivalence? ChatGPT isn't a human being. It's a product that is built upon data from other sources.
The question is if this data is legal to scrape, which it is: Web scraping is legal, US appeals court reaffirms [https://news.ycombinator.com/item?id=31075396].
As long as the content is not copyrighted and it's not regurgitating the exact same content, then it should be okay.
So no, it's not a false equivalence.
When a person "scrapes" a website by clicking through the link it registers as a hit on the website and, without filters being turned on, triggers the various ad impressions and other cookies. Also if the person needs that information again odds are they'll click on a bookmark or a search link and repeat the impression process all over again.
When an AI scrapes the web it does so once, and possibly in a manner designed to not trigger any ads or cookies (unless that's the purpose of the scrape). It's more equivalent to a person hitting up the website through an archive link.
...it is? I didn't see that question raised in OP's text at all. What do legacy human legalities have to do with how AI will behave?
> Because it's false equivalence? ChatGPT isn't a human being.
Is this important? What is so special about human learning that it puts it in a morally distinct category from the learning that our successors will do?
It sounds like OP is concerned with the ad-driven model of income on the internet, and whether it requires breaking in order for AI to both thrive and be fair.
To be fair, so are you.
There is really no reason to believe that what chatGPT or stable diffusion does is anything like what "your brain" does--except in the most superficial, inconsequential way.
Second, try applying this logic to literally anything else and you'll see why it's absurd:
"You can't ban cars from driving on sidewalks! If it's acceptable for people to walk on sidewalks, then it has to be acceptable for cars to drive on sidewalks, since it's just automated walking"
"You can't ban airplanes from landing in ponds. They fly 'just like' ducks fly! So if it's acceptable for them, it must be acceptable for airplanes too"
Why would it be incoherent to say “I’m okay with a person reading, synthesizing, and then utilizing this synthesis—but I’m not okay with a company profiting off of a computer doing the same thing.” What’s wrong with that?
But again, like you and others have said, it’s really not the same thing at all! All ChatGPT (or any other deep learning model) is capable of doing is synthesizing “in the most superficial way.” What a person does is completely different, much more interesting.
I also agree it's not the only argument and ultimate proof.
I don't, at this point, have an answer. I'm sure this miraculous new technology will survive the luddite attacks, but there will probably be some tense moments, and some jurisdictions will choose to be left behind.
I would say most knowledge about words/grammar/laws of nature can be taken for granted without a citation, but there are some important exceptions where things must be cited. I don't know how you'd reliably teach the difference to a computer though.
If ChatGPT were to be required to share its sources, they would need a completely different approach. I'm not commenting on whether or not that would be a bad thing, but it would render the current iteration completely useless. You can't strap a source-crediting mechanism on top of a transformers-based model after the fact.
Oh, wait, I'm not going to cite sources in a non-scientific work as this leads to madness. The following is a previous post of mine on HN
"Your mind exists in a state where it is constantly 'scraping' copyrighted work. Now, in general limitations of the human mind keep you from accurately reproducing that work, but if I were able to look at your output as an omniscient being it is likely I could slam you with violation after violation where you took stylization ideas off of copyrighted work.
RMS covers this rather well in 'The right to read'. Pretty much any model that puts hard ownership rules on ideas and styles leads to total ownership by a few large monied entities. It's much easier for Google to pay some artist for their data that goes into an AI model. Because the 'google ai' model is now more culturally complete than other models that cannot see this data Google entrenches a stronger monopoly in the market, hence generating more money in which to outright buy ideas to further monopolize the market."
If you previously interacted with people on this issue, you must know that.
It is fair for a single human to breathe, but not for a machine to use all oxygen on this planet at once, killing everyone else in the process.
Air is zero-sum. Knowledge is not.
"Learning is unfair" is not an argument you want to win.
The difference is in scale.
A human video game designer can consume other people's art, then sell their labor to a video game developer. The amount of value captured by that one designer rounds down to zero as a percentage of the economic value created by 'video game art'.
OpenAI can consume all of the video game artists, ever, create an art design product and capture a significant percentage of the economic productivity of video game art.
The difference is scale. At scale it becomes a problem.
Edit: I don't know how to satisfy all parties. This shakes the foundation of copyright. Perhaps we are all finding out how valuable good information truly is and especially in aggregate. We have created proto-gods.
This could be an excellent brain augmentation, trying to hamper it because we want to force people to drag themselves through underlying sources so those sources can try to steal their attention with ads for revenue is asinine.
That is, instead, one of the larger and vastly more important sociocultural issues that actually warrants attention, but never receives it in sufficient degree to address the problem, because, for example, we're arguing whether automated learning is "fair".
https://en.wikipedia.org/wiki/Luddite
Might as well ban computers since they automated and eliminated a lot of manual jobs.
The problem of humans with no money should be solved by a social safety net and things like UBI.
I don’t do it on an industrial scale.
Imagine if you were a webmaster and Google unilaterally decided to stop sending users to content you have worked to research and write, and instead aggregated it and showed the answer to user’s query entirely on its own pages, without any attribution or payment to you. Unimaginable, yet that is very much the scenario unfolding now. [1]
Scraping at this kind of scale is out of your (or any given individual’s) reach. It is, however, within reach of the likes of Microsoft (on whose billions OpenAI basically exists) and Google (who, to be fair, have not abused it in such a blatant way so far).
[0] It is clearly using someone else’s works for commercial purposes, including to create derivative works. (Again, it’s different from you creating a work derivative from someone else’s work you read previously, because in this case a corporation does it at scale for profit.)
[1] And the cynic in me says the only reason we are not yet out with pitchforks is simply because OpenAI is new and shiny and has “open” in its name (never mind the shadow of Microsoft looming all over it), while Google is an entrenched behemoth that we all had some degree of dissatisfaction with in the past and thus are constantly watching out for.
That is trivially disproved, as is the rest of your argument that follows from it as a premise.
1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?
2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?
3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.
So which of these examples are a better metaphor for what a LLM does?
I don't know. The fact is, LLMs are a new thing in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.
Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and you personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.
What exactly are the incentives to publish information openly in that world?
(Will they even believe you if you say you’re the one who did the niche research powering some specific ChatGPT answer, in a world everyone knows that you can just ask an LLM?)
Tomorrow: 1) you do research, write posts, publish a book, 2) it is all consumed by a for-profit operated LLM. 3) People ask LLM to get answers to some related question or interest 4) They ask the LLM for a list of recent books that go in depth on the topic or are in the genre etc. 5) Your name comes up in the list 6) Goto step 2 from Yesterday
My belief is that ChatGPT is actually not quite capable of that, after seeing examples of how it manufactures non-existing references. Besides, if it were capable of that, why would it not show your name as part of the answer already now?
The cynic in me thinks it’s not capable of that primarily because it is not a priority for OpenAI and training data strips attribution, with an explicit purpose: if the public knows that ChatGPT can trace back the source, OpenAI would be on the hook for paying all the countless non-consensual content providers on which work it makes money.
We should treat OpenAI as we treat Google and Microsoft. It has great talent and charismatic people working for it, but ultimately it’s a for-profit tech company and the name they chose ought to make us all the more suspicious (akin to Google’s “don’t be evil”).
> Why would someone only ask an LLM questions when they were in the market to buy a book?
Why would you be in a market for a book when you can learn the same and more by asking an LLM that already consumed said book? And therefore why would the author spend effort writing and publishing a book knowing it’d sell exactly one copy (to LLM operator)?
Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI, except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web, or signed away in unfair ToS of various online publishing platforms.
Since the damage is individually small (they took some code from me without attribution, OK) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.
Huh? No. Some artists are maybe?
> as they should be, being nearly eclipsed by AI
Not even close. It's like looking at the newest brand of clip art.
Non-artists don't (maybe can't) know that particular feeling, at least not with regard to being told you're angry about "what's supposed to look like art".
(Heck, artists have been told that with regard to other humans' art for centuries, for one)
Going even further, a lot of artists already know how to build on this new tech without ripping people off.
I used to teach college art classes and would have loved to integrate this topic into the curriculum. It'd be a great ongoing discussion, no matter the legal outcomes.
2. people share, get creative and get some sort of credit for it
3. scrape it all, feed it into a large deep neural network, and become a worse but easily accessible version of all this content
4. creative people don't see a reason to keep sharing what they have (no new public books, no new open source projects, ...)
5. get stuck in an AI world of recycled content
People blindly following OpenAI products have a very shortsighted vision. What they did is neither innovative nor extraordinary: they got the data, convinced some victims to kickstart them, and made sure the hardware supports a deep neural network big enough to do the job. Check out the alternatives to OpenAI's solutions; it's not hard.
I came to this realisation arguing with someone in a mutual Discord server about these very topics (the negative impacts of AI). They just couldn't see it, and refused to believe it. I was constantly met with things like "Sure, we'll have to adjust, but it'll come" and "Things are no worse now than when TV and books were invented" (completely ignoring the many billions companies are spending to make things more addictive to our monkey minds, which don't change). Also lots of noble "everyone can use it and it'll benefit everyone"... when really, it only benefits those who can control it. No mention of biases in training data or anything else either. They were completely blinded to the idea that it might not be good and that we should seriously admit there are huge issues looming.
I also found it telling that the multiple people like that also weren't fans of in-person interaction, outside their friend group. They saw Discord interactions as just as fine as going out and having serendipitous moments in person, with other real people, and just actually living. Something else I feel technology has stolen from us with everyone always glued to their screen. It's funny how I've become something of a Luddite, proudly, and think we need less internet and more real world, cause, well, life is real world, being human is through real world interactions. And not ones mediated by your phone.
I create an omniscient copyright detection bot and point it at everything you create, 24 hours a day, 7 days a week.
You go home and sing happy birthday to your kid. The bot gives you a non-monetary warning for using a copyrighted work without permission. No big deal, but it is on your permanent record.
It had been a stressful day, so you take up your evening hobby of painting. You like nature scenes and trees, and 30 minutes in you receive a violation: evidently Bob Ross has already done this, and his surviving estate is now asking you to destroy the picture.
The next day you go into your job at the corporate bureaucracy slinging lines of JavaScript. It's been a productive day so far and you have a few hundred new lines of code written, and then the bot's going off and HR and legal are ringing the phone within seconds. Turns out some comment you saw on Stack Overflow years ago was imprinted in your memory well enough that you committed a copyright violation. Looks like you'll be losing your job.
If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is as it is today.
If the product scrapes the data and presents it on its own website, like ChatGPT and Google do, then that's effectively the same as taking away ad revenue from those websites, because they aren't getting the impressions.
Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions you will gain nothing.
I'm actually not really sure I have an opinion on the ethics of it. Same argument as Adblock. You don't get to control how people consume your content if you put it out in the world for free. That goes for profiles, or articles, reddit posts, StackOverflow, etc. The only thing that's ironic is that large tech companies throw a fit whenever you want to turn the tables and scrape them.
I'd also argue that Google directing traffic to your website is a good alignment of incentives. ChatGPT spitting out answers derived from your work with nothing given back to you in return is not.
I can't seem to find anything on OpenAI's crawler agent, so I'm skeptical they're considering robots.txt at all.
For now, I have removed my existing works, both technical and creative, from the internet and won't be adding more while I try to work out what to do.
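For what it's worth, whether a crawler honors robots.txt is something a publisher can at least express; a minimal sketch using Python's stdlib, with a hypothetical crawler user-agent string since OpenAI hasn't documented one (and robots.txt is purely advisory — a scraper can simply ignore it):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks a hypothetical AI crawler user-agent
# while leaving the rest of the site open to everyone else.
rules = """
User-agent: ai-crawler
Disallow: /

User-agent: *
Disallow: /private/
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The hypothetical AI crawler is disallowed everywhere...
print(rp.can_fetch("ai-crawler", "https://example.com/articles/1"))  # False
# ...while an ordinary browser-like agent is allowed.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/articles/1"))  # True
```

The catch, as the comment notes, is that this only works if the crawler identifies itself and chooses to comply.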
On the other hand, the focus on the potential of ChatGPT's natural language processing capabilities highlights the significance of learning and using LLMs (large language models) in data handling. The use of LLMs could lead to a future where traditional databases become obsolete, replaced by advanced language models. As such, integrating LLMs into our daily lives and processes could bring many benefits and possibilities.
At some point participating in the internet means your stuff is going to be seen. I wear glasses to read web content. I don't think the glasses company should pay royalties for what I read. chatGPT is a tool that allows me to understand and use the information people put onto the internet better.
Far from a matter of fairness, this is simply another way that selfish people are trying to monetize the future, to make it more and more difficult and expensive for others to participate.
"I've always wished I could charge everyone on earth. ChatGPT looks like the future. If I can tap the money flow there I will get mo' money."
I'm against it.
I’m not arguing for either side, just pointing out that we need to carefully consider what rights AI should share with humans.
You're being selfish too. How do you think we have phones, etc... ? Capitalism applies to knowledge too.
ChatGPT took advantage of that and wants to monetize it while cutting people who spent time, money, resources, etc... Just like copilot, plain and simple.
It was maybe unfair to telephone operators when connection automation was implemented, as it made them obsolete, but the older model couldn't scale, the same way reading text from the source doesn't scale for human productivity.
Also, I agree exactly. Advertising is increasingly useless. It's a tax on knowledge and it's gross. I can't wait for it to die.
I want to only pay for the stuff I use.
E.g. Summary of How to Win Friends and Influence People: Effective Steps to Better Interpersonal Relationships by Book Lyte
ChatGPT does more of a mashup with the learned data than humans need to, that'll do me.
We can only hope. It’s unfair to someone that my browser can ask your server for a page, I see an ad for random bullshit nobody would ever care about, and money changes hands behind the scenes and that counts as an economic transaction which boosts GDP. It’s unfair (in my favour) that I can piggy back off this to get things for free.
And when I say “someone“ I suspect “everyone”. Sadly spending money advertising “Yorkshire woman finds guaranteed way to win on the horses” doesn’t seem to have caused anyone to run out of money and have the whole thing collapse yet. And it’s unfair on real small businesses with products paying for adverts which people don’t see or are clicked by bots or are misreported and all they can do is throw money at Google and Facebook and hope.
Clearly, ownership of ideas runs out, because we all use linked lists or binary trees, or paper, or turbines or the list goes on. We don't pay money to the inventors of linked lists, or the heirs or successors-in-interest to the inventor of paper. Why not? When does ownership of an idea expire? Why do we unconsciously accept copyright or patent limits of today?
There's also an issue with simultaneous invention, but that's out of scope here. Clearly ChatGPT is just regurgitating or otherwise emitting previously-ingested material.
I dream of royalties going away so that we only get original content made for the love of expression, out of a feeling that it's important. I would be happy to have a LOT less stuff to look at if I didn't have to sift through so much garbage.
Of course, I am also in favor of UBI so that those creators can eat while they are doing it.
"If you applied the same set of rules to a human, how exactly would that look"
Simply put culture is the copying of each others ideas. When one of us started banging rocks together to make them sharp they didn't sell this idea to others, at best they traded sharpened arrows for something else.
The big issue with humans is we are commonly very conservative in our ideas. "Yesterday I did X, today I did X, and tomorrow I'll do X", fine and dandy until tomorrow a machine does X for nearly free. Instead of figuring out how to adapt our economic systems to deal with new systems of cheap and plenty the fearful and the greedy are looking for ways to maximize the amount they can profit or hold it back to maintain status quo.
Discussion is pointless because everyone already has an opinion and it's very firm.
Google has been doing this in search results for years, and so does Bing. Apple also does it in their built-in dictionary.
Why rant about ChatGPT, which at least currently comes from a comparatively small company?
question: How could the people who generate the content used in an AI language model be paid for their work?
answer: There are several ways in which the people who generate content for an AI language model could be paid for their work:
Royalty-based payment: Content creators could receive a percentage of the revenue generated from the use of their content in the AI language model.
Token-based payment: If the AI language model is built on a blockchain, content creators could be paid in tokens that could be traded for cryptocurrency or fiat currency.
Partnership with content publishers: The developers of the AI language model could partner with content publishers to compensate the creators of the scraped content.

[0] The answers focus on the technicalities of how the payments could be arranged, but the much bigger problem is that it's not clear who the payments should be going to: there's no immediately obvious or unique way of attributing a given output to specific training inputs. That would require a separate model with a lot of room for judgement/modelling decisions, or a new type of LLM that has that feature baked in.
2 - Code was trained from GitHub. GitHub is Microsoft. OpenAI is Microsoft money. So Microsoft trained its AI on Microsoft code. You disagree? Then GTFO from GitHub and don't feed Microsoft your code anymore.
3 (the most important point) - Q: "Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?"
Fuck YEAH!! Please do so. I hope that shit show of an ad model crashes and burns to the ground. You can't use the internet without wearing solid armor in the form of uBlock Origin and/or NoScript (or PiHole if you want the same readable experience on the rest of your household devices).
The neural net model is condensed to 800 GB.
https://www.springboard.com/blog/data-science/machine-learni...
Note that the "compression" there also includes the "intelligence" that it presents - you might be able to get some powerful compression of English text... but you can't ask a gzip file to come up with a joke about cats and dinosaurs.
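The compression point is easy to see at toy scale with the stdlib; redundant natural-language text shrinks dramatically under a generic compressor, but the result is a frozen archive, not something you can query for content that wasn't literally in the input (exact ratios will vary with the text):

```python
import zlib

# Natural-language text is highly redundant, so a generic
# compressor shrinks it a lot -- but all it can ever give
# back is the exact bytes it was fed.
text = ("the cat sat on the mat and the dog sat on the rug " * 200).encode()
packed = zlib.compress(text, level=9)

ratio = len(packed) / len(text)
print(f"{len(text)} bytes -> {len(packed)} bytes (ratio {ratio:.3f})")
```

An LLM is closer to lossy compression plus interpolation: it gives up exact reconstruction in exchange for being able to respond to prompts it never saw.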
A typical single-spaced page is 500 words long
That’s 179,280,000 full pages of text.
I wonder if they excluded any duplicated text.
Hopefully. This would be the best outcome I can think of for the Internet.
Obviously storage is not a major factor here.
The closest to an authoritative source on it is https://twitter.com/sama/status/1599671496636780546
> average is probably single-digits cents per chat; trying to figure out more precisely and also how we can optimize it
An attempt to work through it from related resources is https://twitter.com/tomgoldsteincs/status/160019698195510069...
In particular https://twitter.com/tomgoldsteincs/status/160019699090561433...
> So what would this cost to host? On Azure cloud, each A100 card costs about $3 an hour. That's $0.0003 per word generated.
> But it generates a lot of words! The model usually responds to my queries with ~30 words, which adds up to about 1 cent per query.
---
It is much less than $1/interaction.
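Working the quoted figures through (these are the tweet's assumptions, not measured numbers):

```python
# Back-of-envelope from the figures quoted above.
cost_per_gpu_hour = 3.00   # assumed Azure A100 price, USD
cost_per_word = 0.0003     # the tweet's per-word estimate, USD
words_per_reply = 30       # a typical short reply in the tweet's testing

# Implied throughput for the per-word estimate to hold:
words_per_hour = cost_per_gpu_hour / cost_per_word   # ~10,000 words/hour
cost_per_reply = cost_per_word * words_per_reply     # ~$0.009

print(f"~${cost_per_reply:.3f} per reply, i.e. about a cent")
```

That lands squarely in the "single-digit cents per chat" range sama quoted.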
> What's the New York Times scrambled egg recipe?
GPT returns the exact recipe. If I were NYT I'd be frustrated. Their content is now showing without the ad views or paywall.
Is there something analogous to saliency maps for LLM?
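There is: gradient-based input attribution carries over directly — the gradient of a chosen output logit with respect to the input tokens scores how much each token mattered (libraries such as Captum package this up for PyTorch models). A toy sketch with a single linear layer, where the gradient can be read off directly instead of backpropagated; everything here (the vocabulary, the random weights) is illustrative:

```python
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
n = len(vocab)

# Toy "model": one linear layer from a bag-of-words vector to
# next-token logits. Real LLMs are deep, but the saliency recipe
# (gradient of one output logit w.r.t. the inputs) is the same,
# just computed by backpropagation.
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]

x = [0.0] * n
for word in ["the", "cat", "sat"]:
    x[vocab.index(word)] += 1.0

logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
target = max(range(n), key=lambda i: logits[i])  # "predicted" next token

# For a linear model, d logits[target] / d x[i] is just W[target][i];
# its magnitude is the saliency of input word i.
saliency = [abs(w) for w in W[target]]
ranked = sorted(zip(saliency, vocab), reverse=True)
```

What this does not give you is attribution back to *training* documents, which is the harder problem raised elsewhere in the thread.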
1) AI is open sourced and we adapt stably. Either everybody has the opportunity to be their own business, or there is UBI.
2) AI is open sourced but it is unfairly distributed. Only some people are suited to BTOB, and/or UBI is shit.
3) AI is not open sourced, the wealthy edge out mankind and a planet scale genocide occurs.
4) none of it matters because the looming war between the US & China explodes or climate change wipes us out in any meaningful capacity that could pursue AI.
Given the track record of our species, #1 feels like wishful thinking
If you are using GPT as a research tool, as opposed to asking a friend who is an expert in the subject, are you citing your friend when you write the paper, or are you going back and finding sources that back up your friend's point?
It’s really just building a better model.
Will LLMs drive interest/activity away from wikipedia.org? Will it put its own sources of high-quality ad-supported content -- wikihow.com, for example (though I can't be totally sure it scraped from there) -- out of business? Or is there an earth-shattering copyright suit against OpenAI in the works as we speak?
> Can this start breaking the ad-based model of the internet
Is the alternative that everything is behind some kind of paywall by default, to block scraping? Is that where we're heading?
In other areas of society where a bad thing cannot be stopped, we still use legislation to reduce the amount of it and mitigate some of the harm.
So, who gets credit for the word "mat" being generated in that context? I guess any texts talking about cats and mats in close proximity may deserve some of the "credit", but it goes way deeper than that, since why did ChatGPT choose to output such a trite sentence (albeit while only selecting one word at a time), rather than something else about cats, or perhaps a more interesting thing that cats often sit in/on...
People seem to assume that ChatGPT is pulling entire "facts" from various sources, but that's just not how it works - it's just feeding all the texts into a giant meat grinder of word statistics. It knows about words, not facts.
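That "meat grinder of word statistics" intuition can be demonstrated at toy scale with a bigram model; real LLMs condition on far longer contexts through learned weights, but the output is still "statistically plausible next word", not a looked-up fact:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat sat on the sofa .".split()

# Count which word follows which: this table is the model's
# entire "knowledge" of the text.
follows = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    follows[prev][word] += 1

def next_word(prev):
    # Greedy pick: the statistically most common successor.
    return follows[prev].most_common(1)[0][0]

print(next_word("cat"))  # "sat" -- not because the model knows what cats do
print(next_word("the"))  # "cat" -- just the most frequent bigram
```

Nothing in there knows what a cat or a mat is; it only knows which words tend to co-occur.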
...yes?
OpenAI: "challenge incorrect assumptions"
To me it looks more like memorizing enough of other employees' project contributions to try passing it all off as your own achievements in performance review.
so in that case it wouldn't be unfair ... :-)
Did ChatGPT pay for the content it is using? That was the original question...
"Copyright" "ingenuity of thought" etc are concepts that need to be overhauled since a lot more people now have access to higher education.
How could training an AI on the works of someone who has already been paid for them be unfair? Possibly because it affects their future marketability and income.
Current authors, artists, and internet commenters clearly have a stake in the results of their creative endeavors being used for gain they won't benefit from. This is very similar to the extractive monopolies of YouTube and the rest of social media: their profit at our expense.