Better HN

0 points · amarcheschi · 1y ago

I quite like a scenario where LLM output can't be copyrighted, so that it is eventually possible to train an LLM on data from the previous one(s).


layer8 · 1y ago

OpenAI argues it’s a violation of their terms of service. So there are legal issues if it can be proven.

Palmik · 1y ago

Legal issues for who?

Company A pays OpenAI for their API. They use the API to generate or augment a lot of data. They own the data. They post the data on the open Internet.

Company B has the habit of scraping various pages on the Internet to train its large language models, which includes the data posted by Company A. [1]

OpenAI is undoubtedly breaking many terms of service and licenses when it uses most of the open Internet to train its models. Not to mention potential copyright violations (which do not apply to AI outputs).

[1]: This is not hypothetical BTW. In the early days of LLMs, lots of large labs accidentally and not so accidentally trained on the now famous ShareGPT dataset (outputs from ChatGPT shared on the ShareGPT website).

layer8 · 1y ago

For both.


mannewalis · 1y ago

But OpenAI's model isn't open source. How would they distill knowledge without direct access to the model?

layer8 · 1y ago

You don’t need direct access for LLM distillation, just regular API access.
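To make that concrete, here is a minimal sketch of the output-only flavour of distillation: send prompts to a black-box teacher, keep the text it returns, and save the pairs as training data for a student model. Everything named here is an assumption for illustration. `teacher` is any text-in, text-out callable, `fake_teacher` is a hypothetical stand-in for a real API call, and the `prompt`/`completion` JSONL layout is just one common convention, not any particular vendor's format.

```python
import json
from typing import Callable, Iterable


def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict[str, str]]:
    """Query the teacher once per prompt and keep (prompt, completion) pairs."""
    records = []
    for prompt in prompts:
        # Black-box call: only the returned text is used, no weights or logits.
        completion = teacher(prompt)
        records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: list[dict[str, str]]) -> str:
    """Serialize the pairs as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt: str) -> str:
    # Hypothetical stand-in; a real setup would call a hosted model's API here.
    return f"Answer to: {prompt}"


pairs = build_distillation_set(
    ["What is distillation?", "Define an API."],
    fake_teacher,
)
print(to_jsonl(pairs))
```

The student is then fine-tuned on those pairs like on any other supervised text dataset, which is why API access alone is enough.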

