Regarding the training of the model - I don't think copyright can restrict reading, and training is reading, not distributing any of the original data.
About deploying the model - it just needs to filter out verbatim snippets so it only outputs original, unattributable code. That can be done by hashing the training set's n-grams into a Bloom filter and checking generated output against it. The vast majority of code generated by Codex is original anyway.
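A minimal sketch of that n-gram/Bloom-filter idea (sizes, hash count, and n-gram length are arbitrary choices here, not anything Codex actually uses):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _probes(self, item: str):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

def ngrams(tokens, n=5):
    """Token n-grams joined into hashable strings."""
    return (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Index the training corpus's n-grams (toy corpus: one snippet).
corpus = "def quicksort ( xs ) : return sorted ( xs )".split()
bf = BloomFilter()
for g in ngrams(corpus):
    bf.add(g)

def looks_verbatim(output_tokens, n=5):
    """Flag generated output if any n-gram (probably) appeared in training."""
    return any(g in bf for g in ngrams(output_tokens, n))
```

Bloom filters can report false positives but never false negatives, which is the right failure mode here: you might occasionally suppress an original snippet, but you never emit an indexed one unflagged.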
By the way, Codex is good for many other tasks - parsing the fields of a receipt, extracting the summary of an email, generating baby names. It's an all-purpose NLP tool; just call it like a function. Code completion is only one of the things it does. Its English is pretty good too, it can even compose poems.
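The "call it like a function" pattern is just prompt construction - wrap the task description and the input, then send the string to the model. A sketch (the instruction wording and the `make_prompt` helper are illustrative, not any official API; the actual model call is omitted):

```python
def make_prompt(task: str, text: str) -> str:
    """Wrap an arbitrary NLP task as a completion prompt.

    The model sees the task as an instruction, the data as input,
    and continues the text after "Output:".
    """
    return f"{task}\n\nInput:\n{text}\n\nOutput:"

# One "function call" per task -- the same model sits behind each.
receipt_prompt = make_prompt(
    "Extract the merchant, date, and total from this receipt as JSON.",
    "ACME STORE  2021-07-04  TOTAL $12.50",
)
summary_prompt = make_prompt(
    "Summarize this email in one sentence.",
    "Hi team, the release slips to Friday because QA found a regression.",
)
```

From here, `receipt_prompt` would be passed to the completion endpoint and the returned text parsed as the "return value".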