Pretty much anyone can scrape GitHub and train their model.
What exactly the legal implications of this are has yet to be tested.
Pretty much every model is susceptible to some sort of model inversion or membership inference ("set inclusion") attack.
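As a toy illustration of the membership inference idea: a model that has memorized its training data is noticeably more confident on examples it has seen than on ones it hasn't, and an attacker can exploit that gap with a simple confidence threshold. The "model" below is a deliberately extreme memorizer, not a real ML system — it's a minimal sketch of the attack's logic, nothing more.

```python
# Toy "model": a memorizing classifier that simply stores its training
# examples. Real models are far subtler, but heavy memorization is
# exactly what membership inference attacks exploit.
def train(examples):
    return set(examples)

def confidence(model, example):
    # A memorizing model is near-maximally confident on data it has seen,
    # and no better than a coin flip elsewhere.
    return 0.99 if example in model else 0.5

members = [f"secret_{i}" for i in range(100)]      # in the training set
non_members = [f"other_{i}" for i in range(100)]   # never seen

model = train(members)

# Attack: flag any example whose confidence exceeds a threshold as
# "probably part of the training set".
threshold = 0.9
guessed = [x for x in members + non_members
           if confidence(model, x) > threshold]

print(len(guessed))                           # 100
print(all(x in members for x in guessed))     # True
```

On this deliberately overfit toy the attack recovers the training set exactly; against real models the confidence gap is smaller, but the same thresholding idea is the basis of published membership inference attacks.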
By their own admission, Copilot sometimes outputs PII that was part of the training data, as well as verbatim code snippets. Even if it's rare (IIRC around 0.1% of the time), it's still a huge legal liability for anyone who uses the tool, especially since it's unclear how these inclusions are distributed and what triggers them. For example, it could be that a particular coding style, a particular way of using Copilot, or working on a specific subset of problems increases the likelihood of this occurring.
ML is too new to have been tested in court, and this has ramifications beyond just licensing. For example, if you use PII to train a model and then receive a GDPR deletion request, do you need to throw away and retrain your model?
I don’t think people should be angry, but I also think this needs to be tested in court, multiple times, before it can be considered “safe to use”.
But I also don’t think that the ML model is necessarily a derivative work.
For example, if you used copyleft material to construct a CS course, someone would be hard-pressed to argue that the course itself now needs to be released freely, let alone that anything the students write after attending the course falls under derivative work too.