Which is not necessarily hypocritical. The amount of copying needed for something to be copyright infringement is not high… but it's still significantly higher than the amount needed to leak information. For that, just a few words will do, e.g.:
    // For Windows 12

or

    // Fuck [company name]

or

    long secret_key[2] = {0x1234567812345678, 0x8765432187654321};

That as long as they have the right to read the data, they have the right to train an AI on it. The fact that the code is available under an open source license is irrelevant to them.
As for why they didn't use their own private code to train their AI, I suspect the reason was non-malicious: "we don't need to; this public GitHub repo dataset is big enough for now."
Personally, I think Microsoft should double down on this legal stance. Train the AI on all their internal code. And train it on any code they have licensed from other companies too.
How is that reconciled with the fact that a person who has read copyrighted code (not even the original source code, merely a decompiled version of it!) is forbidden from reimplementing it directly?
https://www.computerworld.com/article/2585652/reverse-engine...
Are you sure?