Would it not be categorically intended as infringement regardless of the copyright status of the material?
It seems to me that the licensing part is the part you can't throw into a big markov chain, legally. Even if they aimed only at open-source licensed material without exception, the point where they discard all the licenses and export a 'generic' slurry is the point where they infringe by definition. If they trained on more restrictive licenses that's just doubling down: what's needed is annotation and maintenance of what bits of code came from what licensing pool. You could well have a giant pool of GPL, a giant pool of MIT (which I would be in, all the more since I maintain a very automatable code style that's easy to import from). You could accumulate a list of sources for anything you did, at whatever level of granularity is desired.
The purpose of throwing away this attribution is intent to infringe. It's constructing a machine for the explicit purpose of grinding code into sludge of intentionally small enough pieces that, if you reconstruct copyrighted code in your markov-chainy way, you've got grounds for pretending you didn't build your machine to do exactly that.