It seems to me that the licensing part is the part you can't throw into a big markov chain, legally. Even if they aimed only at open-source licensed material without exception, the point where they discard all the licenses and export a 'generic' slurry is the point where they infringe by definition. If they trained on more restrictive licenses that's just doubling down: what's needed is annotation and maintenance of what bits of code came from what licensing pool. You could well have a giant pool of GPL, a giant pool of MIT (which I would be in, all the more since I maintain a very automatable code style that's easy to import from). You could accumulate a list of sources for anything you did, at whatever level of granularity is desired.
The purpose of throwing away this attribution is intent to infringe. It's constructing a machine for the explicit purpose of grinding code into sludge of intentionally small enough pieces that, if you reconstruct copyrighted code in your markov-chainy way, you've got grounds for pretending you didn't build your machine to do exactly that.