There is no need to invent new regulatory methods when the one we have is perfectly sufficient: if somebody can prove your AI model is tainted with unlicensed works, regardless of how you acquired them, then your model as a whole is an infringing work, and the affected party can sue you for damages far exceeding the pennies of utility you gained from the infringement. Your stock price tanks, your corporate customers cease purchasing your models, and you go bankrupt.
It's exactly like the current copyright regime, where I could, in principle, copy-paste a file from the Linux kernel and compile it into my binary application, and nobody would know. How much could a single file from a work with tens of thousands of contributors possibly be worth, right? Wrong: it takes a single disgruntled employee (which you are guaranteed to have once you exceed a headcount of roughly 5) to destroy your business and product. The only way to avoid this is to train on public/open sources, or to get positive authorization, for the specific use of AI training, for each and every file you slurp up, which you definitely won't get for pennies.
As for the inevitable dominance of our AI overbrains fed on open source information: I, for one, welcome them. The cat is out of the bag; it's not like we can return to the previous state of affairs. The problem, as always, becomes a political one: how to distribute the fruits of these new technical capabilities to the (human) citizens.