You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).
One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.