> The wider challenge is how that is handled in a compliant way with LLMs and generative tools, which vendors do not seem to be taking particularly seriously yet.
I'm curious as to why people would want to train LLMs on personally identifiable information (PII). What's the benefit of an LLM that has a large collection of names, addresses, dates of birth, etc.?
Free-form text like Reddit posts contains a whole load of PII. Since there is absolutely no regard for what goes into an LLM's training data, the models naturally end up containing this PII too.