But they can, because they're matching the hashes against the ones provided by NCMEC, not directly against CSAM itself (which presumably stays under some kind of lock and key at NCMEC).
Just as you can test whether you get false positives against a bunch of MD5 hashes that Fred provides, without ever knowing the contents of his documents.
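To make the analogy concrete, here's a minimal sketch of that test: we only ever hold Fred's digests (the hash values below are just well-known MD5 test vectors standing in for his list), never his documents, yet we can still check our own files against them.

```python
import hashlib
from pathlib import Path

# Hypothetical digest list Fred provides; his documents stay private.
FRED_HASHES = {
    "9e107d9d372bb6826bd81d3542a419d6",
    "e4d909c290d0fb1ca068ffaddf22cbd0",
}

def md5_of(path: Path) -> str:
    """Hex MD5 digest of a file's contents, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def matches(paths):
    """Return the files whose digest appears in Fred's list."""
    return [p for p in paths if md5_of(p) in FRED_HASHES]
```

Any file flagged here is a match (or a hash collision); everything else is a measurable false-positive rate, all without access to the originals.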
How does anyone ever actually fight the nasty stuff? This problem structure of "how do I catch examples of A if possessing examples of A is illegal?" must apply in many places and ways.
They don't need to train a model to detect the actual data set. They need to train a model to follow a pre-defined algorithm.
No idea if they did (or will), but I do expect it’s possible.
Sounds like that's what they did, since they say they're matching against hashes that NCMEC generated from their 200k-image CSAM corpus.
[edit: Ah, in the PDF someone else linked, "First, Apple receives the NeuralHashes corresponding to known CSAM from the above child-safety organizations."]
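For illustration, a toy sketch of matching a device-side perceptual hash against a provided list. This is not Apple's actual protocol (which does the comparison blindly via private set intersection); it just shows the basic idea. `neural_hash` is assumed to exist and map an image to a fixed-width integer, and the Hamming-distance threshold is a made-up parameter, since perceptual hashes of near-duplicate images differ in only a few bits.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fixed-width hashes."""
    return bin(a ^ b).count("1")

def is_known(image_hash: int, known_hashes, threshold: int = 4) -> bool:
    """Flag a hash within `threshold` bits of any provided hash.

    `known_hashes` plays the role of the NeuralHashes received from
    the child-safety organizations; the underlying images are never seen.
    """
    return any(hamming(image_hash, k) <= threshold for k in known_hashes)
```

The matching side only ever needs the hash list, which is the point of the whole thread: the corpus itself never leaves the organization.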