So, the presumed attack (not against individuals, but to defeat the system) is:
1. Identify some innocuous pictures that many many people have (memes, Beyoncé, whatever).
2. Produce CSAM.
3. Mangle it such that it is still CSAM visually, but NeuralHash-collides with the innocuous pictures from step 1 (a rough sketch of what this step assumes follows the list).
4. Distribute.
5. Wait until the mangled images are (via some other mechanism) a) identified as CSAM, b) added to the NCMEC database, and c) added to Apple's on-device database of blinded hashes in some iOS update.
6. Millions of people are suddenly flagged incorrectly by NeuralHash for exceeding the match threshold (since they have the innocuous pictures in their libraries), and the review teams are flooded and cannot pick out the small number of actual CSAM holders.
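Concretely, step 3 presumes something like the projected-gradient search below. This is only a hedged sketch of the assumption, not a working attack on the real system: `model` is a hypothetical stand-in for a differentiable perceptual-hash network (its pre-quantisation outputs), and `collide`, `target_bits`, and `eps` are names invented here for illustration.

```python
# Hypothetical sketch of what step 3 assumes: a projected-gradient search that
# nudges an image until a stand-in neural hash model emits a chosen target hash.
# `model` is assumed to map a (1, 3, H, W) tensor in [0, 1] to pre-quantisation
# hash logits; the real NeuralHash network and its quantisation step are not
# reproduced here.
import torch

def collide(model, x, target_bits, steps=500, lr=1e-2, eps=8/255):
    """Search for a small perturbation delta so that sign(model(x + delta))
    matches target_bits, while keeping |delta| <= eps per pixel."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = target_bits.float() * 2 - 1          # map {0,1} bits to {-1,+1}
    for _ in range(steps):
        logits = model(torch.clamp(x + delta, 0, 1))
        # hinge loss: push every hash logit past a small margin on the target side
        loss = torch.relu(0.1 - logits * target).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)               # keep the change visually small
    return torch.clamp(x + delta, 0, 1).detach()
```

Whether eps can stay small enough that the result is "still CSAM visually" while the hash flips completely to the target is exactly the assumption questioned in point A below.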
That attack is not without a certain elegance. However, it seems to me that:
A) it is predicated on the assumption that you can easily mangle a picture so that it NeuralHash-collides with a desired target picture (one of the widely circulating innocuous pictures from step 1) without degrading its visual content too much.
B) it would be quickly defeated by amending the 2nd-tier algorithm (between NeuralHash and human review), though, as you highlight, that might be tricky given that the team working on that tier presumably only has access to the innocuous false-positive image (the user's upload), not to the purposefully mangled CSAM it collides with.
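On B), one way the 2nd tier could be amended is to check flagged matches against a second, independent perceptual hash before anything reaches human review; the attacker would then have to collide two unrelated hashes at once. A minimal sketch, assuming such a second hash database can be derived from the same source material as the NeuralHash one (every name here — `FlaggedMatch`, `second_phash`, `SECOND_HASH_DB` — is hypothetical, not Apple's actual design):

```python
# Hedged sketch of an amended 2nd tier: drop NeuralHash matches whose visual
# derivative does not ALSO agree with an independent perceptual hash.
# Every name below is hypothetical; it only illustrates the structure.
import io
from dataclasses import dataclass
from PIL import Image

HAMMING_THRESHOLD = 8          # assumed tolerance for the second comparison

@dataclass
class FlaggedMatch:
    account_id: str
    matched_entry_id: str      # which database entry NeuralHash matched
    visual_derivative: bytes   # low-resolution copy available after the threshold

def second_phash(image_bytes: bytes) -> int:
    """Stand-in independent hash: a plain 8x8 average hash (aHash).
    A real deployment would use something more robust; the point is only that
    it must be computed by a *different* algorithm than NeuralHash."""
    img = Image.open(io.BytesIO(image_bytes)).convert("L").resize((8, 8))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= avg else 0)
    return bits

# entry_id -> second hash of the corresponding known image, assumed to be
# derivable from the same source material as the blinded NeuralHash database
SECOND_HASH_DB: dict[str, int] = {}

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def filter_before_human_review(matches: list[FlaggedMatch]) -> list[FlaggedMatch]:
    """Forward only matches whose independent hash also agrees with the database.
    A flood of innocuous images that collide only under NeuralHash is dropped
    here, because colliding with two unrelated hashes simultaneously is much
    harder for the attacker."""
    survivors = []
    for m in matches:
        expected = SECOND_HASH_DB.get(m.matched_entry_id)
        if expected is not None and \
           hamming(second_phash(m.visual_derivative), expected) <= HAMMING_THRESHOLD:
            survivors.append(m)
    return survivors
```

The caveat in B) still applies, though: building SECOND_HASH_DB would require whoever holds the source images (NCMEC, not the engineering team) to compute the second hashes, since the team itself only ever sees the innocuous false positives.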