I think they're saying that frontier LLMs might be usable to spot citations that are correct by shape (the cited work really exists) but incorrect by usage (it's unrelated to the text that cites it).
I kind of hate the idea, but you could probably run a lazy LLM check over every paper and every citation and have it flag citations that might be wrong in that second sense for human review.
But you'd need a LOT of tokens and a LOT of human-hours.
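For what it's worth, the flag-then-review loop itself is tiny; a rough sketch, where every name (`Citation`, `flag_for_review`, `toy_judge`) is made up for illustration. `toy_judge` is only a word-overlap stand-in for the LLM call so the example runs on its own; it is not how a real check would decide anything.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Citation:
    paper_id: str
    citing_sentence: str  # text in the paper that the citation is attached to
    cited_abstract: str   # abstract (or summary) of the cited work


# Hypothetical judge signature: returns True when the cited work looks unrelated.
Judge = Callable[[str, str], bool]


def flag_for_review(citations: Iterable[Citation], judge: Judge) -> list[Citation]:
    """Return the citations the judge thinks might be misused (one judge call each)."""
    return [c for c in citations if judge(c.citing_sentence, c.cited_abstract)]


def _long_words(text: str) -> set[str]:
    """Lowercased words longer than four characters, with edge punctuation stripped."""
    return {w.strip(".,;:()").lower() for w in text.split() if len(w) > 4}


def toy_judge(citing_sentence: str, cited_abstract: str) -> bool:
    """Stand-in for an LLM call (assumption): flag when no longer word is shared."""
    return not (_long_words(citing_sentence) & _long_words(cited_abstract))


citations = [
    Citation("p1", "Transformers rely on attention layers.",
             "We propose attention as the sole mechanism."),
    Citation("p1", "Transformers rely on attention layers.",
             "A survey of medieval bread recipes."),
]

# Only the flagged subset goes to a human; everything else is skipped.
flagged = flag_for_review(citations, toy_judge)
for c in flagged:
    print(f"{c.paper_id}: review citation of {c.cited_abstract!r}")
```

The code isn't the expensive part: it's one judge call per citation (that's where the tokens go) plus a human looking at everything in `flagged` (that's where the hours go).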