> But how do you know an input is adversarial?
Prompt injection and jailbreak attempts are usually easy to recognize. I don't think anything else is particularly concerning.
> the false positive rate means you'd need manual review of all the rejects (unless you wanted to reject something like 5% of genuine research)
Not all rejects, just those that submit an appeal. There are a few options, but ultimately appeals require some stakes, such as:
1. Every appeal carries a receipt for a monetary donation to arXiv that's refunded only if the appeal succeeds.
2. Appeal failures trigger the ban hammer with exponentially increasing durations, e.g. 1 month, 3 months, 9 months, 27 months, etc.
Bad actors either respond to deterrence or get filtered out while funding the review process itself.
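
To make the two options concrete, here's a rough sketch of how the stakes could be computed. The function names, the 1-month base, and the tripling factor are my own assumptions, read off the 1, 3, 9, 27 example above; nothing here reflects an actual arXiv policy or API.

```python
def ban_months(failed_appeals: int, base_months: int = 1, factor: int = 3) -> int:
    """Ban length in months after the n-th failed appeal.

    Hypothetical schedule: base_months * factor**(n - 1), which gives
    1, 3, 9, 27, ... with the default (assumed) parameters.
    """
    if failed_appeals < 1:
        return 0  # no failed appeals yet, so no ban
    return base_months * factor ** (failed_appeals - 1)


def refund_due(deposit: float, appeal_succeeded: bool) -> float:
    """Option 1: the deposit comes back only if the appeal succeeds."""
    return deposit if appeal_succeeded else 0.0


if __name__ == "__main__":
    # Ban lengths for the first four failed appeals.
    print([ban_months(n) for n in range(1, 5)])  # [1, 3, 9, 27]
    # A failed appeal forfeits the (hypothetical) 20-unit deposit.
    print(refund_due(20.0, appeal_succeeded=False))  # 0.0
```

Either mechanism works on its own. The point is just that the cost of a frivolous appeal grows quickly: under this schedule, four failed appeals add up to 40 months of bans.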