> If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns.
The "9500" quote is my conjecture of what might happen if they fix their approach, but the burden of proof is definitely not on me to actually fix their writeup and spend a bunch of money to run a new eval! They are the ones making a claim on shaky ground, not me.
You don't think that security companies (and likely these guys as well) develop systems for doing this stuff?
I'm not a security researcher, but I can imagine a harness that first scans the codebase and describes the API. Another agent then uses that description to decide which functions deserve a closer look and hands them, with the appropriate context, to another small LLM. You can even use yet another agent to evaluate the results and weed out false positives.
I would wager that such a system would yield better results for a much lower price.
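
To make that concrete, here is a rough sketch of the harness described above. Everything in it is an assumption for illustration: `Model` stands in for whatever LLM call you'd actually use, and the names `triage_pipeline` and `Finding`, the prompts, and the `NONE` / `REAL` reply conventions are all made up. It is not anyone's real system.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
Model = Callable[[str], str]


@dataclass
class Finding:
    function: str  # name of the flagged function
    report: str    # the auditor model's description of the issue


def triage_pipeline(
    functions: dict[str, str],  # function name -> source code
    describe: Model,
    select: Model,
    audit: Model,
    verify: Model,
) -> list[Finding]:
    # Stage 1: scan the codebase and describe the API.
    api_summary = describe("\n\n".join(functions.values()))

    # Stage 2: pick the functions worth a closer look.
    # Assumed convention: the selector replies with one name per line.
    reply = select(
        f"API summary:\n{api_summary}\n"
        f"Functions: {', '.join(functions)}\n"
        "List the names worth auditing, one per line."
    )
    # Drop anything the selector made up that is not a known function.
    shortlist = [n.strip() for n in reply.splitlines() if n.strip() in functions]

    findings: list[Finding] = []
    for name in shortlist:
        # Stage 3: audit each shortlisted function with context.
        report = audit(
            f"Context:\n{api_summary}\n\nAudit this function:\n{functions[name]}"
        )
        # Assumed convention: the auditor replies NONE when it finds nothing.
        if report.strip().upper().startswith("NONE"):
            continue

        # Stage 4: a separate pass tries to reject false positives.
        verdict = verify(
            f"Code:\n{functions[name]}\n\nClaimed issue:\n{report}\n"
            "Answer REAL or FALSE_POSITIVE."
        )
        if verdict.strip().upper().startswith("REAL"):
            findings.append(Finding(name, report))
    return findings
```

The selection and verification stages are what keep the "9,500 out of 10,000" problem in check: the expensive audit only runs on a shortlist, and nothing gets reported unless a second pass agrees with it. Whether that actually works well with small models is exactly what would need to be measured.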
Instead we are talking about this marketing exercise: "oohh, our model is so dangerous it can't be released, and btw the results can't be independently verified either."