I wasn't equating this to a scenario where you are only looking for one watch. The expression (and the anecdote behind it) is also used to describe managing by metrics that are easy to collect, often at the expense of outcomes.
That applies to caring more about false positives than false negatives, since nobody (or close to nobody) even knows how many false negatives they have.
Your statement that someone being successful elsewhere doesn't prove we have a false negative underscores the uncertainty involved. It doesn't invalidate what I'm saying in the least.
But if we want to be rigorous, we could do some direct testing. Let's say we give each candidate a telephone screen and four interviews. Simplifying somewhat, let's say our policy is that failing any one of them eliminates the candidate.
If we hire lots of people, we can collect false-positive and false-negative stats by randomly giving a pass to some of the candidates who fail exactly one test.
If a lot of the people who got through despite failing a particular test end up succeeding on the job, we can infer that the test produces false negatives.
On the other hand, if everyone who gets the random "free pass" on one particular test ends up failing on the job, we can presume that test has a low false-negative rate, and maybe stop giving random passes for it.
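To make the scheme concrete, here is a rough Python simulation of it. Everything in it is hypothetical: the test names, the per-test error rates, the 30% share of good candidates and the 10% chance of a free pass are made-up numbers, and it assumes job performance reveals whether a hire was actually good, which real life won't do so cleanly. It only illustrates the logic: a test that never rejects good candidates (interview1 here) produces free-pass hires who all fail, while a test with a high false-negative rate (interview2) produces free-pass hires who mostly succeed.

```python
import random
from collections import Counter

# Hypothetical setup: a phone screen plus four interviews.
TESTS = ["phone", "interview1", "interview2", "interview3", "interview4"]

# Made-up error rates, used only to drive the simulation.
# FALSE_NEG: chance a good candidate fails; FALSE_POS: chance a weak one passes.
FALSE_NEG = {"phone": 0.1, "interview1": 0.0, "interview2": 0.3,
             "interview3": 0.1, "interview4": 0.1}
FALSE_POS = {t: 0.3 for t in TESTS}


def simulate(n_candidates=100_000, base_rate=0.3, free_pass_prob=0.1, seed=1):
    """Run the 'random free pass' policy and return per-test estimates."""
    rng = random.Random(seed)
    waived_hires = Counter()      # hires who failed exactly this one test
    waived_successes = Counter()  # ...and then succeeded on the job
    regular_hires = regular_failures = 0

    for _ in range(n_candidates):
        good = rng.random() < base_rate  # hidden true quality
        failed = [t for t in TESTS
                  if rng.random() < (FALSE_NEG[t] if good else 1 - FALSE_POS[t])]
        if not failed:
            # Passed everything: hired under the normal policy.
            regular_hires += 1
            regular_failures += not good  # a false positive of the whole pipeline
        elif len(failed) == 1 and rng.random() < free_pass_prob:
            # Failed exactly one test but drew the random free pass.
            t = failed[0]
            waived_hires[t] += 1
            waived_successes[t] += good  # assumes job outcome reveals true quality

    # Share of free-pass hires who succeed: high means the test rejects good people.
    success_after_waiver = {
        t: (waived_successes[t] / waived_hires[t]) if waived_hires[t] else None
        for t in TESTS
    }
    pipeline_fp_rate = regular_failures / regular_hires if regular_hires else None
    return success_after_waiver, pipeline_fp_rate


rates, fp = simulate()
for test, rate in rates.items():
    print(test, rate)
print("failures among regular hires:", fp)
```

The per-test success rate among free-pass hires is the false-negative signal, and the failure rate among regular hires is the false-positive signal, so one policy gives both numbers.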