I think this is self-contradictory. If the evaluation is proprietary then it is most certainly not reputable. We'd want open metrics where we can analyze the limitations. Of course, we'd need open data too, but that's exceptionally rare these days. Plus, a metric isn't really going to tell us whether we have spoilage or not. You can get some evidence for spoilage through a trained model, but it is less direct, fuzzier, and tells us more about what information the model was able to memorize than about whether the data was spoiled.