All models are tested through OpenRouter. The providers on OpenRouter vary drastically in quality, to the point where some simply serve broken models.
That being said, I usually test models a few hours after release, at which point, the only provider is the "official" one (e.g. Deepseek for their models, Alibaba for their own, etc.).
I don't really have any good solution for testing model reliability for closed-source models, BUT the outcome still holds: a model/provider that is more reliable, is statistically more likely to also give better results during at any given time.
A solution would be to regularly test models (e.g. every week), but I don't have the budget for that, as this is a hobby project for now.