There is sufficient stochasticity in LLMs to invalidate most comparisons at this level. Minor changes in the prompt text will produce different results from run to run on the same model (depending on temperature and other sampling parameters), let alone across different models.
Try re-running your test on the same model multiple times with the identical prompt,
or varying the prompt.
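A minimal sketch of that kind of repeat-trial test, assuming a hypothetical `query_model` function standing in for whatever chat-completion API you use (OpenAI, Anthropic, a local model); here it is stubbed with weighted random sampling purely to simulate temperature-driven variation:

```python
import random
from collections import Counter

# Stand-in for a real chat-completion call; swap in your actual API
# client here. The stub only simulates temperature-driven sampling.
def query_model(prompt, temperature=1.0, rng=random):
    answers = ["42", "41", "43"]
    if temperature == 0.0:
        return answers[0]  # greedy decoding: always the most likely path
    # Higher temperature flattens the distribution over candidate answers.
    weights = [1.0, temperature, temperature]
    return rng.choices(answers, weights=weights)[0]

def run_trials(prompt, n=20, temperature=1.0, seed=0):
    """Send the identical prompt n times and tally distinct responses."""
    rng = random.Random(seed)
    return Counter(query_model(prompt, temperature, rng) for _ in range(n))

# At temperature 0 every run agrees; at temperature 1 the answers scatter.
print(run_trials("What is 6 * 7?", temperature=0.0))
print(run_trials("What is 6 * 7?", temperature=1.0))
```

If the tally is not heavily concentrated on one answer, any single-run comparison between models is telling you more about sampling noise than about the models.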
Depending on how much context the service you choose is keeping for you across a conversation,
the behavior can change.
Something as simple as telling the model its answer was wrong and asking it to try again can produce a different result.
Statistically,
the model will eventually hit on the right combination of vectors and generate the right words from the training set,
and as I noted before,
this problem very likely appears in the training data used to build all of the readily available models.