That said, anyone claiming a 7b LLM is better than a well-trained 70b LLM like Llama 2 70b chat in the general case doesn't know what they're talking about.
Will it be possible in the future? Absolutely, but today we have no architecture or training methodology that would make it possible.
You can rank models yourself with a private automated benchmark that models have no chance to overfit to, or with a good human evaluation study.
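As a rough illustration of what a private benchmark harness can look like, here's a minimal sketch: the eval items, model names, and the `generate` function are placeholders you'd wire up to your own private data and inference backend, and the keyword check is just a stand-in for whatever scoring (LLM judge, human grading) actually fits your tasks.

    # Minimal private-benchmark sketch: keep the prompts/answers offline so no
    # model could have been finetuned on them, then score every model the same way.

    def generate(model_name: str, prompt: str) -> str:
        """Placeholder: call your own inference backend (local or API) here."""
        return ""  # stub so the script runs; replace with a real call

    # Private, never-published eval set (hypothetical examples).
    PRIVATE_EVAL = [
        {"prompt": "What is 17 * 24? Answer with just the number.", "keyword": "408"},
        {"prompt": "Name the capital of Australia.", "keyword": "canberra"},
    ]

    MODELS = ["llama-2-70b-chat", "mistral-7b-instruct"]  # whatever you want to compare

    def score(model_name: str) -> float:
        """Fraction of private eval items where the expected keyword appears in the output."""
        hits = 0
        for item in PRIVATE_EVAL:
            output = generate(model_name, item["prompt"])
            # Crude automatic check; swap in an LLM judge or human review for open-ended tasks.
            if item["keyword"].lower() in output.lower():
                hits += 1
        return hits / len(PRIVATE_EVAL)

    if __name__ == "__main__":
        for m in sorted(MODELS, key=score, reverse=True):
            print(m, score(m))

The point isn't the scoring method, it's that the test set never leaves your machine, so no finetune can have overfit to it.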
Edit: also, I'm guessing OP is talking about Mistral finetunes (the ones overfitting to the benchmarks) beating out 70b models on the leaderboard, since Mistral 7b itself ranks lower than Llama 2 70b chat.