One could team up with HackerRank/LeetCode, let the model code in the interface (maybe there's an API for that already, no idea), execute its code verbatim, and see how many test cases it gets right the first time around. Then, like for humans, give it a clue about one of the tests not passing (or code not working, too slow, etc.). Give points based on the difficulty of the question and the number of clues needed.
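The scoring loop could look something like this — a minimal sketch where `model_solve`, `run_tests`, and the scoring formula are all hypothetical, just to make the idea concrete:

```python
def score(difficulty: int, clues_used: int, max_clues: int = 5) -> float:
    """Hypothetical scoring: more points for harder questions, fewer per clue."""
    if clues_used > max_clues:
        return 0.0
    return difficulty * (1 - clues_used / (max_clues + 1))

def evaluate(model_solve, run_tests, difficulty: int, max_clues: int = 5) -> float:
    """model_solve(clues) -> candidate solution; run_tests(sol) -> list of failure descriptions."""
    clues = []
    for attempt in range(max_clues + 1):
        failures = run_tests(model_solve(clues))
        if not failures:
            return score(difficulty, clues_used=attempt)
        clues.append(failures[0])  # reveal one failing test, like hinting a human
    return 0.0  # ran out of clues
```

A model that solves the problem on the first try gets full difficulty points; each clue revealed shaves the score down.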
I guess the obvious caveat is that these models are probably overfitted on these types of questions. But a specific benchmark could be made containing questions kept secret from models. Time to build "Botrank" I guess.