The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI 2 sets they do awfully (<10%, while SOTA is at ~80%).
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.