It's a big part of why search overview summaries are so awful. Often the answers aren't grounded in the source material.
Instead, like a human, the model can (hopefully) disregard the instruction, so that it carries (close to) zero weight.
Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. That's the whole point of the ARC-AGI tests.
Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and just as unsurprisingly it did much better in Pokémon.
The benchmarks, by design, indicate you can mix up the switch puzzle pattern and the model will still solve it.