Ya interesting thought - would be fascinating if generating games w/solutions is part of the training data pipeline. There's been previous work on testing LLMs on logic puzzles[1][2][3], so they could possibly be building off those ideas to improve performance.
[1] https://huggingface.co/papers/2504.00043
[2] https://huggingface.co/blog/yuchenlin/zebra-logic
[3] https://arxiv.org/pdf/2403.12094