Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.