In practice, what essentially happened is that the very high-quality Othello data had a huge impact on GPT's parameters (since it was the last training data the model received), and that impact showed up as those parameters overfitting to the rules of Othello.
The real test I would be curious to see is whether Othello GPT still works when the logic of the rules is the same but the board dimensions are different (e.g., smaller or larger boards).
My guess is that the findings would fall apart if the model were asked about a tile like "N13", which does not exist on a standard 8x8 board.
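
To make the "N13" point concrete, here is a rough sketch of why a different board size is a problem before the model even gets to reason about rules. It assumes a simple tokenization where each tile name maps to one token id; `build_vocab` and `encode` are hypothetical helpers written for illustration, not Othello GPT's actual tokenizer.

```python
import string


def build_vocab(size):
    """Map tile names like 'A1' to integer token ids for a size x size board.

    Assumption: one token per tile, numbered row by row. This is an
    illustrative scheme, not the vocabulary used by Othello GPT.
    """
    cols = string.ascii_uppercase[:size]
    tiles = [f"{c}{r}" for r in range(1, size + 1) for c in cols]
    return {tile: i for i, tile in enumerate(tiles)}


def encode(tile, vocab):
    """Return the token id, or None when the tile is outside the vocabulary."""
    return vocab.get(tile)


vocab_8 = build_vocab(8)    # standard board: columns A-H, rows 1-8
vocab_14 = build_vocab(14)  # larger board: columns A-N, rows 1-14

print(encode("H8", vocab_8))    # last tile on the 8x8 board
print(encode("N13", vocab_8))   # None: no such tile on the 8x8 board
print(encode("N13", vocab_14))  # valid only once the board is 14x14
```

Under this scheme, a model trained only on `vocab_8` has no token for "N13" at all, so testing on a larger board would need either a new vocabulary plus retraining or an encoding that is not tied to a fixed board size.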