You are very likely right. The question is how far the approximation can generalise. One way to test that would be to quiz the model with slightly varied prompts. Any human who can “solve” this word problem can reasonably be expected to solve the same problem if we change the subject’s name (from Jane to Bob, or Sanj, or even to Xcfg), the name of the object (from balloon to token, or even to embobler), or the attribute used to distinguish them (from red/blue to heavy/light, for example).
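For concreteness, here is a minimal sketch of generating such variants by template substitution. The puzzle template, the exact counts, and the word lists are my own invention; any word problem with the same slots would do.

```python
from itertools import product

# A made-up word problem with the varying parts factored out into slots.
TEMPLATE = ("{name} has {count} {attr} {obj}s and gives two of them away. "
            "How many {attr} {obj}s does {name} have left?")

names = ["Jane", "Bob", "Sanj", "Xcfg"]      # including a nonsense name
objects = ["balloon", "token", "embobler"]   # including a nonsense object
attrs = ["red", "blue", "heavy", "light"]    # colour swapped for weight

# Every surface variant of the same underlying problem.
variants = [
    TEMPLATE.format(name=n, obj=o, attr=a, count=5)
    for n, o, a in product(names, objects, attrs)
]

for v in variants[:3]:
    print(v)
```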
We can also try rewording the challenge sentences entirely. As long as the new sentences convey the same problem, you would expect a system that can “understand” them to produce the same or a similar solution.
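To make that consistency check concrete, here is a minimal sketch. `ask_model` is a hypothetical stand-in for whatever model API is under test, and the paraphrases are invented for illustration.

```python
def inconsistent_variants(variants, ask_model):
    """Ask the model each variant of the same problem and collect
    the prompts whose answer disagrees with the first one.

    `ask_model` is a placeholder: plug in whatever call your model
    API actually exposes (takes a prompt, returns an answer string).
    """
    baseline = ask_model(variants[0])
    return [v for v in variants[1:] if ask_model(v) != baseline]

# Example with a stub "model" that always answers "3", just to show
# the shape of the check; a real test would call an actual model here.
paraphrases = [
    "Jane has five red balloons and gives two away. How many are left?",
    "After giving away two of her five red balloons, how many does Jane still have?",
]
print(inconsistent_variants(paraphrases, ask_model=lambda p: "3"))  # -> []
```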
Curiously, this kind of thought experiment also exposes a weakness of the Turing test as originally formulated. A machine correctly solving these word-puzzle variations could “prove” that it “understands” the sentences, but it would also reveal that it is not human, since I would expect a real human to start protesting the inanity of the challenges quite quickly. ;)