My point isn't that the model falls for gender stereotypes, but that it falls for thinking that it needs to solve the unmodified riddle.
Humans fail at the original because they expect doctors to be male and miss crucial information because of that assumption. The model fails at the modification because it assumes that it is the unmodified riddle and misses crucial information because of that assumption.
In both cases, the trick is to subvert assumptions. To provoke the human or LLM into taking a reasoning shortcut that leads them astray.
You can construct arbitrary situations like this one, and the LLM will get it unless you deliberately try to confuse it by basing it on a well known variation with a different answer.
I mean, genuinely, do you believe that LLMs don't understand grammar? Have you ever interacted with one? Why not test that theory outside of adversarial examples that humans fall for as well?