Say I told it to solve a complex puzzle equation that isn't in its training data, and it solved that problem correctly. Because the probability of arriving at that solution by random chance is so low, we know the LLM must have understood the problem and reasoned its way to the solution.
Now you’re saying that if you perturb the input with some grammar changes but leave everything else the same, the LLM will produce a wrong answer. But this doesn’t change the fact that it was able to get the right answer in the first place.
Humans can be dumb and inconsistent. LLMs can be dumb and inconsistent too. This happens to be a quirk of the LLM. But you cannot deny that it is intelligent, given that LLMs can produce output that we know for sure can only be arrived at through reasoning.