tl;dr: they can sometimes generalise to the next 1 to 10 tokens (digits or operators), but no further.
As far as I know, this kind of short-term "generalisation" on OOD data is standard for neural nets trying to approximate symbolic regression or things like grammars.
I do like that they use "Out of Domain" rather than "Out of Distribution" as the target, though. That makes more sense.