My brain can navigate a computer interface without using word tokens, since I have dedicated "tokens" for navigating the OS, browsers, tabs, etc. That way I don't have to read a million tokens of text to figure out where the buttons are or how to get places; my brain is smart enough not to use words for that.
ChatGPT doesn't have anything like that right now, and until it does, it will always be really bad at this kind of task.
You are using a hand to hammer a nail. That will never go well, and the solution isn't to use more hands; the solution is to wield a hammer.