Again, that's with text alone.
Imagine you have a lot more computing resources in a multimodal LLM. It sees your request to count the syllables and realizes it can't do that from text alone (hell, I can't either and have to vocalize it). It then routes your request to an audio module and 'says' the sentence, then a listening module that understands syllables 'hears' the sentence and counts them.
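Here's a minimal sketch of that routing idea in Python. Everything in it is hypothetical: the module names are made up, and the vowel-group regex is just a crude stand-in for what a real listening module would do to a waveform.

```python
import re

def tts_synthesize(sentence: str) -> str:
    # Stand-in for a real text-to-speech module: a real one would
    # return a waveform; here the text just passes through.
    return sentence

def count_syllables_from_audio(audio: str) -> int:
    # Stand-in for the listening module: a real one would detect
    # syllable nuclei in audio; this heuristic counts vowel groups.
    return sum(len(re.findall(r"[aeiouy]+", word.lower()))
               for word in audio.split())

def handle_request(sentence: str) -> int:
    # The 'router': the model realizes text alone won't cut it and
    # hands the request off to the audio modules.
    audio = tts_synthesize(sentence)
    return count_syllables_from_audio(audio)

print(handle_request("count the syllables in this sentence"))
```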
This is how it works in most humans. If you do this every day, you'll likely build some kind of mental shortcut to reduce the effort needed, but at the end of the day there's no unsolvable problem on the AI side.