They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can they be learned instead, and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent specifically. They point out that earlier work doesn't do that, and that gradient descent is an obvious source of inductive bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.
>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.
You could always try it out yourself, you know. Like I say, that's the beauty of grammars: you can generate tons of synthetic data and go to town.
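To make that concrete, here's a minimal sketch of the kind of generator I mean. It's my own toy example, not from the paper: it samples positive and negative examples of the counting language a^n b^n, which is the classic grammar people use to probe whether an architecture can learn to count.

```python
import random

def gen_positive(max_n=10):
    """Sample a string from a^n b^n (a positive example of the counting language)."""
    n = random.randint(1, max_n)
    return "a" * n + "b" * n

def gen_negative(max_n=10):
    """Sample a near-miss string a^n b^m with n != m (a negative example)."""
    n = random.randint(1, max_n)
    m = random.choice([k for k in range(1, max_n + 1) if k != n])
    return "a" * n + "b" * m

def make_dataset(size=1000, seed=0):
    """Build a balanced, shuffled dataset of (string, label) pairs."""
    random.seed(seed)
    data = [(gen_positive(), 1) for _ in range(size // 2)]
    data += [(gen_negative(), 0) for _ in range(size - size // 2)]
    random.shuffle(data)
    return data
```

Swap in whatever grammars you like (Dyck languages, a^n b^n c^n, and so on) and you can churn out as many labeled examples as your training budget allows, which is exactly why this kind of experiment is cheap to run.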
>> You can't deduce much from negative results in research beside it requiring more work.
I disagree. I'm a falsificationist. A positive result is consistent with many explanations, but a failure rules at least one out. The only time we learn anything useful is when stuff fails.