Well, the "feathers heavier than lead" thing is definitely somewhere in the training data.
Imo we should be testing these models' reasoning by presenting things or situations that neither the human nor the machine has seen or experienced.
Think: how often do humans have a truly new experience with no basis in past ones? Very rarely - even learning to ride a bike arguably builds on walking, running, and movement in general.
Even human "creativity" (much ado about nothing) is creating drama in the AI space...but I find this a super interesting topic, as essentially 99.9999% of all human "creativity" is just us rehashing and borrowing heavily from stuff we've seen or encountered in nature. What are elves, dwarves, etc. other than people with slightly unusual features? Even the aliens we create are based on humans/bipeds, squid/sea creatures, dragons/reptiles, etc. How often does human creativity really, _really_ come up with something novel? Almost never!
Edit: I think my overarching point is that we need to come up with better exercises to test these models, but it's almost impossible for us to do this because most of us are incapable of creating purely novel concepts and ideas. AGI perhaps isn't that far off given that humans have been the stochastic parrots all along.