NovelQA is a great one! I also like GSM-Symbolic -- a benchmark built by turning quite easy questions into _symbolic templates_ and sampling them repeatedly, varying things like which proper nouns are used, the order in which relevant details appear, and how many irrelevant details are injected and where (the GSM-NoOp variant), things like that.
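To make the idea concrete, here's a toy sketch of what template-based sampling looks like -- this is my own illustration, not the paper's actual code or templates, and all the names and numbers are made up:

```python
import random

# Toy symbolic template in the spirit of GSM-Symbolic: slots for a proper
# noun and two numbers, plus an optional GSM-NoOp-style distractor clause
# that is numerically irrelevant to the answer.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{noop}How many apples does {name} have in total?")

NAMES = ["Sophie", "Omar", "Mei", "Lucas"]
NOOPS = ["", "Five of the apples are slightly smaller than average. "]

def sample_instance(rng):
    """Sample one concrete question and its ground-truth answer."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b,
                               noop=rng.choice(NOOPS))
    return question, a + b

rng = random.Random(0)
q, ans = sample_instance(rng)
print(q)
print("answer:", ans)
```

The point is that every sampled instance is the _same_ problem to a human, so accuracy that swings with the choice of name or the presence of the distractor is measuring something other than reasoning.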
LLMs are far, _far_ below human performance on elementary problems once you allow any variation and stop spoonfeeding them perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
The paper came out in October 2024; I don't think many have fully absorbed the implications.
It's hard to take any claim that "LLMs can do reasoning!" seriously once you understand that simply changing the names used in an 8th-grade math word problem can have a dramatic impact on accuracy.