Yes, they collected some data, with error bars. But they are using 25 samples to characterize "human reasoning". Sure, it's a place to start.
The bigger issue is that this pattern (small samples recruited via Mechanical Turk) is a frequent flyer in papers whose claims fail under further scrutiny. It's more common in sociology and psychology than in AI research, but either way it isn't a solid foundation to build a lot of extrapolation on.