You're just testing the review ability of particular Linux kernel maintainers at a particular point in time. How does that generalize well enough to count as valid research on open source software development in general?
To support a claim that broad, you would need to run this "experiment" hundreds or thousands of times across most major open source projects.