First, the statistical tests used in these experiments don't make use of Bayesian statistics, so the prior 50% probability of a loaded die simply isn't factored in. The standard is null-hypothesis significance testing, which asks, roughly: if the null hypothesis is true -- that is, if there is no actual difference between the populations (experimental groups A and B, for example) -- what is the probability of seeing a pattern like the one observed in the data? And the tests take sample size into account in calculating this probability.
If you throw the die once, the test you'd use here (chi-square) would _never_ give you a false positive, that is, a p-value below .05. With a sample that small, there is too little power to reach the requisite p-value. (And I'll note that chi-square is one of the tests used in these papers.)
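You can check the single-throw case directly. A minimal sketch using `scipy.stats.chisquare`, assuming a fair six-sided die as the null hypothesis:

```python
from scipy.stats import chisquare

# One throw of a six-sided die: the observed counts are as extreme
# as a single observation can possibly be -- all mass on one face.
observed = [1, 0, 0, 0, 0, 0]

# chisquare defaults to a uniform expected distribution, i.e. the
# null hypothesis of a fair die (expected count 1/6 per face).
stat, p = chisquare(observed)

print(f"chi-square = {stat:.2f}, p = {p:.3f}")
# Even this most extreme one-throw outcome does not reach p < .05.
```

Here the chi-square statistic works out to 5.0 on 5 degrees of freedom, giving p around 0.42 -- nowhere near significance, no matter which face comes up.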
There's a whole other debate about whether p-values and null-hypothesis tests are the right tool, whether the standard 0.05 threshold is small enough, whether Bayesian statistics should be used, and so on. These are legitimate issues. But they're separate from the claim that small samples increase the likelihood of a false positive.