Second, the “run-in-parallel” approach has a well-defined name in experimental design: a factorial design. The diagram shown is an example of a full factorial design, in which each level of each factor is combined with each level of all other factors. The advantage of such a design is that interactions between factors can be tested as well. If there are good reasons to believe that there are no interactions between the different factors, then you could use a fractional (partial) factorial design, which has the advantage of fewer total combinations of levels while still enabling estimation of the effects of individual factors.
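As a rough illustration (the factor names are invented for the example), a full factorial design just enumerates the cross-product of every factor's levels:

```python
from itertools import product

# Hypothetical factors and levels -- the names are illustrative only.
factors = {
    "button_color": ["blue", "green"],
    "headline": ["short", "long"],
    "layout": ["single_column", "two_column"],
}

# Full factorial: every level of every factor crossed with every other.
full_factorial = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(full_factorial))  # 2 * 2 * 2 = 8 cells
```

A fractional factorial design would run only a carefully chosen subset of those 8 cells, trading away the ability to estimate (some) interactions for a smaller experiment.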
There are so many strong biases people have about different parts of UI/UX. One of the significant benefits of A/B testing is that it lets you move ahead as a team and make decisions even when there are strongly differing opinions. In these cases you can just "A/B test" and let the data decide.
But if you are using Bayesian approaches you'll transition those internal arguments to what the prior should be and it will be harder to get alignment based on the data.
You can present your Bayesian approaches in such a way that it's almost independent of the prior. Your output will be 'this experiment should shift your odds-ratio by so-and-so-many logits in this or that direction' instead of an absolute probability.
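A minimal sketch of what "shift your odds by so many logits" means, with an invented likelihood ratio: the experiment's evidence adds a fixed amount to the log-odds, and each reader can then apply that shift to their own prior.

```python
from math import exp, log

# Hypothetical: the observed data is 4x as likely under "A is better"
# than under "B is better" (a likelihood ratio / Bayes factor of 4).
bayes_factor = 4.0
shift_in_logits = log(bayes_factor)  # the prior-independent part of the result

def posterior_prob(prior_prob):
    """Apply the experiment's log-odds shift to any reader's prior."""
    prior_logit = log(prior_prob / (1 - prior_prob))
    post_logit = prior_logit + shift_in_logits
    return 1 / (1 + exp(-post_logit))

print(posterior_prob(0.5))  # a 50/50 prior becomes 0.8
print(posterior_prob(0.2))  # a skeptical 0.2 prior becomes 0.5
```

The point is that the `shift_in_logits` number is the same for everyone; only the starting prior differs between colleagues.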
The frequentist/bayesian debate is not one I understand well enough to opine - do you have any reading you'd recommend for this topic?
To brutally simplify the distinction: using frequentist statistics and testing, you are addressing the question of whether, based on the results, you can reject the hypothesis that there is no difference between two conditions (e.g., A and B in A/B testing). The p-value broadly gives you the probability of seeing data at least as extreme as yours if A and B were sampled from the same distribution. If this is really low, then you can reject the null hypothesis and claim that there is a statistically significant difference between the two conditions.
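A concrete (stdlib-only) sketch of the frequentist version, using a two-proportion z-test on invented conversion counts:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of the null that A and B share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical data: A converts 200/1000, B converts 240/1000.
z, p = two_proportion_z_test(200, 1000, 240, 1000)
print(z, p)  # p lands under 0.05 here, so we'd reject the null
```

Note this only tells you the difference is unlikely under "no difference"; it does not directly give the probability that B is better.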
In comparison, using Bayesian statistics, you can estimate the probability of a specific hypothesis, e.g. the hypothesis that A is better than B. You start with a prior belief (prior) in your hypothesis and then compute the posterior probability, which is the prior adjusted for the additional empirical evidence that you have collected. The results that you get can help you address a number of questions. For instance, (i) what is the probability that in general A leads to better results than B? Or, related (but substantially different), (ii) what is the probability that in any specific case using A gives you a higher chance of success than using B? To illustrate the difference: the probability that men in general are taller than women approaches 100%. However, if you randomly pick a man and a woman, the probability that the man will be taller than the woman is substantially lower.
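A minimal sketch of question (i) for conversion rates, assuming flat Beta(1, 1) priors and invented counts, using the standard Beta-Binomial conjugate update plus Monte Carlo:

```python
import random

random.seed(0)

# Flat Beta(1, 1) priors; hypothetical data: A converts 240/1000, B 200/1000.
# Conjugate update: posterior is Beta(prior_a + successes, prior_b + failures).
post_a = (1 + 240, 1 + 1000 - 240)  # Beta(241, 761)
post_b = (1 + 200, 1 + 1000 - 200)  # Beta(201, 801)

# Monte Carlo estimate of P(rate_A > rate_B): sample both posteriors, compare.
draws = 100_000
wins = sum(
    random.betavariate(*post_a) > random.betavariate(*post_b)
    for _ in range(draws)
)
print(wins / draws)  # close to 1: high posterior probability that A is better
```

The output is a direct probability statement about the hypothesis, which is exactly what the frequentist p-value is not.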
In your A/B testing, if the cost of A is higher, addressing question (ii) would be more informative than question (i). You can be quite sure that A is in general better than B; however, is the difference big enough to offset the higher cost?
Related to that, in Bayesian statistics you can define a Region of Practical Equivalence (ROPE) - in short, the range of differences between A and B that could be due to measurement error, or that would in practice be of no use. You can then check in what proportion of cases the difference would fall within the ROPE. If that proportion is high enough (e.g. 90%) then you can conclude that in practice it makes no difference whether you use A or B. In frequentist terms, Bayesian statistics allow you to confirm a null hypothesis, something that is impossible using frequentist statistics.
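A sketch of the ROPE check, with hypothetical counts (A: 240/1000, B: 200/1000, flat Beta(1, 1) priors) and an invented ROPE of ±1 percentage point:

```python
import random

random.seed(0)

# Beta posteriors for the two conversion rates (flat priors + hypothetical data).
post_a = (241, 761)
post_b = (201, 801)

# ROPE: rate differences smaller than +/- 1 percentage point are treated as
# "practically equivalent".
rope = 0.01
draws = 100_000
in_rope = sum(
    abs(random.betavariate(*post_a) - random.betavariate(*post_b)) < rope
    for _ in range(draws)
) / draws
print(in_rope)  # a low proportion: we cannot claim practical equivalence here
```

With this data most of the posterior mass falls outside the ROPE, so you would conclude the difference is practically meaningful; if `in_rope` had come out above your cutoff (e.g. 0.9), you could declare A and B practically equivalent.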
In regards to priors - which another person has mentioned - if you do not have a specific reason to believe beforehand that A might be better than B or vice versa, you can use a relatively uninformative prior, basically saying, “I don’t really have a clue which might be better”. So the issue of priors should not discourage you from using Bayesian statistics.
Yes: you could use bayesian priors and a custom model to give yourself more confidence from less data. But...
Don't: for most businesses that are so early they can't get enough users to hit stat-sig, you're likely better off directing your engineering efforts towards making the product better instead of building custom statistical models. This is nerd-sniping-adjacent (https://xkcd.com/356/), a common trap engineers can fall into: it's more fun to solve the novel technical problem than the actual business problem.
Though: there are a small set of companies with large scale but small data, for whom the custom stats approaches _do_ make sense. When I was at Opendoor, even though we had billions of dollars of GMV, we only bought a few thousand homes a month, so the Data Science folks used fun statistical approaches like Pair Matching (https://www.rockstepsolutions.com/blog/pair-matching/) and CUPED (now available off the shelf - https://www.geteppo.com/features/cuped) to squeeze a bit more signal from less data.
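For the curious, the core of CUPED is small enough to sketch in a few lines: subtract from the experiment metric the component that a pre-experiment covariate already predicts. The data below is invented; real implementations (like the Eppo one linked) handle much more.

```python
import random
from statistics import mean, variance

random.seed(0)

# Hypothetical data: x is each user's pre-experiment metric, y the in-experiment
# metric, and the two are strongly correlated.
x = [random.gauss(100, 15) for _ in range(2000)]
y = [0.8 * xi + random.gauss(0, 5) for xi in x]

# CUPED: theta = cov(x, y) / var(x), then y_cuped = y - theta * (x - mean(x)).
x_bar, y_bar = mean(x), mean(y)
theta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
y_cuped = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

# Same mean, far lower variance -- so the same experiment reaches significance
# with fewer samples.
print(variance(y), variance(y_cuped))
```

The variance reduction is roughly the squared correlation between `x` and `y`, which is why a good pre-period covariate buys so much signal.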
I always say in my profession I will fit models for free, it’s having to clean data and “finish” a project that I demand payment.
This is a key point, imo. I have a sneaking suspicion that a lot of companies are running "growth teams" that don't have the scale where it actually makes sense to do so.
Some growth teams use simpler, more exploratory approaches to find something that resonates. Others rely on A/B tests. Different profiles, but both are “Growth teams”.
It's not unreasonable to assume it's a sample, I just don't think it's worth getting paralyzed by worrying about whether or not you have power, or getting into hacky tricks to try to fix it.
...but most power calculations are also sort of bullshit.
Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments, so your significance calculation is now off. Second, xkcd 882.
I get that a bunch from some of my clients. It's a common misconception. Let's say experiment B is 10% better than control, but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.
If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.
NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, as I called out in the guide. But better to spend half a day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
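The "both branches of C drift up" effect is easy to see in a toy simulation (all numbers invented): B's variant adds a 10% lift, C does nothing, and users are randomized into each experiment independently.

```python
import random
from statistics import mean

random.seed(0)

# Each user is independently assigned to B and to C; only B's variant has an
# effect (+10% on a baseline metric of 1.0).
users = []
for _ in range(100_000):
    in_b_variant = random.random() < 0.5
    in_c_variant = random.random() < 0.5
    metric = 1.10 if in_b_variant else 1.0
    users.append((in_b_variant, in_c_variant, metric))

c_variant = mean(m for _, c, m in users if c)
c_control = mean(m for _, c, m in users if not c)

# Both of C's branches contain ~50% B-variant users, so both sit ~5% above the
# pre-experiment baseline -- but the C-vs-C comparison itself stays unbiased.
print(c_control, c_variant)
```

This is exactly why a pre/post comparison misleads while the within-experiment comparison does not.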
> Second, xkcd 882 (https://xkcd.com/882/) I think you're referencing P-hacking, right?
That is a valid concern to be vigilant for. In this case, XKCD is calling out the "find a subgroup that happens to be positive" hack (also here, https://xkcd.com/1478/). However, here we're testing (a) 3 different ideas and (b) only testing each of them once on the entire population. No p-hacking here (far as I can tell, happy to learn otherwise), but good that you're keeping an eye out for it.
And the more experiments you run, whether in parallel or sequentially, the more likely you are to get at least one false positive, i.e. p-hacking. XKCD is using "find a subgroup that happens to be positive" to make it funnier, but it's simply "find an experiment that happens to be positive". To correct for this, you would have to lower your significance threshold for each experiment, requiring a larger sample size, negating the benefits you thought you were getting by running more experiments with the same samples.
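The arithmetic behind that point, assuming independent experiments each run at the usual alpha = 0.05:

```python
# Chance of at least one false positive across n independent null experiments,
# plus the Bonferroni-corrected per-experiment threshold that restores 0.05
# family-wise.
alpha = 0.05
for n in (1, 3, 10, 20):
    family_wise = 1 - (1 - alpha) ** n
    bonferroni = alpha / n
    print(n, round(family_wise, 3), bonferroni)
```

At 10 experiments you are already at a ~40% chance of at least one spurious "win", and the Bonferroni-corrected threshold of 0.005 per experiment is what drives up the required sample sizes.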