> And start saying: “There’s a 5% chance that these results are total bullshit.”
Argh, no, no, no and no!
95% significance is NOT 95% probability! When you select a confidence level of 95%, the probability that your results are nonsense is ZERO or ONE. There is no probability statement associated with it. Just because something is unknown does not mean that you can make a probability statement about it, and the mathematics of statistical testing depends on the assumption that the parameter being tested is not random, merely unknown...
Rather, 95% statistical significance means, we got this number from a procedure that 95% of the time produces the right thing, but we have no idea whether this particular number we got is correct or not.
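To make that concrete, here's a minimal simulation sketch (a made-up conversion-rate scenario, numpy only): 95% of the intervals the procedure produces cover the true value, but any single interval you're handed either does or doesn't, and you can't tell which.

```python
# Minimal sketch (assumed scenario): estimate a conversion rate whose true
# value we happen to know, and see how often a 95% confidence interval
# actually contains it. Any single interval either contains it or it doesn't;
# only the long-run procedure has the "95%" property.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.10          # assumed "true" conversion rate
n_visitors = 2000         # sample size per experiment
n_experiments = 10_000

hits = 0
for _ in range(n_experiments):
    conversions = rng.binomial(n_visitors, true_rate)
    p_hat = conversions / n_visitors
    # Normal-approximation 95% CI for a proportion
    se = np.sqrt(p_hat * (1 - p_hat) / n_visitors)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    hits += (lo <= true_rate <= hi)

print(f"Coverage over many repetitions: {hits / n_experiments:.3f}")  # ~0.95
```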
UNLESS!
Unless you're doing Bayesian stats. But in that case your procedure looks completely different and produces credible intervals (which really are probability statements) instead of confidence intervals, and you don't talk about statistical significance at all, but about raw probabilities.
The original post is incorrect about the probabilistic interpretation of the 95% confidence interval, but this interpretation is also wrong.
In classical statistics, p<0.05 means that, if there is no difference in our sample populations (i.e. the null hypothesis), then the probability of observing a difference at least this extreme is less than 0.05.
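A minimal sketch of that definition with made-up data (a hand-rolled permutation test rather than any particular library routine): shuffle the group labels to simulate "no difference", and count how often the shuffled difference is at least as extreme as the one observed.

```python
# Minimal sketch with assumed data: a permutation test makes the p-value
# definition concrete. Under the null ("no difference between A and B"),
# group labels are arbitrary, so we shuffle them and ask how often the
# shuffled difference is at least as extreme as the one we observed.
import numpy as np

rng = np.random.default_rng(1)
# 0/1 conversion outcomes per visitor (assumed example data)
a = rng.binomial(1, 0.10, size=1000)   # control
b = rng.binomial(1, 0.12, size=1000)   # variant

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

n_perm = 20_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[1000:].mean() - pooled[:1000].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed lift = {observed:.4f}, p-value ~ {p_value:.3f}")
```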
> Rather, 95% statistical significance means, we got this number from a procedure that 95% of the time produces the right thing, but we have no idea whether this particular number we got is correct or not.
I.e. We got this number from a procedure and there's a 5% chance it didn't produce the right thing.
Though I'm surprised that his advice wasn't "Report confidence intervals at least". There's much more meaningful information in a point estimate and confidence interval than in "p < 0.05"
My goal was to create a framework which — while less mathematically accurate (hence “rhetorical device”) — helped convey the seriousness of making business decisions based on P = 0.05 to people for whom 95% statistical significance doesn’t mean anything. And clearly, based on reactions here, I failed at that goal.
So, if you’re game, I’ll quickly walk you through my thinking, and you can help me understand where I went wrong.
Best way to contact?
It's intuitively obvious that a result that is unlikely under the null hypothesis constitutes some evidence in favor of the alternative hypothesis, but the precise nature of that relationship depends on information that is not usually available, such as prior estimates of the likelihood that each model is true. If such information is available, you can use Bayesian statistics to answer the question that you really want to ask (e.g. "What is the probability that the alternative hypothesis is true given this data?"), instead of using p-values to answer the only question you are capable of answering, even though that answer isn't a particularly useful one.
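A tiny worked example of that point, with assumed numbers for the prior and the test's power (both invented for illustration): plugging a "significant" result into Bayes' theorem shows how much the answer depends on the prior.

```python
# Minimal sketch with assumed numbers: even a "significant" result can be
# more likely false than true if real effects are rare. This is just Bayes'
# theorem applied to the outcome "test came back significant".
prior_real_effect = 0.10   # assumed: 1 in 10 tested changes actually works
alpha = 0.05               # false positive rate when there is no effect
power = 0.80               # assumed: chance of detecting a real effect

p_sig = power * prior_real_effect + alpha * (1 - prior_real_effect)
p_real_given_sig = power * prior_real_effect / p_sig

print(f"P(real effect | significant) ~ {p_real_given_sig:.2f}")  # ~0.64
```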
For a concrete example, xkcd comes to the rescue: https://xkcd.com/882/
Consider that, when testing the 20 flavors, you expect about one p-value below 0.05 by random chance alone, since 0.05 = 1 in 20. So in this specific case there's actually a very high probability (much higher than 5%, even higher than 50%) that the result is bullshit. But even when you're doing a single test, not 20 of them, a p-value of 0.05 can still correspond to a much higher than 5% chance of bullshit. Or it could be much lower.
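A minimal simulation of the jelly-bean scenario (assumed 10% conversion rate everywhere, scipy's plain t-test standing in for whatever test you'd actually run): with 20 true-null tests, the chance of at least one "significant" result is about 1 - 0.95^20, roughly 64%.

```python
# Minimal sketch: run 20 A/A tests (no real difference anywhere) and see how
# often at least one of them comes back "significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_tests, n = 2000, 20, 500

false_alarm_runs = 0
for _ in range(n_sims):
    significant = False
    for _ in range(n_tests):
        a = rng.binomial(1, 0.10, size=n)
        b = rng.binomial(1, 0.10, size=n)   # same true rate: the null is true
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            significant = True
            break
    false_alarm_runs += significant

print(f"P(at least one 'significant' flavor) ~ {false_alarm_runs / n_sims:.2f}")
```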
Lastly, note that "confidence intervals" are just a statement of the thresholds for p-values. For example, the 95% confidence interval includes your null hypothesis if and only if your p-value is greater than 0.05. So everything I said above about p-values applies equally well to confidence intervals. In particular, "95% confidence interval" does NOT mean "95% confidence that the value is within this interval".
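A small sketch of that duality with made-up data (a pooled two-sample t-test, with the interval computed by hand next to scipy's p-value): the 95% CI for the difference excludes zero exactly when p < 0.05, but it also tells you how big the effect plausibly is.

```python
# Minimal sketch with assumed data: the 95% CI for the difference excludes 0
# exactly when the two-sided p-value is below 0.05 (same test, same math).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=200)   # assumed metric for control
b = rng.normal(10.4, 2.0, size=200)   # assumed metric for variant

t, p = stats.ttest_ind(a, b, equal_var=True)

diff = b.mean() - a.mean()
n_a, n_b = len(a), len(b)
sp2 = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
tcrit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci = (diff - tcrit * se, diff + tcrit * se)

print(f"p = {p:.4f}, 95% CI for lift = ({ci[0]:.3f}, {ci[1]:.3f})")
print("CI excludes 0:", not (ci[0] <= 0 <= ci[1]), "| p < 0.05:", p < 0.05)
```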
If you want to ask me some more questions, email me at rct at thompsonclan dot org.
In frequentist thinking, p=0.05 means that if there were in reality no difference between your A and B and you repeated the experiment many times, 5% of the observed differences would be equal to or greater than the difference you just measured.
No probabilistic statement about the results being correct or incorrect can be made from a Null-Hypothesis significance test.
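Taking that definition literally, here's a quick simulation sketch (assumed identical 10% conversion rates and a normal-approximation critical value): repeat the A/A experiment many times and check what fraction of the observed differences reach the size that would be reported as p = 0.05.

```python
# Minimal sketch: with no real difference between A and B, repeat the
# experiment many times and measure how often the observed difference
# reaches the size that would be reported as p = 0.05.
import numpy as np

rng = np.random.default_rng(4)
rate, n = 0.10, 2000                      # assumed identical true rates
se = np.sqrt(2 * rate * (1 - rate) / n)   # SE of the difference under the null
critical_diff = 1.96 * se                 # the difference that gives p ~ 0.05

repeats = 20_000
diffs = rng.binomial(n, rate, repeats) / n - rng.binomial(n, rate, repeats) / n
frac = np.mean(np.abs(diffs) >= critical_diff)
print(f"Fraction of null repeats at least as extreme: {frac:.3f}")  # ~0.05
```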
If you have a few minutes to spare, I would very much welcome your thoughts so that I can either correct the article, or take it down - The last thing I want is for it to sit out there on the open internet as misinformation.
My goal was to create a framework which — while less mathematically accurate (hence “rhetorical device”) — helped convey the seriousness of making business decisions based on P = 0.05 to people for whom 95% statistical significance doesn’t mean anything. And clearly, based on reactions here, I failed at that goal.
So, if you’re game, I’ll quickly walk you through my thinking, and you can help me understand where I went wrong. Best way to contact?
This is correct and the original post is wrong.
The p-value (in my understanding) makes a statement about what would occur if the experiment were repeated infinitely many times.
I've outlined my favored approach here[0], where the problem is basically treated as one of Bayesian parameter estimation (a rough sketch follows the footnote below). Benefits include:
1. Output is a range of possible improvements so you can reason about risk/reward for calling a test early.
2. Allows the use of prior information to prevent very early stopping, and provide better estimates early on.
3. Every piece of the testing setup is, imho, easy to understand (ignore this benefit if you can comfortably derive Student's T-distribution from first principles)
[0] https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-test...
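For what it's worth, a rough sketch of that kind of setup (made-up conversion counts and a made-up Beta prior, not the exact code from the linked post): the output is a posterior distribution over the lift, which is what benefits 1 and 2 are about.

```python
# Rough sketch of Beta-Binomial Bayesian A/B testing (assumed data and priors).
# The output is a distribution over the relative improvement, not a p-value.
import numpy as np

rng = np.random.default_rng(5)

# Assumed observed data
visitors_a, conversions_a = 1000, 100
visitors_b, conversions_b = 1000, 118

# Weakly informative prior centred near an assumed historical ~10% conversion
# rate (benefit 2: priors discourage calling the test on a handful of visitors)
prior_alpha, prior_beta = 10, 90

post_a = rng.beta(prior_alpha + conversions_a,
                  prior_beta + visitors_a - conversions_a, size=100_000)
post_b = rng.beta(prior_alpha + conversions_b,
                  prior_beta + visitors_b - conversions_b, size=100_000)

lift = (post_b - post_a) / post_a
print(f"P(B beats A) ~ {np.mean(post_b > post_a):.3f}")
print(f"Median lift ~ {np.median(lift):.1%}, "
      f"90% interval ({np.percentile(lift, 5):.1%}, {np.percentile(lift, 95):.1%})")
```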
How many of the recent YC graduates fail at basic numeracy? Does node.js mean you don't have to understand data structures and algorithms to successfully "preneur" too?
I mean, in finance this wouldn't fly. Or in consulting. So there's adverse selection to worry about too.
When we're A/B testing code, the code is already written. If there's a 5%, or even 15% chance of it being bullshit, who cares? The effort is usually exactly the same if I switch or not.
It's my understanding that thresholds like 95% or 99% were established for decisions where acting on the result carries extra cost. We don't want to spend extra time developing and marketing a new drug if it isn't effective. We don't want to tell people to do A instead of B if we aren't sure A is really better than B.
But in software I've already spent all the time I need to implement the variation on the feature. So given that, why do I need 95%?
I would appreciate if someone with more knowledge can answer this question.
Edit to add: I see a lot of answers about the cost to keep the code around. What about A/B tests that don't require extra code, just different code? Most of our A/B tests fall into this category.
validation of upside vs validation of downside
as in: i want to avoid pushing something that is worse but i am optimistic (up to even indifferent) about how much something is better
personal opinion: data trains gut-feeling
If you try 100 tests, and pick the 5 that pass the Statistically Significant threshold, most likely all 5 are BS.
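Whether "most" or "all" of those 5 are bullshit depends on how rare real improvements are among the ideas you tested; here's a back-of-the-envelope sketch with assumed numbers:

```python
# Minimal sketch with assumed numbers: how many of the "winners" from 100
# tests are false positives depends on the base rate of real improvements.
base_rate = 0.05   # assumed: 5 of the 100 ideas genuinely work
power = 0.50       # assumed: modest power to detect the real ones
alpha = 0.05
n_tests = 100

true_hits = n_tests * base_rate * power          # real effects that pass
false_hits = n_tests * (1 - base_rate) * alpha   # null effects that pass
print(f"Expected 'significant' results: {true_hits + false_hits:.1f}, "
      f"of which {false_hits / (true_hits + false_hits):.0%} are false positives")
```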
"If you’re running squeaky clean A/B tests at 95% statistical significance and you run 20 tests this year, odds are one of the results you report (and act on) is going to be straight up wrong."