> And start saying: “There’s a 5% chance that these results are total bullshit.”
Argh, no, no, no and no!
95% significance is NOT 95% probability! When you select a confidence level of 95%, the probability that your results are nonsense is ZERO or ONE. There is no probability statement associated with it. Just because something is unknown does not mean that you can make a probability statement about it, and the mathematics of statistical testing depends on the assumption that the parameter being tested is not random, merely unknown...
Rather, 95% statistical significance means, we got this number from a procedure that 95% of the time produces the right thing, but we have no idea whether this particular number we got is correct or not.
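To make that concrete, here's a minimal simulation sketch (a made-up conversion-rate scenario, numpy only): 95% of the intervals the procedure produces cover the true value, but any single interval you're handed either does or doesn't, and you can't tell which.

```python
# Minimal sketch (assumed scenario): estimate a conversion rate whose true
# value we happen to know, and see how often a 95% confidence interval
# actually contains it. Any single interval either contains it or it doesn't;
# only the long-run procedure has the "95%" property.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.10          # assumed "true" conversion rate
n_visitors = 2000         # sample size per experiment
n_experiments = 10_000

hits = 0
for _ in range(n_experiments):
    conversions = rng.binomial(n_visitors, true_rate)
    p_hat = conversions / n_visitors
    # Normal-approximation 95% CI for a proportion
    se = np.sqrt(p_hat * (1 - p_hat) / n_visitors)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    hits += (lo <= true_rate <= hi)

print(f"Coverage over many repetitions: {hits / n_experiments:.3f}")  # ~0.95
```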
UNLESS!
Unless you're doing Bayesian stats. But in that case your procedure looks completely different and produces credible intervals (which really are probability statements) instead of confidence intervals, and you don't talk about statistical significance at all, but about raw probabilities.
The original post is incorrect about the probabilistic interpretation of the 95% confidence interval, but this interpretation is also wrong.
In classical statistics, p<0.05 means that, if there is no difference in our sample populations (i.e. the null hypothesis), then the probability of observing a difference at least this extreme is less than 0.05.
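A minimal sketch of that definition with made-up data (a hand-rolled permutation test rather than any particular library routine): shuffle the group labels to simulate "no difference", and count how often the shuffled difference is at least as extreme as the one observed.

```python
# Minimal sketch with assumed data: a permutation test makes the p-value
# definition concrete. Under the null ("no difference between A and B"),
# group labels are arbitrary, so we shuffle them and ask how often the
# shuffled difference is at least as extreme as the one we observed.
import numpy as np

rng = np.random.default_rng(1)
# 0/1 conversion outcomes per visitor (assumed example data)
a = rng.binomial(1, 0.10, size=1000)   # control
b = rng.binomial(1, 0.12, size=1000)   # variant

observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])

n_perm = 20_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[1000:].mean() - pooled[:1000].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed lift = {observed:.4f}, p-value ~ {p_value:.3f}")
```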
> Rather, 95% statistical significance means, we got this number from a procedure that 95% of the time produces the right thing, but we have no idea whether this particular number we got is correct or not.
I.e. We got this number from a procedure and there's a 5% chance it didn't produce the right thing.
Though I'm surprised that his advice wasn't "Report confidence intervals at least". There's much more meaningful information in a point estimate and confidence interval than in "p < 0.05"
My goal was to create a framework which — while less mathematically accurate (hence “rhetorical device”) — helped convey the seriousness of making business decisions based on P = 0.05 to people for whom 95% statistical significance doesn’t mean anything. And clearly, based on reactions here, I failed at that goal.
So, if you’re game, I’ll quickly walk you through my thinking, and you can help me understand where I went wrong.
Best way to contact?
It's intuitively obvious that a result that is unlikely under the null hypothesis constitutes some evidence in favor of the alternative hypothesis, but the precise nature of that relationship depends on information that is not usually available, such as prior estimates of the likelihood that each model is true. If such information is available, you can use Bayesian statistics to answer the question that you really want to ask (e.g. "What is the probability that the alternative hypothesis is true given this data?"), instead of using p-values to answer the only question you are capable of answering, even though that answer isn't a particularly useful one.
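A tiny worked example of that point, with assumed numbers for the prior and the test's power (both invented for illustration): plugging a "significant" result into Bayes' theorem shows how much the answer depends on the prior.

```python
# Minimal sketch with assumed numbers: even a "significant" result can be
# more likely false than true if real effects are rare. This is just Bayes'
# theorem applied to the outcome "test came back significant".
prior_real_effect = 0.10   # assumed: 1 in 10 tested changes actually works
alpha = 0.05               # false positive rate when there is no effect
power = 0.80               # assumed: chance of detecting a real effect

p_sig = power * prior_real_effect + alpha * (1 - prior_real_effect)
p_real_given_sig = power * prior_real_effect / p_sig

print(f"P(real effect | significant) ~ {p_real_given_sig:.2f}")  # ~0.64
```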
For a concrete example, xkcd comes to the rescue: https://xkcd.com/882/
Consider that, when testing the 20 flavors, you expect about one p-value below 0.05 by random chance alone, since 0.05 = 1 in 20. So in this specific case there's actually a very high probability (much higher than 5%, even higher than 50%) that the result is bullshit. But even when you're doing a single test, not 20 of them, a p-value of 0.05 can still correspond to a much higher than 5% chance of bullshit. Or it could be much lower.
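A minimal simulation of the jelly-bean scenario (assumed 10% conversion rate everywhere, scipy's plain t-test standing in for whatever test you'd actually run): with 20 true-null tests, the chance of at least one "significant" result is about 1 - 0.95^20, roughly 64%.

```python
# Minimal sketch: run 20 A/A tests (no real difference anywhere) and see how
# often at least one of them comes back "significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_tests, n = 2000, 20, 500

false_alarm_runs = 0
for _ in range(n_sims):
    significant = False
    for _ in range(n_tests):
        a = rng.binomial(1, 0.10, size=n)
        b = rng.binomial(1, 0.10, size=n)   # same true rate: the null is true
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            significant = True
            break
    false_alarm_runs += significant

print(f"P(at least one 'significant' flavor) ~ {false_alarm_runs / n_sims:.2f}")
```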
Lastly, note that "confidence intervals" are just a statement of the thresholds for p-values. For example, the 95% confidence interval includes your null hypothesis if and only if your p-value is greater than 0.05. So everything I said above about p-values applies equally well to confidence intervals. In particular, "95% confidence interval" does NOT mean "95% confidence that the value is within this interval".
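A small sketch of that duality with made-up data (a pooled two-sample t-test, with the interval computed by hand next to scipy's p-value): the 95% CI for the difference excludes zero exactly when p < 0.05, but it also tells you how big the effect plausibly is.

```python
# Minimal sketch with assumed data: the 95% CI for the difference excludes 0
# exactly when the two-sided p-value is below 0.05 (same test, same math).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=200)   # assumed metric for control
b = rng.normal(10.4, 2.0, size=200)   # assumed metric for variant

t, p = stats.ttest_ind(a, b, equal_var=True)

diff = b.mean() - a.mean()
n_a, n_b = len(a), len(b)
sp2 = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
se = np.sqrt(sp2 * (1 / n_a + 1 / n_b))
tcrit = stats.t.ppf(0.975, df=n_a + n_b - 2)
ci = (diff - tcrit * se, diff + tcrit * se)

print(f"p = {p:.4f}, 95% CI for lift = ({ci[0]:.3f}, {ci[1]:.3f})")
print("CI excludes 0:", not (ci[0] <= 0 <= ci[1]), "| p < 0.05:", p < 0.05)
```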
If you want to ask me some more questions, email me at rct at thompsonclan dot org.
In frequentist thinking, p=0.05 means that if there were in reality no difference between your A and B and you repeated the experiment many times, 5% of the observed differences would be equal to or greater than the difference you just measured.
No probabilistic statement about the results being correct or incorrect can be made from a Null-Hypothesis significance test.
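Taking that definition literally, here's a quick simulation sketch (assumed identical 10% conversion rates and a normal-approximation critical value): repeat the A/A experiment many times and check what fraction of the observed differences reach the size that would be reported as p = 0.05.

```python
# Minimal sketch: with no real difference between A and B, repeat the
# experiment many times and measure how often the observed difference
# reaches the size that would be reported as p = 0.05.
import numpy as np

rng = np.random.default_rng(4)
rate, n = 0.10, 2000                      # assumed identical true rates
se = np.sqrt(2 * rate * (1 - rate) / n)   # SE of the difference under the null
critical_diff = 1.96 * se                 # the difference that gives p ~ 0.05

repeats = 20_000
diffs = rng.binomial(n, rate, repeats) / n - rng.binomial(n, rate, repeats) / n
frac = np.mean(np.abs(diffs) >= critical_diff)
print(f"Fraction of null repeats at least as extreme: {frac:.3f}")  # ~0.05
```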
If you have a few minutes to spare, I would very much welcome your thoughts so that I can either correct the article, or take it down - The last thing I want is for it to sit out there on the open internet as misinformation.
My goal was to create a framework which — while less mathematically accurate (hence “rhetorical device”) — helped convey the seriousness of making business decisions based on P = 0.05 to people for whom 95% statistical significance doesn’t mean anything. And clearly, based on reactions here, I failed at that goal.
So, if you’re game, I’ll quickly walk you through my thinking, and you can help me understand where I went wrong. Best way to contact?
This is correct and the original post is wrong.
The p-value (in my understanding) makes a statement about what would occur if the experiment were repeated infinitely many times.
I've outlined my favored approach here[0], where the problem is basically treated as one of Bayesian parameter estimation (a rough sketch follows the footnote below). Benefits include:
1. Output is a range of possible improvements so you can reason about risk/reward for calling a test early.
2. Allows the use of prior information to prevent very early stopping, and provide better estimates early on.
3. Every piece of the testing setup is, imho, easy to understand (ignore this benefit if you can comfortably derive Student's T-distribution from first principles)
[0] https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-test...
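For what it's worth, a rough sketch of that kind of setup (made-up conversion counts and a made-up Beta prior, not the exact code from the linked post): the output is a posterior distribution over the lift, which is what benefits 1 and 2 are about.

```python
# Rough sketch of Beta-Binomial Bayesian A/B testing (assumed data and priors).
# The output is a distribution over the relative improvement, not a p-value.
import numpy as np

rng = np.random.default_rng(5)

# Assumed observed data
visitors_a, conversions_a = 1000, 100
visitors_b, conversions_b = 1000, 118

# Weakly informative prior centred near an assumed historical ~10% conversion
# rate (benefit 2: priors discourage calling the test on a handful of visitors)
prior_alpha, prior_beta = 10, 90

post_a = rng.beta(prior_alpha + conversions_a,
                  prior_beta + visitors_a - conversions_a, size=100_000)
post_b = rng.beta(prior_alpha + conversions_b,
                  prior_beta + visitors_b - conversions_b, size=100_000)

lift = (post_b - post_a) / post_a
print(f"P(B beats A) ~ {np.mean(post_b > post_a):.3f}")
print(f"Median lift ~ {np.median(lift):.1%}, "
      f"90% interval ({np.percentile(lift, 5):.1%}, {np.percentile(lift, 95):.1%})")
```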
How many of the recent YC graduates fail at basic numeracy? Does node.js mean you don't have to understand data structures and algorithms to successfully "preneur" too?
I mean, in finance this wouldn't fly. Or in consulting. So there's adverse selection to worry about too.
When we're A/B testing code, the code is already written. If there's a 5%, or even 15% chance of it being bullshit, who cares? The effort is usually exactly the same if I switch or not.
It's my understanding that thresholds like 95% or 99% were established for decisions where acting on the result carries extra cost. We don't want to spend extra time developing and marketing a new drug if it isn't effective. We don't want to tell people to do A instead of B if we aren't sure A is really better than B.
But in software I've already spent all the time I need to implement the variation on the feature. So given that, why do I need 95%?
I would appreciate if someone with more knowledge can answer this question.
Edit to add: I see a lot of answers about the cost to keep the code around. What about A/B tests that don't require extra code, just different code? Most of our A/B tests fall into this category.
validation of upside vs validation of downside
as in: i want to avoid pushing something that is worse but i am optimistic (up to even indifferent) about how much something is better
personal opinion: data trains gut-feeling
If you try 100 tests, and pick the 5 that pass the Statistically Significant threshold, most likely all 5 are BS.
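Whether "most" or "all" of those 5 are bullshit depends on how rare real improvements are among the ideas you tested; here's a back-of-the-envelope sketch with assumed numbers:

```python
# Minimal sketch with assumed numbers: how many of the "winners" from 100
# tests are false positives depends on the base rate of real improvements.
base_rate = 0.05   # assumed: 5 of the 100 ideas genuinely work
power = 0.50       # assumed: modest power to detect the real ones
alpha = 0.05
n_tests = 100

true_hits = n_tests * base_rate * power          # real effects that pass
false_hits = n_tests * (1 - base_rate) * alpha   # null effects that pass
print(f"Expected 'significant' results: {true_hits + false_hits:.1f}, "
      f"of which {false_hits / (true_hits + false_hits):.0%} are false positives")
```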
"If you’re running squeaky clean A/B tests at 95% statistical significance and you run 20 tests this year, odds are one of the results you report (and act on) is going to be straight up wrong."