
0 points dredmorbius 3y ago 0 comments

Others have addressed the ways in which A/B testing does hew quite closely to the standard of empirical observation under controlled circumstances, with which I largely agree.

Where A/B studies may go wrong, in my view, is in a few other areas:

- A/B studies have difficulty in determining differences based on multiple interacting characteristics. In fairness, so does empirical science, and the principle of "holding all else constant" is a frequent assumption of scientific processes.

- A/B studies face an inherent self-selection / exclusion bias: the participants in this round of A/B testing are those who've not been driven off the project/product by past experiments and design changes. Given that many Web 2.0 companies eventually dance with pushing people right up to the border of tolerance, it's quite possible that A/B testing has a long-term effect of pushing those participants whose tolerance has been exceeded out of the study population entirely. I don't know how large a factor this is, though loud / rage quitters are certainly a prominent (if not necessarily large) cohort. Whether or not they're also influential, or, perhaps more importantly, when they become influential, is another question. Again, this is a fairly common problem with any social experiment, including natural social experiments; see various forms of brain drain and social flight.

- A/B testing tends to focus on short-term changes and behaviours, which may mask longer-term outcomes. This has some overlap with the above, but also with subjects' general response to change. The classic case of this is the Hawthorne Effect (<https://en.wikipedia.org/wiki/Hawthorne_effect>).

The upshot is that A/B testing can be valid and useful, but that experimental design is exceptionally thorny, particularly in the case of social and psychological experiments, where subjects feed back into the study and its methodology itself.


TeMPOraL 3y ago

> it's quite possible that A/B testing has a long-term effect of pushing those participants whose tolerance has been exceeded out of the study population entirely.

I'd compare this to how evaporation rate increases with temperature, as more particles find themselves with enough energy to escape the liquid.

From my personal experience, even if I can tolerate a lot of UX abuse, each such "optimization" lowers my threshold of switching to a competitor. Software in general, and SaaS specifically, resists commoditization, but every now and then an actual alternative to a product/service I'm using shows up - and whether or not I switch (and when) is correlated with how much I resent the incumbent for their UX "improvements".

I'd add one bullet point to your list:

- Unlike regular scientific experimentation, A/B testing is a methodology primarily spread in business circles using regular hype channels. That is, the average practitioner is not qualified to execute it correctly, which is one of the reasons I see A/B testing more as a tool to launder arbitrary decisions. Because the consequences of doing it wrong are typically not immediately apparent or obvious, both companies and customers suffer (and a vast space for fraudsters is created).

I'm in a charitable mood, so I'm not passing judgement on people for not having a PhD-level understanding of statistics - just pointing out that, to a much greater degree than in the sciences (even soft ones, which suffer some of the same structural problems), there's little pressure to do such tests correctly (and there are lots of ways to make money or gain status by doing them without regard for correctness).

From what I hear, a common way of executing an A/B test badly and getting bullshit results is terminating the test early when it shows the relevant metrics improving for the test group - vs. running it longer if no big improvements are observed (or the metrics start getting worse for the test group). This biases the experiment towards giving false positives. This problem was big enough that there was a debacle around Optimizely a few years ago, whose UI was accused of promoting this early termination of tests. The cynical take I'm still somewhat partial to is that it wasn't an accident (if not done on purpose, then possibly... a result of an A/B test!) - false positives make the (statistically naive) users feel they're getting more value from Optimizely than they actually are.
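To make the early-termination bias concrete, here's a rough simulation sketch. It is not anyone's actual tooling: the function names, the 10% conversion rate, the batch size, the 20 interim looks and the one-sided 1.645 cutoff are all assumptions picked for illustration. Both arms are identical (an A/A test), so every declared "win" for the test group is a false positive. The "peeking" rule declares a win the first time any interim look crosses the cutoff; the "fixed" rule only checks once, at the planned end.

```python
import math
import random


def z_stat(conv_a, n_a, conv_b, n_b):
    """Two-proportion z statistic (pooled variance); positive means B looks better."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def run_aa_test(rng, rate=0.10, batch=50, looks=20, z_crit=1.645):
    """Simulate one A/A test: both arms share the same true conversion rate.

    Returns (peeking_positive, fixed_positive). All parameters are
    illustrative assumptions, not values from any real product.
    """
    conv_a = conv_b = n = 0
    peeking_positive = False
    z = 0.0
    for _ in range(looks):
        # Each look adds one batch of visitors to each arm.
        conv_a += sum(rng.random() < rate for _ in range(batch))
        conv_b += sum(rng.random() < rate for _ in range(batch))
        n += batch
        z = z_stat(conv_a, n, conv_b, n)
        if z > z_crit:
            # A peeker would stop here and ship variant B.
            peeking_positive = True
    # The fixed-horizon analyst looks only at the final z value.
    fixed_positive = z > z_crit
    return peeking_positive, fixed_positive


def false_positive_rates(trials=1000, seed=0):
    """Fraction of A/A tests declared a win under each stopping rule."""
    rng = random.Random(seed)
    peek = fixed = 0
    for _ in range(trials):
        p, f = run_aa_test(rng)
        peek += p
        fixed += f
    return peek / trials, fixed / trials


if __name__ == "__main__":
    peek_rate, fixed_rate = false_positive_rates()
    print(f"stop at first 'win': {peek_rate:.3f}")
    print(f"fixed horizon:       {fixed_rate:.3f}")
```

The fixed-horizon rate should land near the nominal 5%, while the stop-at-first-win rate comes out well above it, since every extra look is another chance for noise to cross the line. Sequential testing methods exist to correct for this, but they're outside the scope of this sketch.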

dredmorbius (OP) 3y ago

Three excellent points, yes.

There's a reason that the technical term for "A/B testing in SaaS products" is gaslighting.

O hai werd uv yeer!

<https://news.ycombinator.com/item?id=33784262>
