Where A/B studies may go wrong, in my view, is in a few other areas:
- A/B studies have difficulty isolating effects that depend on multiple interacting characteristics. In fairness, so does empirical science, and the principle of "holding all else constant" is a frequent assumption of scientific processes.
- A/B studies face an inherent self-selection / exclusion bias: the participants in this round of A/B testing are those who've not been driven off the project/product by past experiments and design changes. Given that many Web 2.0 companies eventually dance with pushing people right up to the border of tolerance, it's quite possible that A/B testing has a long-term effect of pushing those participants whose tolerance has been exceeded out of the study population entirely. I don't know how large a factor this is, though loud / rage quitters are certainly a prominent (if not necessarily large) cohort. Whether or not they're also influential, or perhaps more importantly when they become influential, is another question. Again, this is a fairly common problem with any social experiment, including natural social experiments; see various forms of brain drain and social flight.
- A/B testing tends to focus on short-term changes and behaviours, which may mask longer-term outcomes. This has some overlap with the above, but also with subjects' general response to change. The classic case of this is the Hawthorne effect (<https://en.wikipedia.org/wiki/Hawthorne_effect>).
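
To make the self-selection point above a bit more concrete, here's a toy simulation. It's a sketch under made-up assumptions, not a model of any real product: each user gets a fixed "tolerance" drawn uniformly from [0, 1), each round of testing puts half the remaining users in a treatment arm, and treated users quit for good if the change exceeds their tolerance. The names `run_experiments`, `quit_rate`, `step` and the thresholds are all hypothetical.

```python
import random


def run_experiments(n_users=10_000, rounds=5, step=0.1, seed=0):
    """Return (original, survivors) as lists of per-user tolerance values."""
    rng = random.Random(seed)
    # Assumption: each user has a fixed tolerance drawn uniformly from [0, 1).
    original = [rng.random() for _ in range(n_users)]
    survivors = list(original)
    for r in range(1, rounds + 1):
        threshold = step * r  # each round pushes a little harder than the last
        remaining = []
        for tolerance in survivors:
            in_treatment = rng.random() < 0.5  # 50/50 split into arms A and B
            # Treated users whose tolerance is exceeded quit for good.
            if in_treatment and tolerance < threshold:
                continue
            remaining.append(tolerance)
        survivors = remaining
    return original, survivors


def quit_rate(users, threshold):
    """Fraction of users who would quit if pushed to the given threshold."""
    return sum(t < threshold for t in users) / len(users)


original, survivors = run_experiments()
print(f"users remaining: {len(survivors)} of {len(original)}")
print(f"quit rate at 0.5, original population: {quit_rate(original, 0.5):.2f}")
print(f"quit rate at 0.5, surviving population: {quit_rate(survivors, 0.5):.2f}")
```

In this toy setup, a later experiment run on the survivors sees a noticeably lower quit rate than the same change would have caused in the original population, simply because the least tolerant users have already left. The specific numbers mean nothing; the direction of the bias is the point.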
The upshot is that A/B testing can be valid and useful, but that experimental design is exceptionally thorny, particularly for social and psychological experiments, where subjects feed back into the study and its methodology.