I'm not sure what you mean by this. Many quality studies are A/B tests. "A/B" just refers to the two IV states you're testing, against which you then observe a DV: sales, engagement, errors, etc.
A/B tests can be double-blinded (don't tell the error-monitoring people which results are from a trial), and can have a high number of samples, far beyond even most pharmaceutical trials.
They can also be really crappy, changing too many variables at once, etc. But they are certainly "real science".
EDIT: an example, Drug vs placebo - is an A/B test.
For example, take changing the font size of a button:
Your null hypothesis is that there is no difference in the number of clicks. Your alternative hypothesis is that there is an increase in the number of clicks.
Your IV is the button font size. Your DV is the number of button clicks over a set period of time.
You randomly assign 50% of the population to State A (unchanged font size) and put the other group into State B (increased font size).
You observe the number of clicks of the button.
You analyze this data and determine whether the observed difference is statistically significant, i.e., whether you can reject the null hypothesis in favor of the alternative.
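A minimal sketch of that last step in Python, with invented click counts (a chi-squared test here; a one-tailed two-proportion z-test would match the one-sided alternative more directly):

```python
# Hypothetical data: (clicks, non-clicks) per group over the test period.
from scipy.stats import chi2_contingency

group_a = (480, 9520)  # State A: unchanged font size
group_b = (560, 9440)  # State B: increased font size

chi2, p_value, dof, expected = chi2_contingency([group_a, group_b])
print(f"p = {p_value:.4f}")  # reject the null if p < your chosen alpha, e.g. 0.05
```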
Science is more “what’s true if humans didn’t exist.”
Marketing is more “what widget generates more revenue?”
Did it? The nature of Tesla and his other current businesses buffers that a bit even if it has been his approach, and it seems to have gotten him thrown out as CEO of X.com twice. Among other things, Twitter looks like Musk trying to relitigate his failure at X.com without other investors being in a position to kick him out, but he seems to be piling up existential threats without resolving them.
Where A/B studies may go wrong, in my view, is on a few other counts:
- A/B studies have difficulty in determining differences based on multiple interacting characteristics. In fairness, so does empirical science, and the principle of "holding all else constant" is a frequent assumption of scientific processes.
- A/B studies face an inherent self-selection / exclusion bias: the participants in this round of A/B testing are those who've not been driven off the project/product by past experiments and design changes. Given that many Web 2.0 companies eventually dance with pushing people right up to the border of tolerance, it's quite possible that A/B testing has a long-term effect of pushing those participants whose tolerance has been exceeded out of the study population entirely (a toy simulation at the end of this comment sketches this effect). I don't know how large a factor this is, though loud / rage quitters are certainly a prominent (if not necessarily large) cohort. Whether or not they're also influential, or perhaps more importantly when they become influential, is another question. Again, this is a fairly common problem with any social experiment, including natural social experiments; see various forms of brain drain and social flight.
- A/B testing tends to focus on short term changes and behaviours, which may mask longer-term outcomes. This has some overlap with the above, but also with subjects' general response to change. See the classic case of this in the Hawthorne Effect (<https://en.wikipedia.org/wiki/Hawthorne_effect>).
The upshot is that A/B testing can be valid and useful, but that experimental design, particularly in the case of social and psychological experiments, where subject feed back into the study and its methodology itself, is exceptionally thorny.
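To make the self-selection point above concrete, here's a toy simulation (all distributions and parameters invented): users carry a random "annoyance budget", each round of optimization spends a bit of it, and anyone pushed over budget leaves and never shows up in later test populations.

```python
# Toy model of survivorship bias in repeated A/B testing.
# All parameters here are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
tolerance = rng.normal(1.0, 0.3, size=100_000)  # per-user annoyance budget
annoyance = np.zeros_like(tolerance)

for round_no in range(1, 11):  # ten successive "optimizations"
    annoyance += rng.uniform(0.0, 0.15, size=annoyance.size)
    stayed = annoyance < tolerance  # budget exceeded -> user quits for good
    tolerance, annoyance = tolerance[stayed], annoyance[stayed]
    print(f"round {round_no}: {tolerance.size} users left, "
          f"mean tolerance {tolerance.mean():.2f}")
```

The surviving pool shrinks every round and its mean tolerance drifts upward, so each successive experiment measures an increasingly unrepresentative population.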
I'd compare this to how evaporation rate increases with temperature, as more particles find themselves with enough energy to escape the liquid.
From my personal experience, even if I can tolerate a lot of UX abuse, each such "optimization" lowers my threshold of switching to a competitor. Software in general, and SaaS specifically, resists commoditization, but every now and then an actual alternative to a product/service I'm using shows up - and whether or not I switch (and when) is correlated with how much I resent the incumbent for their UX "improvements".
I'd add one bullet point to your list:
- Unlike regular scientific experimentation, A/B testing is a methodology spread primarily in business circles through the usual hype channels. That is, the average practitioner is not qualified to execute it correctly, which is one of the reasons I see A/B testing more as a tool to launder arbitrary decisions. Because the consequences of doing it wrong are typically not immediately apparent or obvious, both companies and customers suffer (and a vast space for fraudsters is created).
I'm in a charitable mood, so I'm not passing judgement on people for not having a PhD-level understanding of statistics - just pointing out that, to a much larger degree than in the sciences (even soft ones, which suffer some of the same structural problems), there's little pressure to do such tests correctly (and there are lots of ways to make money or status by doing them without regard for correctness).
From what I hear, a common way of executing an A/B test badly and getting bullshit results is terminating the test early when it shows the relevant metrics improving for the test group, vs. running it longer if no big improvements are observed (or the metrics start getting worse for the test group). This biases the experiment towards false positives. The problem was big enough that there was a debacle around Optimizely a few years ago, whose UI was accused of encouraging this early termination of tests. The cynical take I'm still somewhat partial to is that it wasn't an accident (if not done on purpose, then possibly... a result of an A/B test!): false positives make the (statistically naive) users feel they're getting more value from Optimizely than they actually are.
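To see why this peeking inflates false positives, here's a toy simulation under the null hypothesis (both variants identical, all numbers made up) where the experimenter stops the moment p dips below 0.05:

```python
# Simulate "stop the test as soon as it looks significant" under H0.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
trials, batches, batch_size, p_click = 1_000, 20, 200, 0.05
false_positives = 0

for _ in range(trials):
    a_clicks = b_clicks = n = 0
    for _ in range(batches):  # peek at the results after every batch
        a_clicks += rng.binomial(batch_size, p_click)
        b_clicks += rng.binomial(batch_size, p_click)
        n += batch_size
        table = [[a_clicks, n - a_clicks], [b_clicks, n - b_clicks]]
        if chi2_contingency(table)[1] < 0.05:  # "winner!" -- ship it
            false_positives += 1
            break

# With no real effect, a single fixed-horizon test would be wrong ~5% of
# the time; repeated peeking pushes the rate well above that.
print(false_positives / trials)
```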
There's a reason that the technical term for "A/B testing in SaaS products" is gaslighting.
O hai werd uv yeer!
Sociologists will rarely have access to such a large participant pool under near-ideal experimental conditions, with such good ways to observe behavior. And the stuff you have to keep in mind when running experiments is not terribly complex: a bit of statistics, a few things you absolutely have to get right, that's it.
Obviously there are reasons why A/B tests are often not run rigorously: statistical illiteracy, pressure to get things done quickly, and pressure to produce tangible results as often as possible. All three might lead you to run underpowered experiments with too few participants and to stop testing early, which produces too many false positives. However, abandoning experiments altogether (and instead just releasing new stuff and observing the reaction) isn't an improvement that leads to better outcomes.
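For a sense of what "underpowered" means in practice, here's the standard two-proportion sample-size formula with made-up rates; honestly detecting a small lift takes a surprising number of participants:

```python
# Sample size per group for a two-sided two-proportion z-test.
# The baseline and target rates below are invented for illustration.
from scipy.stats import norm

p1, p2 = 0.048, 0.054       # baseline and hoped-for click-through rates
alpha, power = 0.05, 0.80   # 5% false-positive rate, 80% power

z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(power)
n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"~{n:,.0f} users per group")  # roughly 21,000 here
```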
A case could be made that A/B testing is insufficiently rigorous given specific goals, resources, limitations, context, etc. But that case isn't being made here.
Thanks though. I've narrowed my original comment to more accurately represent the scope I am referring to.
Science is just observation and experimentation.
Science doesn't dictate how you do the above. Now, if someone were to find it impossible to reproduce your findings, that would just suggest bad science.
For example, if your metric is "time spent interacting on the platform" and the rollout of the feature being tested ends up lengthening page load times, users will spend more time on the site simply because they're waiting for pages to load. The metric goes up, and management decides the feature is a good idea.
That's not enough. If you don't include some sense of both 'systematic' and 'rigorous' (and yes, these terms are slippery), you aren't doing science.
I mean, I'm sure the parameters to the math are proprietary. But the basic math seems simple enough.
Trying to tease out the pieces that aren't coupled to Twitter's User class is probably more effort than it's worth