"Network Experimentation at Scale" from Facebook describes how difficult this problem is. Most A/B testing frameworks don't reach that level of sophistication, so it does make some sense to just ship things if you don't have time to build out something like that. (Disclosure: I worked at Twitter long ago.)