A team of researchers examined 2,101 commercial experiments run with A/B testing tools like Google Optimize, Mixpanel, Monetate and Optimizely, using regression analysis to detect whether p-hacking (previously), a statistical sleight-of-hand that makes it look like you've found a valid cause-and-effect relationship when you haven't, had taken place.
They found that 57% of experimenters were p-hacking: halting an experiment as soon as its initial hypothesis looked confirmed, rather than completing the planned run and risking the discovery of disconfirming data.
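It's easy to see why this kind of "optional stopping" is cheating. Here's a minimal simulation (the parameters, a 10% baseline conversion rate with a peek every 50 visitors, are illustrative assumptions, not taken from the paper): both arms are identical, so any "significant" result is a false positive, yet peeking repeatedly and stopping at the first p < 0.05 produces far more than the nominal 5% of false discoveries.

```python
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test p-value (normal approximation)."""
    pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pool * (1 - pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

def run_aa_test(rng, base_rate=0.10, n_max=500, peek_every=50, alpha=0.05):
    """One A/A test (no true effect). Returns whether a 'peeking' experimenter
    and a fixed-horizon experimenter would each declare significance."""
    conv_a = conv_b = 0
    peeked_sig = False
    for i in range(1, n_max + 1):
        conv_a += rng.random() < base_rate
        conv_b += rng.random() < base_rate
        # Peek periodically; a p-hacker would stop here the first time
        # the result crosses the significance threshold.
        if i % peek_every == 0 and z_test_p(conv_a, i, conv_b, i) < alpha:
            peeked_sig = True
    final_sig = z_test_p(conv_a, n_max, conv_b, n_max) < alpha
    return peeked_sig, final_sig

rng = random.Random(42)
n_sims = 2000
peek_hits = final_hits = 0
for _ in range(n_sims):
    peeked, final = run_aa_test(rng)
    peek_hits += peeked
    final_hits += final

print(f"false positive rate, fixed horizon:        {final_hits / n_sims:.3f}")
print(f"false positive rate, stop-on-significance: {peek_hits / n_sims:.3f}")
```

The fixed-horizon rate lands near the advertised 5%, while the stop-at-first-significance rate is several times higher, which is exactly the inflated false discovery rate the paper measures in the field.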
The researchers hypothesize that the cheating stems from poor statistical training, from tools whose design encourages bad statistical practice, and from the desire to please your boss, either by proving that you had an idea that was borne out by data or by proving that your boss was right when they pronounced that things would work a certain way.
The behavior of experimenters in our data seems to deviate from profit maximization. If the
experiments are run to maximize learning about effect sizes while ignoring short term profits, we
should not observe p-hacking that inflates FDRs. If, in contrast, experiments are run to maximize
profits, we should not observe experiments with larger effect sizes being terminated later, as this
prevents the most effective intervention from being rolled out quickly.
Finally, on a more positive note, we find that stopping an experiment early or late is not driven
solely by p-hacking. Specifically, we find a pronounced day-of-the-week pattern, a 7-day cycle in
the first 35 days, and a tendency to terminate sooner when the observed effects are small rather than large.
p-Hacking and False Discovery in A/B Testing [Ron Berman, Leonid Pekelis, Aisling Scott and Christophe Van den Bulte/SSRN]
(via Four Short Links)
(Image: http://www.beeze.de, CC-BY)