Corpus ID: 244463155

Datasets for Online Controlled Experiments

C. H. Bryan Liu, Ângelo Cardoso, P. V. Couturier, Emma J. McCoy

Online Controlled Experiments (OCE) are the gold standard to measure impact and guide decisions for digital products and services. Despite many methodological advances in this area, the scarcity of public datasets and the lack of a systematic review and categorization hinder its development. We present the first survey and taxonomy for OCE datasets, which highlight the lack of a public dataset to support the design and running of experiments with adaptive stopping, an increasingly popular… 

A Bayesian Model for Online Activity Sample Sizes
This work presents a simple but novel Bayesian method for predicting the number of additional individuals who will participate during a subsequent period and illustrates the performance of the method in predicting sample size in online experimentation.


ASSISTments Dataset from Multiple Randomized Controlled Experiments
This work presents a dataset generated from 22 previously run and currently running randomized controlled experiments inside the ASSISTments online learning platform, giving researchers the opportunity to mine ASSISTments data in a convenient format across multiple experiments at the same time.
Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas
This work provides a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address challenges in metric analytics.
Online controlled experiments at large scale
This work discusses why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits, and designs a highly scalable system able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users.
Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods
This work develops a framework for understanding how dependence affects uncertainty in user-item experiments and evaluates how bootstrap methods that account for differing levels of dependence perform in practice, and highlights the importance of analysis of inferential methods for complex dependence structures common to online experiments.
Focusing on the Long-term: It's Good for Users and Business
This work develops an experiment methodology and uses it to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads, and creates a model that uses metrics measurable in the short term to predict the long term.
Online Controlled Experimentation at Scale: An Empirical Survey on the Current State of A/B Testing
The findings show that, among others, companies typically develop in-house experimentation platforms, that these platforms are of various levels of maturity, and that designing key metrics - Overall Evaluation Criteria - remains the key challenge for successful experimentation.
Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments
This work proposes an objective Bayesian A/B testing framework that aims to bring together the best of Bayesian and frequentist methods, and successfully applies the method to Bing, using thousands of experiments to establish the priors.
A Large Scale Benchmark for Uplift Modeling
This work releases a publicly available collection of 25 million samples from a randomized control trial, scaling up previously available datasets by a factor of 590, and shows that the dataset size now makes it possible to reach statistical significance when evaluating baseline methods on the most challenging target.
How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments
This paper describes how the authors mined historical A/B tests and identified the most common causes of invalid tests, ranging from biased design and self-selection bias to attempts to generalize A/B test results beyond the experiment population and time frame, and how they developed scalable algorithms to automatically detect invalid A/B tests and diagnose the root cause of invalidity.
From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks
This work describes in depth the experimentation platform at LinkedIn and how it is built to handle each step of the A/B testing process, from designing and deploying experiments to analyzing them.