Ddml: Double/Debiased Machine Learning in Stata

Achim Ahrens, Christian Hansen, Mark E. Schaffer, Thomas Wiemann
SSRN Electronic Journal

We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine… 
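The cross-fitting idea behind DDML can be illustrated with a minimal sketch for a partially linear model, y = θ·d + g(x) + e. This is not the ddml implementation: the function names (`fit_binned_mean`, `dml_plm`, `simulate`), the simulated data, and the toy binned-mean learner are all assumptions made for illustration, and any supervised learner could take the learner's place. The nuisance functions E[y|x] and E[d|x] are fitted on the training folds, residuals are formed on the held-out fold, and θ is estimated by regressing the y residuals on the d residuals.

```python
import math
import random


def bin_index(x, lo, hi, n_bins):
    """Map x in [lo, hi] to one of n_bins equal-width bins."""
    k = int((x - lo) / (hi - lo) * n_bins)
    return min(max(k, 0), n_bins - 1)


def fit_binned_mean(xs, ts, lo, hi, n_bins):
    """Toy nonparametric learner (assumption): mean of the target within each bin of x."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, t in zip(xs, ts):
        k = bin_index(x, lo, hi, n_bins)
        sums[k] += t
        counts[k] += 1
    overall = sum(ts) / len(ts)  # fallback for bins with no training data
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[bin_index(x, lo, hi, n_bins)]


def dml_plm(y, d, x, n_folds=5, n_bins=20, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(x) + e."""
    n = len(y)
    lo, hi = min(x), max(x)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    res_y, res_d = [0.0] * n, [0.0] * n
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        x_tr = [x[i] for i in train]
        # Nuisance functions are learned without the held-out fold.
        l_hat = fit_binned_mean(x_tr, [y[i] for i in train], lo, hi, n_bins)
        m_hat = fit_binned_mean(x_tr, [d[i] for i in train], lo, hi, n_bins)
        for i in fold:
            res_y[i] = y[i] - l_hat(x[i])
            res_d[i] = d[i] - m_hat(x[i])
    # Residual-on-residual regression without intercept.
    return sum(a * b for a, b in zip(res_d, res_y)) / sum(a * a for a in res_d)


def simulate(n=4000, theta=0.5, seed=1):
    """Simulated data (assumption): d and y both depend nonlinearly on x."""
    rng = random.Random(seed)
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    d = [math.sin(xi) + rng.gauss(0.0, 1.0) for xi in x]
    y = [theta * di + xi ** 2 + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]
    return y, d, x


y, d, x = simulate()
theta_hat = dml_plm(y, d, x)
print(round(theta_hat, 3))  # should land near the simulated theta of 0.5
```

In practice the toy learner would be replaced by the machine learners the package supports; the sketch only shows how sample splitting keeps each observation's nuisance predictions independent of that observation.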

Double/Debiased Machine Learning for Treatment and Structural Parameters

This work revisits the classic semiparametric problem of inference on a low-dimensional parameter θ_0 in the presence of high-dimensional nuisance parameters η_0. It proves that DML delivers point estimators that concentrate in an N^(-1/2)-neighborhood of the true parameter values and are approximately unbiased and normally distributed, which allows construction of valid confidence statements.

PDSLASSO: Stata module for post-selection and post-regularization OLS or IV estimation and inference

The pdslasso and ivlasso commands implement weak-identification-robust hypothesis tests and confidence sets using the Chernozhukov et al. (2013) sup-score test, in support of estimating the causal impact of one or more variables of interest.

pystacked: Stacking generalization and machine learning in Stata

pystacked implements stacked generalization (Wolpert, 1992) for regression and binary classification via Python's scikit-learn, providing an easy-to-use API for scikit-learn's machine learning algorithms.
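As a rough illustration of stacked generalization for regression, the sketch below combines two toy base learners using a weight chosen on out-of-fold predictions. It does not reproduce pystacked or scikit-learn: the learners, the function names (`fit_linear`, `fit_binned`, `cv_predictions`, `stack_two`), the simulated data, and the restriction to two learners with weights in [0, 1] that sum to one are all assumptions made to keep the example short.

```python
import math
import random


def fit_linear(xs, ys):
    """Base learner 1 (assumption): simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_binned(xs, ys, n_bins=10, lo=-2.0, hi=2.0):
    """Base learner 2 (assumption): mean of y within equal-width bins of x."""
    def idx(x):
        return min(max(int((x - lo) / (hi - lo) * n_bins), 0), n_bins - 1)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        sums[idx(x)] += y
        counts[idx(x)] += 1
    overall = sum(ys) / len(ys)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[idx(x)]


def cv_predictions(fit, xs, ys, n_folds=5):
    """Out-of-fold predictions: each point is predicted by a model not trained on it."""
    n = len(xs)
    preds = [0.0] * n
    for k in range(n_folds):
        train = [i for i in range(n) if i % n_folds != k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(k, n, n_folds):
            preds[i] = model(xs[i])
    return preds


def stack_two(ys, p1, p2):
    """Weight w in [0, 1] on p1 (and 1 - w on p2) minimising squared error."""
    num = sum((y - b) * (a - b) for y, a, b in zip(ys, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return min(max(num / den, 0.0), 1.0) if den > 0 else 0.5


def mse(ys, ps):
    return sum((y - p) ** 2 for y, p in zip(ys, ps)) / len(ys)


rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(1000)]
ys = [math.sin(2 * x) + 0.5 * x + rng.gauss(0.0, 0.3) for x in xs]

p_lin = cv_predictions(fit_linear, xs, ys)
p_bin = cv_predictions(fit_binned, xs, ys)
w = stack_two(ys, p_lin, p_bin)
p_stack = [w * a + (1 - w) * b for a, b in zip(p_lin, p_bin)]
print("weight on linear learner:", round(w, 3))
print("CV MSE linear/binned/stacked:", mse(ys, p_lin), mse(ys, p_bin), mse(ys, p_stack))
```

Because putting all weight on either learner is a feasible choice, the stacked combination cannot do worse than the better base learner on the out-of-fold predictions used to pick the weight. With more learners, the same step becomes a constrained least-squares problem over a weight vector.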

Program evaluation and causal inference with high-dimensional data

This paper shows that a key ingredient enabling honest inference is the use of orthogonal or doubly robust moment conditions in estimating certain reduced form functional parameters, and provides results on honest inference for (function-valued) parameters within this general framework where any high-quality, modern machine learning methods can be used to learn the nonparametric/high-dimensional components of the model.

Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments

This paper presents an approach to estimating structural parameters in the presence of many instruments and controls, based on methods for estimating sparse high-dimensional models. It extends Belloni, Chernozhukov and Hansen (2014), which covers selection of controls in models where the variable of interest is exogenous conditional on observables.

The random forest algorithm for statistical learning

This article overviews the random forest algorithm and illustrates its use with two examples. It also introduces a corresponding new command, rforest, which is used to predict the log-scaled number of shares of online news articles.

Omitted Variable Bias of Lasso-Based Inference Methods: A Finite Sample Analysis

It is shown that Lasso-based inference methods can exhibit substantial omitted variable biases (OVBs) due to Lasso not selecting relevant controls, and relying on the existing asymptotic inference theory can be problematic in empirical applications.


We show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, oracle inequalities for the prediction risk

Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain

This paper provides a fully data-driven method for choosing the user-specified penalty required to obtain LASSO and Post-LASSO estimates, and establishes its asymptotic validity under non-Gaussian, heteroscedastic disturbances.

Generalized random forests

This paper proposes a flexible, computationally efficient algorithm for growing generalized random forests, an adaptive weighting function derived from a forest designed to express heterogeneity in the specified quantity of interest, and an estimator for their asymptotic variance that enables valid confidence intervals.