Interesting paper by some fairly serious MIT-related econometrics/statistics people (let’s call this paper DML) on how to estimate some low-dimensional parameters of interest in the presence of high-dimensional nuisance parameters. The intuition is quite nice, though the actual theoretical results are maybe not as generally applicable as they might appear at first.

I. Setup and results

  • we want to estimate some finite-dimensional parameter \(\theta\)
  • there is some very complicated nuisance parameter \(\eta\) that we don’t care about
    • we’ll want to use fancy machine learning stuff to deal with \(\eta\)
  • the true value of \(\theta\) and \(\eta\) are \(\theta_0\) and \(\eta_0\)
  • we have some score function \(\phi\) whose expectation is zero only at the true parameter values: \(\begin{equation} \mathbb{E}[\phi(X, \theta_0, \eta_0)] = 0 \tag{1} \label{momentcondition} \end{equation}\) where data \(X\) is generated IID from some distribution \(\mathbb{P}\), and \(\mathbb{E}\) denotes the expectation wrt \(\mathbb{P}\)
  • for a concrete example, think about the case of a simple linear regression:
    • \(\phi\) would be the derivative of the least squares objective
    • \(\theta\) and \(\eta\) would be the coefficients
    • \eqref{momentcondition} would be the first order conditions, which hold only at the true parameter values
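
To make that concrete, write \(X = (Y, D, Z)\) with outcome \(Y\), a regressor of interest \(D\) (coefficient \(\theta\)), and a control \(Z\) (coefficient \(\eta\)); this decomposition is just my notation for the example. Taking \(\phi\) to be (\(-\tfrac{1}{2}\) times) the derivative of the least squares objective \(\mathbb{E}[(Y - D\theta - Z\eta)^2]\), i.e. stacking \((Y - D\theta - Z\eta)D\) and \((Y - D\theta - Z\eta)Z\), \eqref{momentcondition} becomes the familiar first order conditions

\[\mathbb{E}[(Y - D\theta_0 - Z\eta_0)\,D] = 0, \qquad \mathbb{E}[(Y - D\theta_0 - Z\eta_0)\,Z] = 0.\]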

DML recommends this procedure:

  1. use a score \(\phi\) that satisfies a Neyman orthogonality condition:
    • the directional derivative \(\partial_{\eta}\mathbb{E}[\phi(X,\theta_0,\eta_0)][\eta-\eta_0] = 0\) for every \(\eta\)
      • with \(\partial_{\eta}f(z_0)[z-z_0] := \partial_t f(z_0+t(z-z_0)) \big\rvert_{t=0}\)
    • often, you can transform your \(\phi\) so that it satisfies Neyman orthogonality
  2. do cross-fitting, where you split your data into a main sample and an auxiliary sample, and:
    • use the auxiliary sample to train some fancy machine learning model to predict the nuisance parameter \(\eta\)
    • on the main sample:
      • use this trained model to get \(\hat{\eta}\)
      • then plug this into the empirical analogue of \eqref{momentcondition} and solve for \(\theta\), so that \(\mathbb{E}_n[\phi(X, \hat{\theta}, \hat{\eta})] = 0\) implicitly defines \(\hat{\theta}\) (where \(\mathbb{E}_n\) here denotes averaging over the main sample)
  3. this \(\hat{\theta}\) will then be \(\sqrt{n}\)-consistent for \(\theta_0\)
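
To make steps 1–3 concrete, here is a minimal sketch for the paper's leading example, the partially linear model \(Y = D\theta_0 + g_0(Z) + \varepsilon\), using a Neyman-orthogonal "partialling-out" score \(\phi = (Y - \ell(Z) - \theta(D - m(Z)))(D - m(Z))\) with nuisances \(\ell(Z) = \mathbb{E}[Y \mid Z]\) and \(m(Z) = \mathbb{E}[D \mid Z]\). The choice of random forests as the ML learner and the single main/auxiliary split are just illustrative (the paper itself swaps the roles of the folds and averages):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def dml_plm(Y, D, Z, seed=0):
    """Cross-fit estimate of theta in the partially linear model
    Y = D*theta + g(Z) + eps, using the orthogonal (partialling-out) score
    phi = (Y - l(Z) - theta*(D - m(Z))) * (D - m(Z))."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    aux = rng.permutation(n) < n // 2   # auxiliary sample: fit the nuisances here
    main = ~aux                         # main sample: solve the moment condition here

    # step 2: train ML models for the nuisance parameters on the auxiliary sample
    l_hat = RandomForestRegressor().fit(Z[aux], Y[aux])   # l(Z) = E[Y | Z]
    m_hat = RandomForestRegressor().fit(Z[aux], D[aux])   # m(Z) = E[D | Z]

    # plug the fitted nuisances into the empirical moment condition on the
    # main sample and solve E_n[phi(X, theta, eta_hat)] = 0 for theta
    y_res = Y[main] - l_hat.predict(Z[main])   # Y - l_hat(Z)
    d_res = D[main] - m_hat.predict(Z[main])   # D - m_hat(Z)
    return np.sum(d_res * y_res) / np.sum(d_res ** 2)

# quick simulated check with theta_0 = 1 (made-up DGP)
rng = np.random.default_rng(1)
n = 4000
Z = rng.normal(size=(n, 5))
D = np.sin(Z[:, 0]) + 0.5 * rng.normal(size=n)
Y = 1.0 * D + np.cos(Z[:, 1]) + rng.normal(size=n)
print(dml_plm(Y, D, Z))   # should land reasonably close to 1
```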

These two things (Neyman orthogonality and cross-fitting) are what give us \(\sqrt{n}\)-consistency:

  • Neyman orthogonality makes \(\phi\) insensitive to the nuisance parameter \(\eta\) near the truth, so that plugging in an estimate of \(\eta\) won’t hurt too much.
  • Cross-fitting (as opposed to using the same data for both steps above) keeps \(\hat{\eta}\) honest: the main sample, where we plug \(\hat{\eta}\) into \(\phi\) and estimate \(\hat{\theta}\), isn't used when training the fancy ML model for \(\eta\), so \(\hat{\eta}\) shouldn't suffer from much overfitting on that sample.

We’ll give a slightly deeper intuitive understanding for where these two conditions come from, as well as some considerations on how applicable this might be in practice.

II. Intuition

II.1. First, without the nuisance parameter:

If we knew the true value of the nuisance parameter \(\eta_0\), we could just plug it into the empirical analogue of \eqref{momentcondition} and solve for \(\theta\) to get an estimate \(\hat{\theta}_n\):

\[\mathbb{E}_n[\phi(X, \hat{\theta}_n, \eta_0)] = 0\]

For sufficiently large \(n\), \(\hat{\theta}_n\) will be pretty close to \(\theta_0\) and \(\phi\) will be approximately linear in \(\theta\) near \(\theta_0\), so we can Taylor expand and get the asymptotic distribution of \(\hat{\theta}_n\):

\[\begin{align} 0 &= \mathbb{E}_n\phi(X, \hat{\theta}_n, \eta_0) \approx \mathbb{E}_n\phi(X, \theta_0, \eta_0) + \partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \eta_0)](\hat{\theta}_n-\theta_0) \newline &\Rightarrow \partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \eta_0)] \sqrt{n} (\hat{\theta}_n-\theta_0) \approx - \sqrt{n}\mathbb{E}_n\phi(X, \theta_0, \eta_0) \newline &\Rightarrow \sqrt{n} (\hat{\theta}_n-\theta_0) \rightarrow_d N(0, J^{-1}\Omega J^{-1\prime}) \tag{2} \label{expansion0} \end{align}\]

where \(J:= \partial_{\theta}\mathbb{E}[\phi(X, \theta_0, \eta_0)]\) and \(\Omega := \mathrm{Var}\left[\phi(X, \theta_0, \eta_0)\right]\).

That is, assuming we know the true nuisance parameter \(\eta_0\), our estimator \(\hat{\theta}_n\) is \(\sqrt{n}\)-consistent, which is pretty nice.
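
As a quick sanity check (my example, not the paper's): with \(\phi(X, \theta) = X - \theta\) and no nuisance parameter at all, \(\hat{\theta}_n\) is just the sample mean, \(J = -1\), \(\Omega = \mathrm{Var}[X]\), and \eqref{expansion0} reduces to the ordinary central limit theorem:

\[\sqrt{n}(\hat{\theta}_n - \theta_0) \rightarrow_d N(0, \mathrm{Var}[X])\]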

II.2. Now, with the nuisance parameter:

Unfortunately:

  • we don’t know \(\eta_0\)
  • we still need to estimate \(\theta_0\)
  • we’d still like to get this same \(\sqrt{n}\)-consistency

To estimate \(\theta_0\) in this case, probably the most natural thing to do is:

  1. get a preliminary estimate \(\hat{\eta}_n\) of \(\eta\)
  2. pretend \(\hat{\eta}_n \approx \eta_0\), and then do the same thing we did before where we solve for \(\hat{\theta}_n\) using the empirical analogue of \eqref{momentcondition}: \(\begin{equation} \mathbb{E}_n[\phi(X, \hat{\theta}_n, \hat{\eta}_n)] = 0 \tag{3} \label{empiricalmomentcondition} \end{equation}\)

Note that so long as this nuisance parameter estimate \(\hat{\eta}_n\) converges to the truth and \(\phi\) is smooth, the solution of the empirical moment condition should depend continuously on \(\hat{\eta}_n\), so we should still get \(\hat{\theta}_n\rightarrow_p\theta_0\). So we can expand \eqref{empiricalmomentcondition} around \(\theta_0\):

\[\begin{align} 0 &= \mathbb{E}_n[\phi(X, \hat{\theta}_n, \hat{\eta}_n)] \approx \mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] + \partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] (\hat{\theta}_n-\theta_0) \newline &\Rightarrow \partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] (\hat{\theta}_n-\theta_0) \sqrt{n} \approx - \sqrt{n} \mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] \tag{4} \label{expansion1} \end{align}\]

Basically, this looks a lot like \eqref{expansion0}, except with some \(\hat{\eta}_n\)s instead of \(\eta_0\). If we could make this look exactly like \eqref{expansion0} as \(n\) gets big, then we would get \(\sqrt{n}\)-consistency of \(\hat{\theta}_n\) even in this case where we don’t know the true \(\eta_0\) and have to plug in \(\hat{\eta}_n\).

  • the left hand side is easy:
    • \(\partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)]\) is just an average
    • so it should easily go to \(J\) as \(n\) gets big so long as \(\phi\) is smooth and \(\hat{\eta}_n\) converges to \(\eta_0\)
    • so the left side of \eqref{expansion1} will look like the left side of \eqref{expansion0} as \(n\) gets big
  • the right hand side \(\sqrt{n} \mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)]\) is more involved:
    • in order to make this resemble the right side of \eqref{expansion0}, we need \(\sqrt{n} \mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] - \sqrt{n} \mathbb{E}_n[\phi(X, \theta_0, \eta_0)] \rightarrow_p 0\)
    • we could use a stochastic equicontinuity argument if we're using non-ML methods to get \(\hat{\eta}_n\) (see Andrews 1994)
    • but that argument requires the set of potential \(\hat{\eta}_n\) to satisfy a Donsker-type condition (e.g., finite VC dimension), whereas in modern ML applications we'll typically fit increasingly complex functions as the sample size increases, so this kind of classical argument doesn't work

So instead, let’s continue with \eqref{expansion1} and expand the RHS:

\[\begin{align} & \partial_{\theta}\mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] (\hat{\theta}_n-\theta_0) \sqrt{n} \newline &\approx - \sqrt{n} \mathbb{E}_n[\phi(X, \theta_0, \hat{\eta}_n)] \newline &\approx - \sqrt{n} \mathbb{E}_n \left[ \phi(X, \theta_0, \eta_0) + \partial_{\eta}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0] + \frac{1}{2}\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0] \right] \tag{5} \label{expansion2} \end{align}\]

where:

  • \(\partial_{\eta}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]\) is the directional derivative of \(\phi\) wrt \(\eta\) in the direction \(\hat{\eta}_n-\eta_0\)
  • \(\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]\) is the second derivative in that direction
  • we’ve included the second derivative here since it’s not ex-ante obvious that it’ll vanish

In order to make \eqref{expansion2} look like \eqref{expansion0} (and thus get \(\sqrt{n}\)-consistency for \(\hat{\theta}_n\)), we just need \(\sqrt{n} \mathbb{E}_n[\partial_{\eta}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]]\) and \(\sqrt{n} \mathbb{E}_n[\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]]\) to both go to 0. Here’s where Neyman orthogonality and cross-fitting come in:

  1. the Neyman (near)-orthogonality condition pretty much just amounts to assuming that \(\sqrt{n} \mathbb{E}_n[\partial_{\eta}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]]\) goes to 0
  2. Cross-fitting + the assumption that \(\lvert\lvert\hat{\eta}_n-\eta_0\rvert\rvert_2 = o_p(n^{-1/4})\) on hold-out data gives us \(\sqrt{n} \mathbb{E}_n[\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]]\) going to 0
    • the \(\lvert\lvert\hat{\eta}_n-\eta_0\rvert\rvert_2 = o_p(n^{-1/4})\) guarantees that \(\mathbb{E}_n[\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\hat{\eta}_n-\eta_0]]\) will be \(o_p(n^{-1/2})\)
      • to see this, just consider the case where \(\eta\) is 1-dimensional, where this second directional derivative is just \(\partial_{\eta^2}\phi(X, \theta_0, \eta_0) (\hat{\eta}_n-\eta_0)^2\), so its average is \(O_p((\hat{\eta}_n-\eta_0)^2) = o_p(n^{-1/2})\)
      • basically the same thing in general so long as \(\phi\) is smooth
    • the cross-fitting just means that we estimate \(\hat{\eta}_n\) on an auxiliary sample that we then don't re-use when estimating \(\theta\), so that we actually get this \(o_p(n^{-1/4})\) rate for \(\hat{\eta}_n\) on the main sample where we plug \(\hat{\eta}_n\) in and use it to estimate \(\theta\)
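
To see what the Neyman orthogonality condition buys us in a concrete case, take the partially-linear-model score from the code sketch in section I, \(\phi = (Y - \ell(Z) - \theta(D - m(Z)))(D - m(Z))\), with nuisance \(\eta = (\ell, m)\) and model \(Y = D\theta_0 + g_0(Z) + \varepsilon\), \(D = m_0(Z) + V\), where \(\mathbb{E}[\varepsilon \mid D, Z] = 0\) and \(\mathbb{E}[V \mid Z] = 0\) (so \(Y - \ell_0(Z) = \theta_0 V + \varepsilon\)). Perturbing each nuisance component in an arbitrary direction and differentiating at the truth:

\[\begin{align} \partial_t \, \mathbb{E}\big[(Y - \ell_0(Z) - t\Delta\ell(Z) - \theta_0(D - m_0(Z)))(D - m_0(Z))\big] \big\rvert_{t=0} &= -\mathbb{E}\big[\Delta\ell(Z)\, V\big] = 0 \newline \partial_t \, \mathbb{E}\big[(Y - \ell_0(Z) - \theta_0(D - m_0(Z) - t\Delta m(Z)))(D - m_0(Z) - t\Delta m(Z))\big] \big\rvert_{t=0} &= \mathbb{E}\big[(\theta_0 V - \varepsilon)\,\Delta m(Z)\big] = 0 \end{align}\]

both by iterated expectations, since \(\mathbb{E}[V \mid Z] = 0\) and \(\mathbb{E}[\varepsilon \mid D, Z] = 0\). By contrast, the naive score \((Y - D\theta - g(Z))D\) fails this check: its derivative wrt \(g\) in direction \(\Delta g\) is \(-\mathbb{E}[\Delta g(Z)\, D] = -\mathbb{E}[\Delta g(Z)\, m_0(Z)]\), which is generally nonzero, so errors in \(\hat{g}\) feed into \(\hat{\theta}\) at first order.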

A bit more intuition:

  • that we need a Neyman orthogonality condition seems reasonable:
    • we’re plugging in an estimate of the nuisance parameter \(\eta\) to stand in for the truth \(\eta_0\) when we estimate \(\theta\)
    • so if the equation we solve to estimate \(\theta\) depends on the value of this nuisance parameter then small errors in the nuisance parameter will mess things up
    • Neyman orthogonality just says that this dependence vanishes to first order near the truth, so small errors in \(\hat{\eta}\) only have second-order effects
  • estimating \(\eta\) on an auxiliary data set separate from the data we use to estimate \(\theta\) is basically just there to control for overfitting
    • if we used the same data for estimating the \(\eta\) as \(\theta\), then the expectations in \eqref{expansion2} would all be relative to the data that \(\eta\) was estimated on
    • as a result, these in-sample estimates of \(\eta\) could overfit, and the error \(\hat{\eta}-\eta_0\) would be correlated with the data we're averaging over, so the expansion terms might not vanish at the required rate
    • as an aside: if we don’t do cross-fitting, but rather find other ways to limit in-sample overfitting, then everything should still be fine
      • the DML authors have some other work where they take this approach

III. Caveats

III.1. The set of ML algorithms DML theory applies to

In general, \(\sqrt{n}\)-consistency of \(\hat{\theta}_n\) requires that \(\lvert\lvert\hat{\eta}-\eta_0\rvert\rvert_2 = o_p(n^{-1/4})\). That is, whatever machine learning algorithm you’re using to approximate \(\eta\) has to exhibit \(L_2\) convergence at this rate. Even in the special case where the second derivative \(\mathbb{E}[\partial_{\eta^2}\phi(X, \theta_0, \eta_0)[\eta-\eta_0]]\) is 0 for all \(\eta\), you still need \(L_2\) convergence. The DML paper lists several examples of algorithms / problem settings with existing theoretical results that give this rate, but none of them are really that close to things you might do in practice:

  • Lasso for sparse models: this applies when the nuisance parameter is some linear function of a sparse set of parameters, which is a bit of an implausible assumption most of the time.
  • Neural networks: the Chen White neural network result is for shallow feedforward networks.
  • Boosting: the Luo Spindler result assumes the truth is a sparse linear model, and the base learners in the boosting algorithm are univariate linear regressions, which is quite different from the popular tree boosting stuff that people often do in practice.
  • Trees/forests: the Wager Walther concentration result applies to convergence of a trained tree to the best theoretical tree with the same splits, and doesn’t say anything about approximating some true conditional mean.

Also, I assume the Wager Athey random forest result is not mentioned because that convergence result is pointwise rather than \(L_2\).

So, DML doesn’t actually provide much in the way of theoretical guarantees for estimating the nuisance parameter \(\eta\) via ML methods that provide actually competitive predictive performance, e.g. tree boosting / random forests / deeper neural networks, since the necessary convergence results don’t yet exist for these methods.

III.2. The Neyman orthogonalization procedure

The DML procedure requires that we have some score \(\phi\) that satisfies Neyman orthogonality. This is generically not the case, so we would like to have some way of transforming an arbitrary \(\phi\) into something that satisfies Neyman orthogonality. The DML paper provides illustrations of how to do this for a variety of different cases, and applies them to specific leading examples that economists find interesting. However, it’s not entirely clear how generalizable these procedures are, especially when \(\eta\) is a complicated thing we’re approximating by general ML methods.

For example, the ‘concentrating-out’ approach for Neyman orthogonalization in the case of M-estimation with infinite-dimensional nuisance parameters (section 2.23 in the DML paper) requires computing the optimal nuisance parameter \(\eta\) as a function of the parameter of interest \(\theta\), and then computing the derivative wrt \(\theta\) of this function-valued mapping from \(\theta\) to the optimal \(\eta\). This approach is then applied to the leading example of interest (partially linear models with a nonparametric nuisance component), where the mapping from \(\theta\) to the optimal \(\eta\) is fairly easy to compute and differentiate, due to the particular functional forms involved. In less straightforward cases, it seems like this won't be so easy, or even possible.
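
For reference, here's roughly what that computation looks like in the easy case (this is my rendition of the idea, so the details may not match the paper's exactly). For the partially linear model with objective \(-\tfrac{1}{2}\mathbb{E}[(Y - D\theta - g(Z))^2]\), the optimal nuisance for a given \(\theta\) has the closed form \(g_\theta(Z) = \mathbb{E}[Y - D\theta \mid Z] = \ell(Z) - \theta m(Z)\), and differentiating the concentrated objective wrt \(\theta\) gives

\[\partial_\theta \left[ -\tfrac{1}{2}\big(Y - D\theta - \ell(Z) + \theta m(Z)\big)^2 \right] = \big(Y - \ell(Z) - \theta(D - m(Z))\big)\big(D - m(Z)\big),\]

which is exactly the Neyman-orthogonal 'partialling-out' score from before. The whole difficulty elsewhere is that the map from \(\theta\) to the optimal \(\eta\) generally has no closed form like this.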

The DML paper presents a variety of other ways to orthogonalize stuff, and works out some more economics-relevant examples. However, in general cases with high-dimensional nuisance parameters, it doesn't appear that there's a mechanical way to orthogonalize the score. As a result, it may be necessary to manually orthogonalize before the DML procedure can be applied.

References