Here’s a pretty fun paper on deep instrumental variables. Deep learning is actually totally ancillary to the core idea of the paper, which is that you can approach nonparametric instrumental variables as two supervised learning problems. Here, I’ll do two things:

### I. Background / theory

#### I.1. Instrumental variables (IV)

People are often interested in how things affect other things:

1. If you sell stuff, you might care about how prices affect number of sales.
2. If you run a social media company, you might be interested in how the number of ads you show affect how much people use your platform.
3. If you work in education policy, you might care about how education affects peoples’ lifetime earnings.

This general situation can be described as follows:

• $y = g(x) + \epsilon$, where $y$ and $x$ are observable but $\epsilon$ is not
• you’re trying to learn what $g$ is from your data.

To see how each of the examples above maps to the formula above:

1. $y$ is how many units something sold, $x$ is the price.
2. $y$ is the number of hours someone spends on your social media site, $x$ is the fraction of their news feed that’s ads.
3. $y$ is somebody’s total lifetime earnings, $x$ is their education level.

So, the most natural thing you can do to figure out how $x$ affects $y$ is to just… look at your data. As in, for a given $x$, you can look at all the various $y$s associated with that, and then just average them to get $\hat{g}(x):=\mathbb{E}_n[y|x]$. It’s clear that if $\mathbb{E}[\epsilon \mid x]=0$, then $\hat{g}(x)\rightarrow g(x)$ for all $x$, so that as you get more and more data you’ll eventually learn the function $g$.

This method of requires that $\mathbb{E}[\epsilon \mid x]=0$. Unfortunately, this assumption is often violated. For each of the above examples:

1. The higher priced items are also likely to be items that people can’t buy anywhere else, and as such may sell more, so that the ‘substitutability’ of an item drives a part of $\epsilon$, so that $\epsilon$ is correlated with price $x$.
2. Your social media platform’s machine learning team has already optimized ad targeting a bit, so that people that get shown more ads are naturally more patient, and more patient people spend more time on your platform, so that patience drives a part of $\epsilon$, so that $\epsilon$ is correlated with number of ads the person sees $x$.
3. People who are more intelligent tend to get more education, but intelligence itself also affects how much you earn, so that intelligence drives a part of $\epsilon$, so that $\epsilon$ is correlated with education level $x$.

To deal with this problem, people often turn to something called ‘instrumental variables’. An instrumental variable $z$ is something such that:

• $z$ affects $x$
• $z$ doesn’t affect $y$ via any other channels, only through its impact on $x$
• $z$ is independent of $\epsilon$

In the pricing example, you might run an experiment where you randomly discount some products by some percentage. The discount amount has an obvious affect on the price $x$, is independent of quality or anything else affecting $\epsilon$ (because it’s random), and (probably) doesn’t affect purchases except through price. So the discount percentage would be your instrument $z$.

With your instrument, you can basically do something like this:

• look at variation in $z$
• this variation in $z$ affects $x$, which then affects $y$
• so, the change in $y$ when $z$ varies must be due purely to changes $x$, since by assumption:
• $z$ has no effect on $\epsilon$, and
• $z$ has no other impact on $y$
• this allows you to deduce the impact of $x$ on $y$.

#### I.2. Nonparametric IV, more formally

(Note: you can see Newey Powell 2003 for more details. This is called ‘nonparametric’ because we’re treating the function $g$ as a general function, as opposed to being defined by a fixed set of parameters.)

Alright, let’s formally define our setup: $$y=g(x)+\epsilon,\quad \mathbb{E}[e\mid z]=0, \quad \text{possibly } \mathbb{E}[e\mid x]\neq 0 \tag{1} \label{ivdef}$$

• $y, x, z$ are all observable, $\epsilon$ is not
• $y$ is 1-dimensional, $x$ and $z$ can be multidimensional
• $y$ is your outcome variable, $x$ are covariates, $z$ are instruments

So, the formal version of the argument given at the end of the previous section is basically integrating the expression in \eqref{ivdef}: $$$\mathbb{E}[y\mid z] = \int dF(x\mid z) g(x) \tag{2} \label{fredholm}$$$ Some points:

• $\epsilon$ drops out because $\mathbb{E}[e\mid z]=0$.
• $F(x\mid z)$ is observable, since it’s just the distribution of $x$ given $z$.
• $\mathbb{E}[y\mid z]$ is observable, since it’s just the average of $y$ given $z$.
• So, the problem of figuring out what $g$ is amounts to solving for $g$ given this equation.

Alright, let’s do some examples of equation \eqref{fredholm} to fix intuition:

• Suppose $x=z$.
• In that case, $F(x\mid z)$ is just the delta distribution, supported purely on $x=z$.
• So, \eqref{fredholm} reduces to $\mathbb{E}[y\mid z] = g(x)\mid_{x=z}$, which is just $\mathbb{E}[y\mid x] = g(x)$.
• This makes sense, since if $x=z$ then there’s no need for an instrument. Your $x$ is independent of $\epsilon$. So you can just literally look at $\mathbb{E}[y\mid x]$ to get $g(x)$.
• Suppose $z$ has no impact on $x$, as in $F(x\mid z) = F(x)$ for all $z$.
• In that case, \eqref{fredholm} becomes $\mathbb{E}[y\mid z] = \int dF(x) g(x)$ which implies that $\mathbb{E}[y\mid z] = \mathbb{E}[g(x)]$.
• Thus, you’re not going to be able to learn $g(x)$ in this case (unless $g$ is constant).
• This is also intuitive: if your instrument $z$ doesn’t have any impact on your $x$, then it’s not possible for you to deduce the impact of $x$ on $y$ via this instrument because changing this instrument doesn’t change your $x$ at all.
• Suppose $g(x)=x\cdot\beta$, and also that $\mathbb{E}[x\mid z]=Az$ for some vector $\beta$ and $A$ where $A$ has rank $length(x)$.
• \eqref{fredholm} becomes $\mathbb{E}[y\mid z] = \int dF(x\mid z) x\cdot\beta$ which implies $\mathbb{E}[y\mid z] = \mathbb{E}[x\mid z]\cdot\beta = (Az)\cdot\beta$.
• You can estimate $A$ since $\mathbb{E}[x\mid z]=Az$ and you observe $\mathbb{E}[x\mid z]$ and $z$.
• You can estimate $\beta$ since you observe $Az$ and $\mathbb{E}[y\mid z]$ and $Az$ spans the same space as $x$ as you vary $z$.
• This is just your typical linear instrumental variables setup, where first you estimate how your $z$ affects your $x$ (which is the $A$), and then plug in these estimates of $x$ given $z$ (which is $Az$) to estimate $\beta$.

#### I.3. Identification? Estimation?

Given that we want to back out $g$ using \eqref{fredholm}, it would be nice if there were some identification conditions telling us whether/when there’s only a single function $g$ that satisfies \eqref{fredholm} out of some class of functions $\mathcal{G}$. Identification holds IFF a ‘completeness’ condition holds. Roughly this completeness condition amounts to a requirement that any random variable $\tilde{g}(x)$ for $\tilde{g}\in \mathcal{G}$ must somehow not exploit any information that’s present in $x$ but not in $z$:

• Note that any function $\tilde{g}$ that satisfies $\eqref{fredholm}$ must have $\mathbb{E}[y\mid z] = \int dF(x\mid z) \tilde{g}(x)$
• So $0 = \mathbb{E}[\tilde{g}(x)-g(x) \mid z]$.
• By definition $g$ is identified exactly when $\mathbb{E}[\tilde{g}(x) \mid z]=\mathbb{E}[g(x) \mid z]$ implies $\tilde{g}=g$. This last thing is the ‘completeness’ condition.

Let’s look again at the various examples above to understand this completeness condition:

• If $x=z$, then clearly fixing $z$ exactly pins down $g(x)=g(z)$, so this condtion holds, and as thus $g$ is identified.
• If $z$ has no impact on $x$, then any $\tilde{g}$ that averages out to $\mathbb{E}[g]$ is going to satisfy $\mathbb{E}[\tilde{g}(x) \mid z]=\mathbb{E}[g(x) \mid z]=\mathbb{E}[g(x)]$, so that we will not be able to identify any $g$ that’s not constant.
• If $g(x)=x\cdot\beta$, and also that $\mathbb{E}[x\mid z]=Az$ with $A$ full rank, then $x$ may contain more information than $z$, but the full rank condition guarantees that varying $z$ induces variation in $x$ in as many directions as $x$ itself, so that the $\mathbb{E}[x\mid z]$ spans the same space as $x$. Then, because $g=x\cdot\beta$ is linear, the extra information that $x$ has that’s not in $z$ is irrelevant, as variation in $\mathbb{E}[x\mid z]$ suffices to recover $\beta$, so that $g$ is identified.

There are some theoretical results on this completeness condition, though maybe not as extensive as we’d like.

• There are fairly simple situations where completeness fails (see Severini Tripathi 2006).
• But completeness is ‘generic’ in some sense so long as you have as many instruments $z$ as regressors $x$ (see Andrews 2011).
• The sense in which completeness is generic is an extension of the standard ‘measure-1 wrt Lebesgue’ definition extended to infinite dimensional spaces.
• The genericness is relative to a fairly expansive looking class of distributions (as in, consider the set of all distributions in this fairly expansive class, then the complete ones are generic).
• This seems like a pretty useful result that suggests we shouldn’t worry too much, though I’m not entirely convinced I trust my intuition about these kind of unnatural infinite-dimensional objects.
• Also, it’s impossible to perform hypothesis testing for whether completeness holds, based on observations of $x,z$ (see Canay Santos Shaikh 2013).
• Though this untestability is due to the presence of incomplete distributions that are arbitrarily close to any complete distribution, and if you kind of lump these arbitrarily-close-to-being-complete distributions with the complete ones, testing becomes possible (see Freyberger 2017).

In practice, identification is good, but if we want to estimate $g$ we need some sort of consistency result. On this front, we have the problem that the solution to \eqref{fredholm} may not be a continuous function of $\mathbb{E}[y\mid z]$ and $F(x\mid z)$. This discontinuity is a problem, because typically we’ll estimate stuff by getting a bunch of data so that our empirical versions of $\mathbb{E}[y\mid z]$ and $F(x\mid z)$ converge towards the truth, and then we invoke an argument by continuity that our resulting estimate of $g$ will also converge towards the truth as we get more data. If we have no guarantee of this continuity then there’s no guarantee that that our estimate of $g$ will converge to the truth even if our estimates of $\mathbb{E}[y\mid z]$ and $F(x\mid z)$ become arbitrarily good. You can often circumvent this kind of issue by regularizing things appropriately, but in that case you’ll have to argue consistency on a case by case basis. For example, Newey Powell 2003 provides some consistency results assuming that $g$ is some linear combination of some basis functions + $L^2$ regularization.

Overall, general theory for identification/estimation of nonparametric IV is somewhat incomplete, so that we don’t have the kind of totally solid guarantees that maybe we’d like.

#### I.4. Deep IV

Thankfully, we’re now at a point in history where it’s not too insane to just throw function approximators at various problems and have it work pretty well even if it’s not entirely clear why. So, even with a lack of theoretical guarantees about identification / estimation, it’s not a bad idea to just go ahead and just try to do something reasonable, and hope that things work out.

The core idea of the Deep IV paper is to do this very reasonable thing:

1. Do standard supervised learning to estimate $F(x\mid z)$. This is ‘stage1’.
2. Use your $\hat{F}(x\mid z)$ as a proxy for the truth $F(x\mid z)$, find $\hat{g}$ so that $\mathbb{E}[y\mid z] = \int d\hat{F}(x\mid z) \hat{g}(x)$ is approximately satisfied in your data. This is ‘stage2’

This is quite intuitive: if we have lots of data, we can get a pretty good estimate of $F(x\mid z)$, and also of $\mathbb{E}[y\mid z]$. We’re going to hope that we live in a world where $g$ is identified and not too ill-posed. If that’s the case, then as we get a lot of data, we’ll be able to learn $g$.

This strategy is also quite nice, because it breaks the problem of estimating $g$ into two supervised leanring problems, both of which are fairly easy to throw function approximators at.

1. For the stage1, both $x$ and $z$ are observable. The complication here is that, instead of trying to predict $\mathbb{E}[x\mid z]$ as a function of $z$, you’re trying to predict the entire distribution of $x$ given $z$. Distributions are generally much more difficult to predict, particularly if $x$ is multidimensional, but there’s various things you can do. Two immediate examples:
• Assume there’s some parametric form for $F(x\mid z)$, so you’re only predicting a finite number of parameters. E.g. assume normal / mixture of normals (this is what the Deep IV paper does).
• If the distribution of $x\mid z$ is just 1-dimensional, then you can just estimate quantiles of $F(x\mid z)$.
• Discretize your $x$-space and estimate a categorical distribution on the grid.
2. For stage2, this is also a supervised learning problem: You have $N$ observations $y_i, x_i, z_i$, and you’re trying to find some $\hat{g}$ such that \eqref{fredholm} holds approximately in your data. That is, you want to pickg $g$ that makes $\mathbb{E}[y\mid z]$ and $\int d\hat{F}(x\mid z) \hat{g}(x)$ very close to each other in your data. So, you can think about minimizing the squared loss:

where we’ve replaced the internal integral with a sum, because the integral $\int d\hat{F}(x\mid z_i) \tilde{g}(x))$ will need to be implemented by sampling a bunch of $x_{ij}\sim \hat{F}(x\mid z_i)$ and then taking a sum.

• You can parameterize your $\tilde{g}$ as a function of some stuff, in which case you can just go ahead and do gradient descent or something like that. The Deep IV paper uses a neural network here.
• There’s some amount of complication here, in that there’s a sum on the inside of your square that you need to evaluate every time you want to evaluate the objective function, and that’s kind of hard to deal with. So, one thing you can do is to just move the sum outside of the square. This amounts to using a different objective function (that upper bounds the previous one via Jensen):

With this, you can just randomly draw a bunch of $\hat{x}_{ij}$s for each $z_i$ according to $\hat{F}(x\mid z_i)$, treat this new set of $(y_i, \hat{x}_{ij}, z_i)$ as your data and then just do straight-up supervised learning on this instead. However, just swapping out one objective function for another is a bit questionable, and certainly if we had consistency before we would lose it by doing this (as we’ll see below, this fails in quite trivial cases). But at the end of the day, everything that we’re doing here is a bit suspicious from a theoretical perspective, and this swapping out kind of feels not that unreasonable. So, worth trying out, especially because it’s easy.

Ok, so now regarding the ‘deep’ in DeepIV: for both stage1 and stage2, the paper proposes to use neural networks. The stage1 model uses a neural network to learn a mixture of gaussians to approximate $F(x\mid z)$, and stage2 uses a neural network to optimize one of the two objectives \eqref{trueobj} or \eqref{upperobj}.

### II. Re-implementation

#### II.1. Implementation details

Deep learning is great, but we can just as well use other function approximators in its stead. Especially for applications domains other than image/text, more classical methods may often be preferable for ease/performance/interpretability reasons.

So I went ahead and wrote up an alternative implementation based on tree boosting and conditional quantiles rather than neural networks and Gaussian mixtures.

The main class of interest is NonparametricIV, implemented in npiv/nonparametric_iv.py, which encapsulates the two stages described above.

1. the stage1 model approximates $F(x\mid z)$ via some number of predicted quantiles
2. the stage2 model targets either the true objective \eqref{trueobj} or the upper bound objective \eqref{upperobj} via several possible function approximators
• there are two function approximators available here:
• the stage2_objective initialization argument defines whether to use the true objective \eqref{trueobj} or the upper bound objective \eqref{upperobj}
• this gives us 4 different possible configurations of the stage 2 model, which are implemeted in NonparametricIV.train_stage2()
• the two cases with upper bound objective are relatively straightforward:
• the two cases with the true objective are more involved, as the expression in \eqref{trueobj} has a sum inside the square
• this requires some custom objectives which we implment in npiv/custom_objectives.py)
• these custom objective assume that there is a grp_col variable that indicates which rows of the stage2_data correspond to the same observation (recall that we generated e.g. 10 predicted quantiles for each observation, so that we have e.g. 10 rows for each observation)
• grouped_sse_loss() is what it sounds like: it computes the sum of squared errors, except that we sum the predicted and true y-values within all observations with a single group before taking the squared error, which is what the true objective \eqref{trueobj} requires
• grouped_sse_linear() just uses a linear function to generated the predicted y-value, then computed the grouped sse loss
• grouped_sse_loss_grad_hess() analytically computes the first and second derivatives of grouped_sse_loss() with respect to each predicted y-value
• true objective, linear approximator:
• this involves optimizing the custom objective grouped_sse_linear()
• we do so by relying on some gradient-free optimization method from scipy (by default Nelder-Mead)
• true objective, LightGBM approximator
• LightGBM just requires that we provide the gradient and hessian of the objective function wrt the predicted y-values, which is why we have this grouped_sse_loss_grad_hess() function
• we just call LightGBM with this custom objective, using grouped_sse_loss() as the corresponding custom metric

In addition, in order to assess how good our trained models are, we’ll need some functions for introspecting the behavior of these models

• for linear models, you can just… look the coefficients
• for nonparametric models, like the LightGBM models, we can do something similar, where we look at the marginal effects of each variable

#### II.2. Synthetic experiment setting

We’ll build a synthetic setting that is (1) realistic-ish, and (2) fairly simple, to function as a minimal test for our implementation of nonparametric/deep iv. The simulation setup is linear in both stage1 and stage2, so as to be as trivial as possible.

Implementation is in the IVSimulator class in npiv/iv_simulator.py

#### II.3. Some points, illustrated via synthetic data

We can now go about applying NonparametricIV to data generated from IVSimulator to illustrate some interesting points.

This notebook contains the setup/experiments/graphs/whatever for everything discussed below. The section numbers below will refer to sections in this notebook (because apparently I can’t link directly to sections of a notebook in github?).

Some preliminaries:

• data:
• we’ll do a super simple example with 2 exogenous covariates
• a total of 50000 randomly generated observations for training:
• 20000 for stage1: 16000 for training, 4000 for early stopping
• 20000 for stage2: 16000 for training, 4000 for recording validation performance
• this is a lot of data for an extremely trivial setup
• preliminary analysis via linear regression and 2SLS:
• as expected, linear regression of log_sales on exog_x_cols and log_price does a poor job of recovering the true coefficients, due to endogeneity (cell [6])
• as expected, 2 stage least squares (e.g. linear IV) works quite well, as the simulation setting is in fact linear in both stage1 and stage2

Now, some points:

• NonparametricIV works… kind of?
• section 1.a. : the stage1 models are extremely good, and predicts quantiles that are extremely concentrated around the truth
• this is certainly far better than what you would expect on real life problems
• section 1.b. : the stage2 model’s marginal effects don’t look too terrible? (we use the true objective, not the upper bound objective here, for reasons discussed below)
• the average marginal effect on the log_price is around -3, whereas the truth is -4
• for the exog_x_cols the marignal effects are around 0, whereas truth is around .7
• however, there’s fairly large variance around the average marginal effects around there mean, which is due to the trained stage2 lgb model being kind of very jagged (see cell 16 for the partial dependency plot) relative to the truth (which is linear and thus smooth)
• compared to ignoring endogeneity, this approach is cleary better, which is unsurprising because endogeneity is a pretty severe issue in this simulation
• compared to 2SLS it’s cleary worse, which is unsurprising because 2SLS makes very restrictive linearity assumptions that happen to be correct here
• stage2 model is EXTREMELY sensitive to small imperfections in the stage1 quantiles
• in section 2, we do NonparametrIV as above, except that instead of predicting the conditional quantiles, we directly use the true ones
• the training data is the same 16000 observations
• section 2.c.IV. has the analogue of section 1 above, except with true quantiles rather than predicted
• performance is better than in section 1: marginal effect of log_price is centered at -3.6 vs -3.1 in section 1, and marigal effects on exog_x_cols are also closer to the truth
• this is a bit surprising, given how incrediby good the stage1 quantile models in section 1 were, that going from those predictions to truth still produced large improvements in stage2 performance
• similarly if we take the true quantiles and add noise to them as in section 2.e., then performance predictably decreases
• this is thematically similar to a fairly key point from another economics/machine learning paper on estimating stuff in two stages, which requries L2 convergence of some stage1 estimate in order to get $\sqrt{n}$-consistent estimates of some stage2 parameter
• regularization is pretty hard to get right
• in both sections 1.b and 2.c.IV, the loss-minimizing number of rounds to train was very small (<1000), and the implied marginal effects computed at that point would have been much worse than the ones generated after training for 10000 rounds
• thus, performing early stopping via a validation set (as you would do in a typical supervised learning context) leads to significant bias
• however, the variance of the estimated marginal effects is also much smaller if we do early stopping
• this feels like a fairly standard example of bias-variance tradeoff
• in section 2.d. we increase a regularization parameter dramatically (min_gain_to_split), and then re-train the stage2 model
• this leads to biased marginal effects, e.g. on average around -2 for log_price rather than the true value of -4
• the validation loss function is actually lower here relative to 2.c.IV
• thus, it’s unclear if model selection via validation-set performance is a good approach here
• upper-bound objective is trivially inconsistent
• this is easy to see:
• in 2SLS, the stage2 just takes each observation and replaces the log_price and replaces it with $\mathbb{E}[$log_price $\mid$ exog_x_cols, instrument$]$
• in the stage2 of NonparametricIV, we’re producing multiple quantiles of (log_price $\mid$ exog_x_cols, instrument) instead of just the mean
• this is identical to taking the specification in stage2 of 2SLS and adding noise to $\mathbb{E}[$log_price $\mid$ exog_x_cols, instrument$]$
• this is errors in variables, which leads to attenuation bias
• see sections 2.c.I for the linear version, and 2.c.III for the LGB version; both produce quite bad results

#### II.4. Potential follow-up work

So, overall it looks like this core idea of breaking up nonparametric instrumental variables into two different supervised learning problems has some promise. Some points regarding future work:

• sharing information across quantiles
• right now, each quantile model is trained separately
• there may be some benefit to jointly estimating the conditional distribution of $F(x\mid z)$ instead
• the DeepIV implementation does something like this, but at the cost of imposing a somewhat more structured parametric form for this distribution
• maybe some post-processing shrinkage of the predicted quantiles so that e.g. implied standard deviation for similar observations don’t vary too much
• regularization / early stopping seems hard to get right in stage2, so maybe consider using some algorithm in stage2 that doesn’t overfit so easily
• so, probably some averaging-based method like random forests, which is somewhat more robust to overfitting than gradient boosting or neural networks
• didn’t do this here because the LightGBM random forest doesn’t appear to currently support custom objectives…
• also, sample splitting is a pretty reliable way to robustify stuff, so maybe do that as well in training these models (e.g. build trees using one subset of the data, fill in the leaf values with another)
• again, something that would take some more effort to implement
• as a minimal example, we’ve only evaluated this algorithm on a fairly trivial example here, so we can probably do some more examples to build intuition
• probably try more data
• probably try more covariates
• probably try some nonlinearities
• the custom objectives used here are very slow, as they’re implemented using aggregation functions in pandas see npiv/custom_objectives.py)
• this is fine for protyping, but taking 10 minutes to train a model on 16000 observations renders this fairly impractical for any real-life work
• so maybe modify LightGBM and directly implement these loss functions in C++