• suppose we have IID data \((x_i,y_i)_{i=1...N}\) with \(x_i\in \mathbb{R}^D\), \(y_i\in\mathbb{R}\)
  • we’re often interested in estimating some quantiles of the conditional distribution \(Y\mid x\).
    • as in, for some \(\alpha\in(0,1)\), we want to estimate this: \(\displaystyle \begin{equation} f_{\alpha}(x):= y \text{ s.t. } \mathbb{P}(Y<y\mid x)=\alpha \tag{1} \label{quantiledef} \end{equation}\)
  • all else being equal, we would prefer to approximate \(f_\alpha\) flexibly, as opposed to e.g. putting restrictive assumptions on its form (e.g. considering only linear functions).
  • one way of doing this flexible approximation that works fairly well in practice is gradient tree boosting
  • this method is quite popular for general function approximation purposes, and has some quite good implementations, one of which is LightGBM
  • however, it turns out the naive implementation of quantile regression for gradient boosting has some issues
  • we’ll:
    • describe what gradient boosting is and why it’s the way it is
    • discuss why quantile regression presents an issue for gradient boosting
    • look into how LightGBM dealt with it, and why they dealt with it that way

I. Background

Loss functions

Assume we have \((x_i,y_i)\) as in the intro, and that we want to estimate some statistic of the conditional distribution \(Y\mid x\), call it \(f(x)\). Often, there’s some loss function \(L\) associated with this \(f\), so that \(f\) minimizes the loss function \(L\) over the true distribution: \(\displaystyle \forall x, f(x) = \arg\min_{\tilde{y}}\mathbb{E}[L(\tilde{y}, Y) \mid x]\)

For concreteness, two leading examples:

  • If \(f(x)\) is \(\mathbb{E}[Y\mid x]\), then we can pick \(L(\tilde{y}, y) := (\tilde{y}-y)^2\) and satisfy the above relation
    • To see this, just note that \(L\) is convex and take the first order condition.
  • If \(f(x)\) is the quantile function \(f_{\alpha}(x)\) defined in the \eqref{quantiledef}, then we can use \(\displaystyle \begin{equation} L(\tilde{y}, y) = [\mathbb{1}_{\tilde{y}<y}\alpha + \mathbb{1}_{\tilde{y}\geq y}(1-\alpha) ] \vert\tilde{y}-y\vert \tag{2} \label{quantileloss} \end{equation}\)
    • This again follows by convexity of \(L\) and first order condition
    • For intuition, note that if \(\alpha\) is close to 1, then \(L\) essentially only penalizes situations where the real \(y\) exceeds the prediction \(\tilde{y}\), so the argmin \(\tilde{y}\) should be close to the max value that \(Y\) can take, which corresponds to a quantile close to 1 (see the numerical check just below).
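
As a quick sanity check, here’s a minimal sketch (the function and the simulated data are ours, nothing library-specific) verifying that minimizing the empirical average of \eqref{quantileloss} recovers the empirical \(\alpha\)-quantile:

```python
import numpy as np

def quantile_loss(y_pred, y, alpha):
    # Eq. (2): under-predictions cost alpha per unit of error,
    # over-predictions cost (1 - alpha) per unit
    diff = y - y_pred
    return np.where(diff > 0, alpha * diff, (alpha - 1) * diff)

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)
alpha = 0.9

# Minimize the average loss over a grid of candidate constants
candidates = np.linspace(-3, 3, 1201)
avg_loss = [quantile_loss(c, y, alpha).mean() for c in candidates]
print(candidates[np.argmin(avg_loss)])  # should be close to...
print(np.quantile(y, alpha))            # ...the empirical 0.9-quantile
```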

So, if we want to find \(\hat{f}(x)\) that approximates \(f(x)\), a quite reasonable thing to do is to try to pick \(\hat{f}\) out of some class of functions \(\mathcal{F}\) to minimize the corresponding empirical average (or equivalently, the sum) of \(L\) over our data:

\[\displaystyle \hat{f} = \arg\min_{\tilde{f}\in\mathcal{F}} \sum_i L\left(\tilde{f}(x_i), y_i\right)\]

Boosting

Alright, so let’s say we’re estimating \(\hat{f}\) by optimizing some empirical \(L\). We’ll typically want to do this estimation fairly flexibly, so we’ll want \(\mathcal{F}\) to be fairly complex (as opposed to e.g. considering only linear functions), but complex \(\mathcal{F}\) makes the optimization harder. ‘Boosting’ is a general machine learning procedure that circumvents this issue by iteratively building up \(\hat{f}\) as a sum of ‘base learners’ ( = simple functions). This approach allows \(\mathcal{F}\) to be quite complex (since sums of many simple functions can be quite complicated), but at the cost of making the \(\arg\min\) approximate:

Boosting heuristic:

  1. Define a space of base learners \(\tilde{\mathcal{F}}\)
  2. For \(t=1,...,T\), (for some \(T\)), choose some base learner \(\displaystyle\hat{f}_t\) that incrementally improves our loss, e.g. \(\begin{equation} \displaystyle\hat{f}_t := \arg\min_{\tilde{f}\in\tilde{\mathcal{F}}}\sum_i L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{f}(x_i), y_i\right) \label{baselearning0} \tag{3} \end{equation}\)
  3. Use \(\displaystyle\hat{f}^T(x):=\sum_{t=1}^T\hat{f}_t(x)\) to approximate the true \(f\)

So, boosting amounts to incrementally building up an estimate \(\hat{f}\) of \(f\) by doing a bunch of small improvements. Note that the model space here for \(\hat{f}\) can be highly complex, as sums of a bunch of simple functions can be quite complicated. However, the process involved in getting each \(\hat{f}_t(x_i)\) is still (by assumption) tractable.
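
For concreteness, here’s a minimal sketch of this loop for squared-error loss, using depth-1 scikit-learn trees as the base learners (everything here, from the names to the choice of stumps to the residual-fitting shortcut, is illustrative, not how any particular library does it):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, T=100):
    """The boosting heuristic above, specialized to squared-error loss."""
    learners, pred = [], np.zeros(len(y))
    for t in range(T):
        # For squared loss, minimizing eq. (3) over base learners is the same
        # as fitting the base learner to the current residuals y - pred
        f_t = DecisionTreeRegressor(max_depth=1).fit(X, y - pred)
        pred += f_t.predict(X)  # update the running sum f_hat^t
        learners.append(f_t)
    return learners

def predict(learners, X):
    return sum(f.predict(X) for f in learners)  # f_hat^T = sum of base learners
```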

Tree boosting

Typically in practice, the space of base learners \(\tilde{\mathcal{F}}\) is some set of binary trees with a constraint on the max depth or max number of nodes. Furthermore, the base learners \(\hat{f}_t\) are chosen by building the trees up greedily node-by-node, as opposed to globally optimizing as in \eqref{baselearning0}, since doing that would be too computationally intensive.

Tree boosting:

  1. Define a space of base learners \(\tilde{\mathcal{F}}\) = all trees up to a certain depth or node count
  2. For \(t=1,...,T\), (for some \(T\)), grow a tree \(\hat{f}_t\in \tilde{\mathcal{F}}\) by doing this:
    1. Start with a root node, and put all the data in this node. Let \(A\) denote the set of observations in this node.
      • Compute the optimal constant prediction \(\displaystyle \hat{y}_A := \arg\min_{\tilde{y}\in\mathbb{R}}\sum_{i\in A} L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i)+ \tilde{y} , y_i\right)\)
      • Record the corresponding value of the loss \(\displaystyle l_A :=\sum_{i\in A} L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \hat{y}_A, y_i\right)\)
    2. Split the data in the node into two more nodes via this process:
      1. For each dimension \(j\) of \(x\) (recall that \(x\) is \(D\)-dimensional), and for each cutoff \(c\in\mathbb{R}\) between the min and max values of \(x^j\) (where \(x^j\) denotes the \(j\)th dimension of \(x\)):
        • Use this \(c\) to split the data along dimension \(j\) into two sets \(B\) and \(C\), where \(B:=\{i\in A : x^j_i<c\}\) and \(C := A\backslash B\) contains the remaining observations
        • Find \(\hat{y}_B, l_B\), \(\hat{y}_C, l_C\) as we did for \(A\)
        • Compute the performance gain \(g(j, c) := l_A-(l_B+l_C)\), which is a measure of how much improvement there was to our loss function when we split the data along dimension \(j\) at value \(c\)
      2. Find the optimal \((j, c)\) by maximizing the performance gain \(g(j,c)\) over all possible \((j, c)\)s.
      3. If the optimal gain exceeds some minimum threshold, and our tree doesn’t have too many nodes already, use this \((j,c)\) to split \(A\) into \(B\) and \(C\), thus adding two more nodes to our tree
      4. Otherwise, we’re done
    3. Recursively apply the data-splitting process of step 2.2 to \(B\) and \(C\)
    4. Eventually, this process terminates
    5. The leaves of the resulting tree form a partition of the domain of \(x\), and all data that falls within a given cell of the partition gets assigned the number \(\hat{y}\) computed during the tree-building process
    6. This resulting piecewise constant function is \(\hat{f}_t\)
  3. Use \(\displaystyle\hat{f}^T(x):=\sum_{t=1}^T\hat{f}_t(x)\) to approximate the true \(f\)

The process of growing each tree \(\hat{f}_t\) amounts to repeated greedy splitting: first, see what our loss would be if we just had a constant predictor for everything in this node, and then compare it to the loss if we broke this node into two nodes and fit separate constants for each. This is greedy in the sense that it optimizes split selection without taking into account further-down-the-road splits that may occur.
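
Here’s a toy sketch of one round of this greedy split search, again for squared-error loss, where the optimal constant in a node is the mean (function and variable names are ours; real implementations histogram the features rather than looping over raw cutoffs):

```python
import numpy as np

def node_loss(y):
    # Squared-error loss of the best constant predictor (the mean)
    return ((y - y.mean()) ** 2).sum() if len(y) else 0.0

def best_split(X, y):
    """One round of step 2.2: exhaustive search over (dimension, cutoff) pairs."""
    l_A = node_loss(y)
    best_j, best_c, best_gain = None, None, 0.0
    for j in range(X.shape[1]):             # every dimension j
        for c in np.unique(X[:, j])[1:]:    # every candidate cutoff c
            in_B = X[:, j] < c              # B = {i in A : x_i^j < c}
            gain = l_A - (node_loss(y[in_B]) + node_loss(y[~in_B]))
            if gain > best_gain:
                best_j, best_c, best_gain = j, c, gain
    return best_j, best_c, best_gain
```

Note that every \((j, c)\) evaluation here makes a fresh \(O(\lvert A\rvert)\) pass over the node’s data, which is exactly the cost accounted for in the next paragraph.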

Note that in order to determine the optimal splits, the algorithm tests every possible cutoff point \(c\) of every possible dimension \(j\), and for each \((j,c)\) pair, computes the optimal constants \(\hat{y}_B, \hat{y}_C\) and the corresponding losses \(l_B, l_C\) for the two partitions of the data \(B,C\). Generically, computing the loss on a set of observations \(B\) will require \(O(\lvert B \rvert)\) operations, as one would typically need to compute \(L(\hat{y}_B, y_i)\) for each observation \(i\in B\). So, if we assume that computing the optimal \(\hat{y}_B\) is also \(O(\lvert B\rvert)\), the time complexity of growing the entire tree would be \(\displaystyle O\left(\sum_{h=1}^H\sum_{B\in h-\text{level nodes}} \lvert B\rvert DK\right) = O(HNDK)\), and thus the time complexity of all the splitting needed for the entire boosting procedure would be \(\displaystyle O(THNDK)\), where

  • \(T\) is the number of trees grown
  • \(H\) is max height of the tree
  • \(D\) is the dimensionality of \(x\) = number of features
  • \(K\) is the number of cutoff points to consider for each feature
    • this amounts to discretizing each dimension of \(x\) into \(K\) distinct values
    • e.g. \(K=255\) by default in LightGBM
  • \(N\) is number of observations

Gradient tree boosting

The fact that the splitting procedure 2.2 above appears to require re-computing the loss function for every dimension of \(x\) and for every potential cutoff point seems like it would somewhat limit the applicability of tree boosting, especially for large data sets. Probably the main innovation of gradient boosting as implemented in e.g. XGBoost/LightGBM/others is resolving this issue by using a second-order approximation of the loss function \(L\). Certainly, the fact that these implementations run quite quickly is a major reason for their popularity.

Letting \(\displaystyle\hat{f}^t(x) := \sum_{t^\prime<t}\hat{f}_{t^\prime}(x)\), for some set of observations \(B\) we can write

\[\begin{align} &\sum_{i\in B}L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}, y_i\right) = \sum_{i\in B}L\left(\hat{f}^t(x_i) + \tilde{y}, y_i\right) \newline &\approx \sum_{i\in B}\left(L\left(\hat{f}^t(x_i) , y_i\right) + \partial_1L\left(\hat{f}^t(x_i) , y_i\right)\tilde{y} + \frac{1}{2}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)\tilde{y}^2 \right) \end{align}\]

From which it follows that the optimal choice for \(\tilde{y}\) would be

\[\begin{equation} \hat{y}_B = -\frac{\sum_{i\in B}\partial_1L\left(\hat{f}^t(x_i) , y_i\right)}{\sum_{i\in B}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)} \label{gbtyhat} \tag{4} \end{equation}\]

and the corresponding approximate loss

\[\begin{align} l_B &= \sum_{i\in B}L\left(\hat{f}^t(x_i) + \tilde{y}, y_i\right) \newline &\approx \sum_{i\in B}L\left(\hat{f}^t(x_i) , y_i\right) + \sum_{i\in B}\partial_1L\left(\hat{f}^t(x_i) , y_i\right)\hat{y}_B + \frac{1}{2}\sum_{i\in B}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)\hat{y}_B^2 \newline &=\sum_{i\in B}L\left(\hat{f}^t(x_i) , y_i\right) - \frac{1}{2}\frac{\sum_{i\in B} \partial_1L\left(\hat{f}^t(x_i) , y_i\right)^2}{\sum_{i\in B}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)} \label{gbtloss} \tag{5} \end{align}\]

which implies that the gain from splitting a set of observations \(A\) into two parts \(B\) and \(C\) is purely a function of the per-node sums of first and second derivatives, \(\sum_{i\in B} \partial_1L\left(\hat{f}^t(x_i) , y_i\right)\) and \(\sum_{i\in B}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)\) (and analogously for \(A\) and \(C\)); the \(\sum_{i}L\left(\hat{f}^t(x_i) , y_i\right)\) terms cancel when computing \(l_A-(l_B+l_C)\).
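
As a sanity check, take squared loss \(L(\tilde{y},y)=(\tilde{y}-y)^2\), so that \(\partial_1L\left(\hat{f}^t(x_i), y_i\right) = 2\left(\hat{f}^t(x_i)-y_i\right)\) and \(\partial_1^2L = 2\). Then \eqref{gbtyhat} gives

\[\hat{y}_B = -\frac{\sum_{i\in B}2\left(\hat{f}^t(x_i)-y_i\right)}{\sum_{i\in B}2} = \frac{1}{\lvert B\rvert}\sum_{i\in B}\left(y_i-\hat{f}^t(x_i)\right)\]

i.e. the mean residual in the node, which is exactly the optimal constant in this case (the second-order ‘approximation’ is exact here, since squared loss is quadratic).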

Note that this allows the algorithm to take just a single pass through the data for each dimension of \(x\), rather than having to do so for each (dimension, cutoff point) pair. Let \(j\) be the dimension and \(c\) the cutpoint, and let \(B:=\{i\in A : x^j_i<c\}\) and \(C:= A\backslash B\). Suppose we’ve calculated the corresponding sums of gradients and hessians for the observations in \(B\) and \(C\) (from which computing the \(\hat{y}\) and \(l\) is \(O(1)\)). Then, when we move on to the next-largest cutpoint \(c^\prime\), the corresponding \(B^\prime\) will differ from \(B\) only in a few additional points, so the gradient and hessian sums can be updated by simply adding the values for these new points. As the cutoff points go from smallest to biggest, extra observations are incrementally added to \(B\), so that only a single pass through the data is required for each dimension of \(x\).

So, given that this formulation of the problem allows a single pass through the data per dimension, as opposed to one per (dimension, cutoff) pair, the cost of all the splitting needed for building \(T\) trees becomes \(O(THND)\) instead of the \(O(THNDK)\) above. This is significant, as \(K\) is typically large (e.g. 255 by default in LightGBM).
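
Here’s a minimal sketch of this single-pass search over one feature, working directly from per-observation gradients and hessians as in \eqref{gbtyhat} and \eqref{gbtloss} (the names and the small \(\lambda\) guard in the denominators are ours; LightGBM works over pre-binned histograms rather than raw sorted values):

```python
import numpy as np

def best_split_1d(x, g, h, lam=1e-6):
    """x: one feature column; g, h: per-observation 1st/2nd derivatives of L."""
    order = np.argsort(x)
    xs = x[order]
    G, H = g.sum(), h.sum()      # totals over the parent node A
    gl = hl = 0.0                # running sums over the left node B
    best_gain, best_cut = 0.0, None
    for pos in range(len(xs) - 1):
        i = order[pos]
        gl += g[i]; hl += h[i]   # incremental update: one new point joins B
        if xs[pos] == xs[pos + 1]:
            continue             # can't place a cutoff between equal values
        gr, hr = G - gl, H - hl  # right node C by subtracting from the totals
        # gain l_A - (l_B + l_C) from eq. (5); the sum-of-L terms cancel
        gain = 0.5 * (gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam))
        if gain > best_gain:
            best_gain, best_cut = gain, 0.5 * (xs[pos] + xs[pos + 1])
    return best_cut, best_gain
```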

LightGBM’s implementation

This is implemented in LightGBM in the FindBestThresholdSequence() function.

  • see lines 565 onwards for the implementation described above
    • sum_left_gradient, sum_left_hessian refer to \(\sum_{i\in B} \partial_1L\left(\hat{f}^t(x_i) , y_i\right)\) and \(\sum_{i\in B}\partial_1^2L\left(\hat{f}^t(x_i) , y_i\right)\)
    • lines 585 onwards iterates through all the relevant cutoff points
    • lines 589-593 performs the incremental updating of the sum_left_gradient and sum_left_hessian
    • lines 595, 599, 603 are various regularizations that terminate the algorithm when there’s not enough data
    • lines 601 and 605 compute the sum_right_gradient and sum_right_hessian by just subtracting the corresponding left hessian/gradient from the total sum
  • CalculateSplittedLeafOutput() computes the best constant prediction for a given set of observations, denoted by \(\hat{y}\) above, though there are differences due to some regularization that we didn’t discuss above (sketched just after this list):
    • L1 regularization, which amounts to zeroing out \(\hat{y}\)s that are too small
    • L2 regularization, which amounts to inflating the hessian, which shrinks \(\hat{y}\) towards 0
    • max_delta_step, which prevents the model from updating too quickly by capping the magnitude of \(\hat{y}\)
  • GetSplitGains() computes the split gain information, corresponding to \(l_A-(l_B+l_C)\) above
    • this calls GetLeafSplitGainGivenOutput() in order to compute the left and right losses (which correspond to \(l_B\) and \(l_C\) above)
    • the implementation of GetLeafSplitGainGivenOutput() corresponds to \eqref{gbtloss} above, with adjustments for regularization
      • note that this function ignores the \(\sum_{i}L\left(\hat{f}^t(x_i) , y_i\right)\) in \eqref{gbtloss}, because when computing the gain \(l_A-(l_B+l_C)\), this ends up being canceled out
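
Based on that description, the regularized version of \eqref{gbtyhat} plausibly looks something like the following (a Python paraphrase under our own names, not LightGBM’s actual C++):

```python
def leaf_output(sum_grad, sum_hess, l1=0.0, l2=0.0, max_delta_step=0.0):
    """A sketch of a regularized version of eq. (4) for one leaf."""
    # L1: shift the gradient sum toward zero; if it lies within [-l1, l1],
    # the leaf output is zeroed out entirely
    if sum_grad > l1:
        num = sum_grad - l1
    elif sum_grad < -l1:
        num = sum_grad + l1
    else:
        return 0.0
    # L2: inflate the hessian sum, shrinking the output towards 0
    out = -num / (sum_hess + l2)
    # max_delta_step: cap the magnitude of any single leaf's update
    if max_delta_step > 0.0:
        out = max(-max_delta_step, min(max_delta_step, out))
    return out
```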

II. Quantile regression

Gradient boosting relies on approximating the true objective \(L\) with a second-order expansion, and then picking the splits and predictions to optimize this approximate loss. Thus it seems plausible that if the second-order approximation to \(L\) is bad, the quality of our predictions may suffer. This is exactly the issue with quantile regression.

In order for \(L\) to be well approximated by its second-order expansion, the local curvature of \(L\) has to contain some information about where \(L\) is optimized. Unfortunately, this is not the case for the quantile loss in \eqref{quantileloss}: it has uniformly zero second derivatives, which provide zero information about where it is optimized. Compare this with e.g. a quadratic loss function, where if you know the local curvature and gradient you can reconstruct the coefficients of the quadratic and thus back out exactly where the function is minimized.

However, note that the quantile loss \eqref{quantileloss} does have well-defined derivatives almost everywhere

  • \(\partial_1L(\tilde{y}, y)\) is either \(-\alpha\) or \(1-\alpha\), depending on whether \(\tilde{y}<y\) or \(\tilde{y}>y\)
  • \(\partial_1^2L(\tilde{y}, y)\) is 0
  • at \(\tilde{y}=y\), both derivatives are undefined (but this is measure 0 so it doesn’t matter)

As a result, you can just… literally plug these into \eqref{gbtyhat} and \eqref{gbtloss} to get the relevant predictions \(\hat{y}\) and losses \(l\) (though you would have to add some L2 regularization so that the denominator isn’t uniformly 0).
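
In LightGBM terms, this naive approach amounts to something like the following custom objective (a sketch; the names, and the constant dummy hessian playing the role of that L2 regularization, are ours):

```python
import numpy as np

ALPHA = 0.9

def naive_quantile_objective(y_pred, dataset):
    """Custom-objective sketch: true quantile gradients, placeholder hessians."""
    y_true = dataset.get_label()
    grad = np.where(y_pred < y_true, -ALPHA, 1.0 - ALPHA)  # per the list above
    hess = np.ones_like(grad)  # true 2nd derivative is 0; use a constant instead
    return grad, hess

# This can be passed to lgb.train() as a custom objective (the exact argument
# name depends on the LightGBM version), at which point eqs. (4)-(5) reduce to
# simple sums of the +/-alpha gradients.
```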

Until early this year, LightGBM’s quantile regression was essentially this (or some slight variant). Then some people started noticing that this was resulting in poor performance, and the devs pushed some changes that appear to have improved performance significantly. The fix was to grow the trees as above, but then, once each tree is grown, go back and re-compute the values in each leaf so that they minimize the true loss function \eqref{quantileloss} instead of the second-order approximation.

To be a bit more precise, what LightGBM does for quantile regression is:

  • grow the tree as in the standard gradient boosting case
  • after a tree is grown, we have a bunch of leaves of this tree
  • these leaves partition our data into a bunch of regions
  • for each leaf/partition \(B\), there is some constant \(\hat{y}_B\), computed via \eqref{gbtyhat}, which minimizes the second-order approximation of the true quantile loss function
  • discard these constants
  • now, for each leaf, find the constant that actually minimizes the loss function \eqref{quantileloss} instead of the second-order approximation (explained below)
  • use these new \(\hat{y}_B\)s instead of the old ones
  • the resulting \(\hat{f}_t\) is a piecewise constant function with values given by these new \(\hat{y}_B\)s

Computing the \(\hat{y}_B\) that minimizes the true loss \eqref{quantileloss} over some set of observations \(B\) is just a matter of finding the empirical quantile. As in, using the \(L\) from \eqref{quantileloss}, we have:

\[\begin{align} &\sum_{i\in B}L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}, y_i\right)\newline &= \sum_{i\in B}\left[\mathbb{1}_{\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}<y_i\right)}\alpha + \mathbb{1}_{\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}\geq y_i\right)}(1-\alpha) \right] \left\lvert\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}-y_i\right\rvert \newline &= \sum_{i\in B}L\left(\tilde{y}, y_i-\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i)\right) \end{align}\]

so that \(\arg\min_{\tilde{y}}\sum_{i\in B}L\left(\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i) + \tilde{y}, y_i\right)\) is just the empirical \(\alpha\)-quantile of \(y_i-\sum_{t^\prime<t}\hat{f}_{t^\prime}(x_i)\) computed over the set of observations \(i\in B\). Thus, re-computing the optimal \(\hat{y}_B\) in a given leaf amounts to just finding an empirical quantile.
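
In a sketch (the names are ours; leaf_index[i] is assumed to identify the leaf that observation \(i\) falls into):

```python
import numpy as np

def refit_leaf_values(leaf_index, y, current_pred, alpha):
    """Replace each leaf's second-order constant with the empirical
    alpha-quantile of the residuals in that leaf."""
    residuals = y - current_pred  # y_i - f_hat^t(x_i)
    new_values = {}
    for leaf in np.unique(leaf_index):
        in_leaf = residuals[leaf_index == leaf]
        new_values[leaf] = np.quantile(in_leaf, alpha)
    return new_values
```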

This is a quite reasonable approach from a time-complexity perspective:

  • this only adds \(O(N)\) operations per tree
    • finding the constant that minimizes the true quantile loss over a set of observations \(B\) can be done in \(O(\lvert B\rvert)\) time via any \(O(n)\) selection algorithm, e.g. quickselect (see the sketch after this list)
    • so, the total complexity of finding the optimal values for all the leaves is just \(O(N)\), since the leaves partition our data
  • so, in total we need to perform \(O(TN)\) extra operations over the entire boosting process
  • which is small relative to the \(O(THND)\) complexity of all the splitting we need to do anyways
  • compare this to using the true loss function for splitting, which would slow us back down to the non-gradient tree boosting case
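
For instance, a single order statistic can be pulled out without a full sort (a sketch; this uses a simple order-statistic convention rather than the interpolation that np.quantile does):

```python
import numpy as np

def fast_quantile(values, alpha):
    # np.partition uses introselect, which is O(n) on average: after it runs,
    # position k holds the k-th smallest element
    k = int(alpha * (len(values) - 1))
    return np.partition(values, k)[k]
```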

Even absent time considerations, this is not that unreasonable

  • during the tree-growing process we’re using a second-order approximate loss function instead of the true one
  • this approximate loss can be seen as a noised up version of the true loss
  • so, the best split under this approximate loss might not correspond to the best split under the true loss
  • however, the splitting procedure is greedy, so even with the true loss it’s not going to be globally optimal anyways
  • furthermore, we generally force the tree to select non-greedily-optimal splits by randomly restricting the set of observations or dimensions to split on, as a form of regularization
  • thus, using the approximate loss is effectively just another form of regularization during the tree growing process
  • we only optimize the true loss at the leaves, since leaves are the only places where the actual predicted values matter

Implementation in LightGBM:

This implementation is consistent with what we verbally described above: each tree is built normally, and after that finishes, the algorithm goes through each leaf and re-computes its optimal prediction, which is just the leaf-wise empirical \(\alpha\)-quantile.
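
As a usage example (the data here is synthetic and purely illustrative), this post-fix behavior is what you get out of the box with objective="quantile":

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(10_000, 3))
# Heteroskedastic noise, so the conditional 0.9-quantile genuinely varies with x
y = X[:, 0] + rng.normal(scale=0.1 + X[:, 1], size=len(X))

model = lgb.LGBMRegressor(objective="quantile", alpha=0.9, n_estimators=100)
model.fit(X, y)

q90 = model.predict(X)
print((y <= q90).mean())  # in-sample coverage; should come out roughly 0.9
```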

Does it actually work?

References