Gradient Boosting in Python | Austin David Brown

$ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $ Previously, we constructed regression trees and found that single trees predict sub-optimally. Gradient boosting is a supervised learning algorithm. Originally, boosting was derived as an ensemble method of weak-learners and later Friedman derived gradient boosting in terms of gradient descent in a function space. Friedman's excellent paper is the interpretation we will follow. We implement a naive gradient boosting algorithm for binary classification using the implementation from XGBoost that was originally published by Friedman. A softmax implementation would be more complicated. However, one could alternatively fit multiple binary classification gradient boosting algorithms to perform multiple classification.

The exposition is derived from the XGBoost paper. Let $\mathcal{D}_n = (X_i, Y_i)_{i = 1}^n$ be the i.i.d training sample. We wish to find an $F \in \mathcal{F}$ that minimizes the loss $L : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_{+}$ where $\mathcal{F}$ is the function space of regression trees. Let $f$ be a tree so that $f(x) = \sum_{l =1}^L w_l I(x \in R_l)$ where partitions $(R_l)$ and weights $(w_l)$ correspond to the number of leaves $L$. We approximate a gradient step by using the Taylor expansion \[ L(y, F_k + f) \approx L(y, F_k) + g^T f + \frac{1}{2} f^T Hf \] where $g$ is the gradient vector and $H$ is the Hessian matrix and solve for the optimial $f$ which yields \[ f = -H^{-1} g. \] Now, we use this approximation to determine the weight computation and the splitting measure to construct the tree $f$. We fit each leaf with a constant weight and approximate $H$ by a diagonal matrix $diag_i(h_i)$. Let $I_l$ be the indices for the leaf $l$. Then the optimal weight in leaf $l$ is \[ w^*_l = -\frac{\sum_{i \in I_l} g_i}{\sum_{i \in I_l} h_i} \] and substituting for the optimal leaf loss yields \[ L_l^*(y, F + f) = L(y, F_k) -\frac{1}{2} \frac{ \left( \sum_{i \in I_l} g_i \right)^2 }{\sum_{i \in I_l} h_i}. \] Let $I, I_R, I_L$ be the indices of a node and its left and right leaves respectively. We can ignore the constant term and approximate the change in optimal loss between the parent node and the two leaves by \[ \frac{ \left( \sum_{i \in I} g_i \right)^2 }{\sum_{i \in I} h_i} - \frac{ \left( \sum_{i \in I_L} g_i \right)^2 }{\sum_{i \in I_L} h_i} - \frac{ \left( \sum_{i \in I_R} g_i \right)^2 }{\sum_{i \in I_R} h_i} \] This may need to be modified to a weighted average depending on the loss. Let $f_0$ be an initial approximation to $y$ and $\eta$ be the learning rate. We iteratively perform gradient descent in the space of regression trees by \[ L(y, F^{k + 1}(x)) = L\left( y, F^{k}(x) + \eta f^k(x) \right). \] The algorithm is Algorithm: Gradient Boost Choose initial value $f_0$ and learning rate $\eta$. repeat until convergence $f \leftarrow$ Fit a Regression tree to the approximated loss $L(y, f_{k} + f)$ $f_{k + 1} \leftarrow f_k + \eta f$

For this implementation, we use the 2 class entropy loss with a sigmoid transformation $\sigma: \mathbb{R} \to [0, 1]$: \[ H(y, F) = \sum_{i = 1}^n y_i \log(\sigma(F_i)) + (1 - y_i) \log(1 - \sigma(F_i)). \] The gradient and diagonal Hessian approximation are \begin{align*} &g = \sigma(F) - y \\ &diag(h_i) = diag\left( \sigma(F_i) (1 - \sigma(F_i)) \right). \end{align*}

We put this theory together in a naive implementation with Python and using some of the ideas from the structure of the XGBoost source. First, we implement a class for the loss function.

import numpy as np
import matplotlib.pyplot as plt

class BinaryEntropy():
  @staticmethod
  def sigmoid(x):
    return np.exp(x) / (1 + np.exp(x))

  # gradient
  @staticmethod
  def g(y_1, y_2):
    return BinaryEntropy.sigmoid(y_2) - y_1

  # Hessian diagonal
  @staticmethod
  def H_diag(y_1, y_2):
    return BinaryEntropy.sigmoid(y_2) * (1 - BinaryEntropy.sigmoid(y_2))

  @staticmethod
  def compute(y, p_pred):
    return -1/y.shape[0] * (np.sum(y * np.log(p_pred) + (1 - y) * (np.log(1 - p_pred))))

  @staticmethod
  def approx(y, y_pred):
    eps = 10**(-16)
    return -1/2 * 1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred))**2

  @staticmethod
  def compute_weight(y, y_pred):
    eps = 10**(-16)
    return -1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred))

  @staticmethod
  def predTransform(w):
    return BinaryEntropy.sigmoid(w)

Next, we modify the previous regression tree code to utilize the new loss class.

class GBTree():
  @staticmethod
  def split_data(X, y, feature_index, feature_value):
    return {
      "I_left": np.where(X[:, feature_index] <= feature_value)[0],
      "I_right": np.where(X[:, feature_index] > feature_value)[0],
    }

  @staticmethod
  def greedy_best_split(X, y, y_pred, loss):
    best_feature_index = 0
    best_split_value = 0
    best_dloss = 0
    best_split = {
      "I_left": np.array([]),
      "I_right": np.array([]),
    }

    n_features = X.shape[1]
    parent_loss = loss.approx(y, y_pred)
    for feature_index in range(0, n_features):
      split_values = np.unique(X[:, feature_index])
      for split_value in split_values:
        split = GBTree.split_data(X, y, feature_index, split_value)

        # If there is a split
        if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0:
          # Compute loss change and update if the change in loss is the best so far
          dloss = parent_loss - loss.approx(y[split["I_left"]], y_pred[split["I_left"]]) - loss.approx(y[split["I_right"]], y_pred[split["I_right"]])
          if dloss >= best_dloss:
            best_feature_index = feature_index
            best_split_value = split_value
            best_split = split
            best_dloss = dloss
    return best_dloss, best_feature_index, best_split_value, best_split

  @staticmethod
  def fit_tree(X, y, y_pred, loss, depth = 1, 
               max_depth = 100, tolerance = 10**(-3)):
    node = {}

    # If we can split, find the best split by greedy algorithm
    if y.shape[0] >= 2:
      dloss, feature_index, split_value, split = GBTree.greedy_best_split(X, y, y_pred, loss)
      node["dloss"] = dloss
      node["feature_index"] = feature_index
      node["split_value"] = split_value

      # If there is a split and the stopping criterion is not met, branch 2 leaves
      if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0 and dloss >= tolerance and depth < max_depth:
        node["left"] = GBTree.fit_tree(X[split["I_left"]], y[split["I_left"]], y_pred[split["I_left"]], loss,
                                     depth = depth + 1, max_depth = max_depth, tolerance = tolerance)
        node["right"] = GBTree.fit_tree(X[split["I_right"]], y[split["I_right"]], y_pred[split["I_right"]], loss,
                                      depth = depth + 1, max_depth = max_depth, tolerance = tolerance) 
      # Terminal node
      else:
        node["w"] = loss.compute_weight(y, y_pred)
        node["left"] = None
        node["right"] = None
    # Terminal node    
    else:
      node["w"] = loss.compute_weight(y, y_pred)
      node["left"] = None
      node["right"] = None
    return node

  @staticmethod
  def predict_one(node, x):
    if node["left"] == None or node["right"] == None:
      return node["w"]
    else:
      if x[node["feature_index"]] <= node["split_value"]:
        return GBTree.predict_one(node["left"], x)
      else:
        return GBTree.predict_one(node["right"], x)

  @staticmethod
  def predict(node, loss, X):
    n_samples = X.shape[0]
    predictions = np.zeros(n_samples)
    for i in range(0, n_samples):
      predictions[i] = GBTree.predict_one(node, X[i])
    return predictions

  @staticmethod
  def print_tree(node, depth = 0):
    if node["left"] == None or node["right"] == None:
      print(f'{depth * "  "}weight: {node["w"]}')
    else:
      print(f'{depth * "  "}X{node["feature_index"]} <= {node["split_value"]}')
      GBTree.print_tree(node["left"], depth + 1)
      GBTree.print_tree(node["right"], depth + 1)

Finally, we construct the gradient boosting class.

class GradientBoost():
  def __init__(self, loss, learning_rate = 1, max_depth = 10, tolerance = 10**(-3), max_iter = 10):
    self.trees = []
    self.loss = loss
    self.learning_rate = learning_rate
    self.max_depth = max_depth
    self.tolerance = tolerance
    self.max_iter = max_iter

  def train(self, X, y):
    y_pred = np.zeros(y.shape[0])

    for i in range(0, self.max_iter):
      tree = GBTree.fit_tree(X, y, y_pred, self.loss,
                           max_depth = self.max_depth, tolerance = self.tolerance)
      self.trees.append(tree)
      y_pred = self.learning_rate * GBTree.predict(tree, self.loss, X)

      print("Training Loss:", self.loss.compute(y, self.predict(X)))

  def predict(self, X):
    n_samples = X.shape[0]
    
    y_pred = np.zeros(n_samples)
    for i in range(0, len(self.trees)):
      y_pred += self.learning_rate * GBTree.predict(self.trees[i], self.loss, X) 
    return self.loss.predTransform(y_pred)

We use the UCI optical handwritten digits dataset. Only the test set is used and the classification is reduced to only the 1's digit. Finally, we split this dataset into a training and test set. Here is a plot of some digits for reference.

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Convert to binary classification
y[y != 1] = 0

# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]

We train the gradient boosting classifier with the binary entropy loss class and report the prediction accuracy on the training and test sets respectively.

gb = GradientBoost(loss = BinaryEntropy())
gb.train(X_train, y_train)

print("Train accuracy:", 1/X_train.shape[0] * np.sum((np.round(gb.predict(X_train)) == y_train).astype(int)))
print("Test accuracy", 1/X_test.shape[0] * np.sum((np.round(gb.predict(X_test)) == y_test).astype(int)))

Training Loss: 0.1269280110429725
Training Loss: 0.04256623711865767
Training Loss: 0.011534388597255378
Training Loss: 0.003263137623967758
Training Loss: 0.0009067682950024066
Training Loss: 0.0002528152405255996
Training Loss: 7.038866445512709e-05
Training Loss: 1.9602627915503846e-05
Training Loss: 5.458568180889396e-06
Training Loss: 1.5200290017234747e-06
Train accuracy: 1.0
Test accuracy 0.9866666666666667

References.

http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785-794. DOI: https://doi.org/10.1145/2939672.2939785

Friedman, Jerome H., Trevor J. Hastie and Robert Tibshirani. “Additive Logistic Regression : a Statistical View of Boosting.” (1998).

Jerome H. Friedman. 2002. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 4 (February 2002), 367-378. DOI=http://dx.doi.org/10.1016/S0167-9473(01)00065-2

Tweets by austindavbrown

Austin David Brown Google scholar | Curriculum vitae | Github | arXiv | Linkedin | Posts | Categories

References.

Austin David Brown
Google scholar | Curriculum vitae | Github | arXiv | Linkedin | Posts | Categories