
Austin David Brown

Gradient Boosting in Python


Previously, we constructed regression trees and found that single trees predict sub-optimally. Gradient boosting is a supervised learning algorithm that addresses this by combining many trees. Originally, boosting was derived as an ensemble method of weak learners; later, Friedman derived gradient boosting as gradient descent in a function space. Friedman's excellent paper is the interpretation we will follow. We implement a naive gradient boosting algorithm for binary classification, following the second-order formulation used by XGBoost, which builds on Friedman's work. A softmax implementation would be more complicated; however, one could alternatively fit multiple binary gradient boosting classifiers to perform multiclass classification.

The exposition is derived from the XGBoost paper. Let $\mathcal{D}_n = (X_i, Y_i)_{i=1}^n$ be the i.i.d. training sample. We wish to find an $F \in \mathcal{F}$ that minimizes the loss $L : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_+$, where $\mathcal{F}$ is the function space of regression trees. Let $f$ be a tree, so that $f(x) = \sum_{l=1}^{L} w_l I(x \in R_l)$, where the partitions $(R_l)$ and weights $(w_l)$ correspond to the $L$ leaves. We approximate a gradient step by using the Taylor expansion

$$L(y, F_k + f) \approx L(y, F_k) + g^T f + \frac{1}{2} f^T H f,$$

where $g$ is the gradient vector and $H$ is the Hessian matrix, and solve for the optimal $f$, which yields $f = -H^{-1} g$. Now, we use this approximation to determine the weight computation and the splitting measure used to construct the tree $f$. We fit each leaf with a constant weight and approximate $H$ by a diagonal matrix $\operatorname{diag}_i(h_i)$. Let $I_l$ be the indices for leaf $l$. Then the optimal weight in leaf $l$ is

$$w_l = -\frac{\sum_{i \in I_l} g_i}{\sum_{i \in I_l} h_i},$$

and substituting the optimal weight into the leaf loss yields

$$L_l(y, F_k + f) = L(y, F_k) - \frac{1}{2} \frac{\left( \sum_{i \in I_l} g_i \right)^2}{\sum_{i \in I_l} h_i}.$$

Let $I, I_L, I_R$ be the indices of a node and of its left and right leaves respectively. We can ignore the constant term and, up to the factor $-\frac{1}{2}$, approximate the change in optimal loss between the parent node and the two leaves by

$$\frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i} - \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i} - \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i}.$$

This may need to be modified to a weighted average depending on the loss. Let $f_0$ be an initial approximation to $y$ and $\eta$ be the learning rate. We iteratively perform gradient descent in the space of regression trees by

$$L(y, F_{k+1}(x)) = L(y, F_k(x) + \eta f_k(x)).$$

The algorithm is:

Algorithm: Gradient Boost
  Choose an initial value $f_0$ and a learning rate $\eta$.
  Repeat until convergence:
    $f \leftarrow$ fit a regression tree to the approximated loss $L(y, f_k + f)$
    $f_{k+1} \leftarrow f_k + \eta f$
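To make the leaf weight and splitting measure concrete, here is a small numerical sketch (an addition to the post; the gradient and Hessian entries below are made up) that evaluates both formulas for one candidate split, using the same sign convention as the tree code later on:

import numpy as np

# Hypothetical per-sample gradients g_i and Hessian diagonal entries h_i at a node
g = np.array([0.4, -0.3, 0.25, -0.5, 0.1])
h = np.array([0.24, 0.21, 0.19, 0.25, 0.09])

def leaf_weight(g, h):
  # Optimal constant leaf weight: -G / H
  return -np.sum(g) / np.sum(h)

def leaf_loss(g, h):
  # Approximated optimal loss contribution of a leaf: -1/2 * G^2 / H
  return -1/2 * np.sum(g)**2 / np.sum(h)

# Candidate split: the first two samples go left, the rest go right
I_L, I_R = np.arange(2), np.arange(2, 5)
dloss = leaf_loss(g, h) - leaf_loss(g[I_L], h[I_L]) - leaf_loss(g[I_R], h[I_R])
print(leaf_weight(g, h))  # optimal weight if the node were kept as a leaf
print(dloss)              # a positive change favors making the split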

For this implementation, we use the two-class entropy loss with a sigmoid transformation $\sigma : \mathbb{R} \to [0, 1]$:

$$H(y, F) = -\sum_{i=1}^n \left[ y_i \log(\sigma(F_i)) + (1 - y_i) \log(1 - \sigma(F_i)) \right].$$

The gradient and the diagonal Hessian approximation are

$$g = \sigma(F) - y, \qquad \operatorname{diag}(h_i) = \operatorname{diag}\left( \sigma(F_i)(1 - \sigma(F_i)) \right).$$
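As a quick sanity check (not part of the original post), the gradient formula $g = \sigma(F) - y$ can be verified against a central finite difference of the loss:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def entropy_loss(y, F):
  # Two-class entropy loss of raw scores F, summed over the sample
  p = sigmoid(F)
  return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.3, -0.2, 1.5])
eps = 10**(-6)
for i in range(F.shape[0]):
  e = np.zeros(F.shape[0])
  e[i] = eps
  fd = (entropy_loss(y, F + e) - entropy_loss(y, F - e)) / (2 * eps)
  print(fd, sigmoid(F[i]) - y[i])  # the two values should agree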

We put this theory together in a naive Python implementation, borrowing some ideas from the structure of the XGBoost source. First, we implement a class for the loss function.
import numpy as np
import matplotlib.pyplot as plt

class BinaryEntropy():
  @staticmethod
  def sigmoid(x):
    # Equivalent to exp(x) / (1 + exp(x)) but avoids overflow for large positive x
    return 1 / (1 + np.exp(-x))

  # Gradient of the entropy loss: sigma(F) - y
  @staticmethod
  def g(y_1, y_2):
    return BinaryEntropy.sigmoid(y_2) - y_1

  # Hessian diagonal: sigma(F) * (1 - sigma(F))
  @staticmethod
  def H_diag(y_1, y_2):
    return BinaryEntropy.sigmoid(y_2) * (1 - BinaryEntropy.sigmoid(y_2))

  # Averaged entropy loss of predicted probabilities
  @staticmethod
  def compute(y, p_pred):
    return -1/y.shape[0] * (np.sum(y * np.log(p_pred) + (1 - y) * (np.log(1 - p_pred))))

  # Approximated optimal loss of a leaf: -1/2 * G^2 / H
  @staticmethod
  def approx(y, y_pred):
    eps = 10**(-16)
    return -1/2 * 1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred))**2

  # Optimal leaf weight: -G / H
  @staticmethod
  def compute_weight(y, y_pred):
    eps = 10**(-16)
    return -1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred))

  # Transform a raw score into a probability
  @staticmethod
  def predTransform(w):
    return BinaryEntropy.sigmoid(w)
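As a quick usage sketch (an addition; the toy labels below are made up), with all raw scores at zero the averaged loss is $\log 2 \approx 0.693$ and the optimal constant weight shifts the scores toward the majority class:

y_toy = np.array([1.0, 0.0, 1.0, 1.0])
F_toy = np.zeros(4)  # raw scores before the sigmoid transform

print(BinaryEntropy.compute(y_toy, BinaryEntropy.predTransform(F_toy)))  # log(2) ~ 0.693
print(BinaryEntropy.compute_weight(y_toy, F_toy))  # 1.0, pushing toward the majority class 1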
Next, we modify the previous regression tree code to utilize the new loss class.
class GBTree():
  @staticmethod
  def split_data(X, y, feature_index, feature_value):
    return {
      "I_left": np.where(X[:, feature_index] <= feature_value)[0],
      "I_right": np.where(X[:, feature_index] > feature_value)[0],
    }

  @staticmethod
  def greedy_best_split(X, y, y_pred, loss):
    best_feature_index = 0
    best_split_value = 0
    best_dloss = 0
    best_split = {
      "I_left": np.array([]),
      "I_right": np.array([]),
    }

    n_features = X.shape[1]
    parent_loss = loss.approx(y, y_pred)
    for feature_index in range(0, n_features):
      split_values = np.unique(X[:, feature_index])
      for split_value in split_values:
        split = GBTree.split_data(X, y, feature_index, split_value)

        # If there is a split
        if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0:
          # Compute loss change and update if the change in loss is the best so far
          dloss = parent_loss - loss.approx(y[split["I_left"]], y_pred[split["I_left"]]) - loss.approx(y[split["I_right"]], y_pred[split["I_right"]])
          if dloss >= best_dloss:
            best_feature_index = feature_index
            best_split_value = split_value
            best_split = split
            best_dloss = dloss
    return best_dloss, best_feature_index, best_split_value, best_split

  @staticmethod
  def fit_tree(X, y, y_pred, loss, depth = 1, 
               max_depth = 100, tolerance = 10**(-3)):
    node = {}

    # If we can split, find the best split by greedy algorithm
    if y.shape[0] >= 2:
      dloss, feature_index, split_value, split = GBTree.greedy_best_split(X, y, y_pred, loss)
      node["dloss"] = dloss
      node["feature_index"] = feature_index
      node["split_value"] = split_value

      # If there is a split and the stopping criterion is not met, branch 2 leaves
      if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0 and dloss >= tolerance and depth < max_depth:
        node["left"] = GBTree.fit_tree(X[split["I_left"]], y[split["I_left"]], y_pred[split["I_left"]], loss,
                                     depth = depth + 1, max_depth = max_depth, tolerance = tolerance)
        node["right"] = GBTree.fit_tree(X[split["I_right"]], y[split["I_right"]], y_pred[split["I_right"]], loss,
                                      depth = depth + 1, max_depth = max_depth, tolerance = tolerance) 
      # Terminal node
      else:
        node["w"] = loss.compute_weight(y, y_pred)
        node["left"] = None
        node["right"] = None
    # Terminal node    
    else:
      node["w"] = loss.compute_weight(y, y_pred)
      node["left"] = None
      node["right"] = None
    return node

  @staticmethod
  def predict_one(node, x):
    if node["left"] == None or node["right"] == None:
      return node["w"]
    else:
      if x[node["feature_index"]] <= node["split_value"]:
        return GBTree.predict_one(node["left"], x)
      else:
        return GBTree.predict_one(node["right"], x)

  @staticmethod
  def predict(node, loss, X):
    n_samples = X.shape[0]
    predictions = np.zeros(n_samples)
    for i in range(0, n_samples):
      predictions[i] = GBTree.predict_one(node, X[i])
    return predictions

  @staticmethod
  def print_tree(node, depth = 0):
    if node["left"] == None or node["right"] == None:
      print(f'{depth * "  "}weight: {node["w"]}')
    else:
      print(f'{depth * "  "}X{node["feature_index"]} <= {node["split_value"]}')
      GBTree.print_tree(node["left"], depth + 1)
      GBTree.print_tree(node["right"], depth + 1)
Finally, we construct the gradient boosting class.
class GradientBoost():
  def __init__(self, loss, learning_rate = 1, max_depth = 10, tolerance = 10**(-3), max_iter = 10):
    self.trees = []
    self.loss = loss
    self.learning_rate = learning_rate
    self.max_depth = max_depth
    self.tolerance = tolerance
    self.max_iter = max_iter

  def train(self, X, y):
    y_pred = np.zeros(y.shape[0])

    for i in range(0, self.max_iter):
      tree = GBTree.fit_tree(X, y, y_pred, self.loss,
                           max_depth = self.max_depth, tolerance = self.tolerance)
      self.trees.append(tree)
      y_pred += self.learning_rate * GBTree.predict(tree, self.loss, X)  # accumulate: F_{k+1} = F_k + eta * f

      print("Training Loss:", self.loss.compute(y, self.predict(X)))

  def predict(self, X):
    n_samples = X.shape[0]
    
    y_pred = np.zeros(n_samples)
    for i in range(0, len(self.trees)):
      y_pred += self.learning_rate * GBTree.predict(self.trees[i], self.loss, X) 
    return self.loss.predTransform(y_pred)
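As a smoke test (an addition; the same made-up data as above), a few boosting iterations should drive the predicted probabilities toward the labels:

X_toy = np.array([[0.1], [0.2], [0.8], [0.9]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

gb_toy = GradientBoost(loss = BinaryEntropy(), max_iter = 3)
gb_toy.train(X_toy, y_toy)
print(gb_toy.predict(X_toy))  # probabilities close to [0, 0, 1, 1]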
We use the UCI optical handwritten digits dataset. Only the test set is used, and the classification is reduced to a binary problem: the digit 1 versus all other digits. Finally, we split this dataset into a training set and a test set.

[Figure: a plot of some example digits for reference.]

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Convert to binary classification
y[y != 1] = 0

# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]
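The split above is random; for a reproducible split, one could seed the generator before calling np.random.choice (an addition, not in the original):

np.random.seed(0)  # any fixed seed makes the train/test split reproducible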
We train the gradient boosting classifier with the binary entropy loss class and report the prediction accuracy on the training and test sets respectively.
gb = GradientBoost(loss = BinaryEntropy())
gb.train(X_train, y_train)

print("Train accuracy:", 1/X_train.shape[0] * np.sum((np.round(gb.predict(X_train)) == y_train).astype(int)))
print("Test accuracy", 1/X_test.shape[0] * np.sum((np.round(gb.predict(X_test)) == y_test).astype(int)))
Training Loss: 0.1269280110429725
Training Loss: 0.04256623711865767
Training Loss: 0.011534388597255378
Training Loss: 0.003263137623967758
Training Loss: 0.0009067682950024066
Training Loss: 0.0002528152405255996
Training Loss: 7.038866445512709e-05
Training Loss: 1.9602627915503846e-05
Training Loss: 5.458568180889396e-06
Training Loss: 1.5200290017234747e-06
Train accuracy: 1.0
Test accuracy: 0.9866666666666667

References.

Optical Recognition of Handwritten Digits Data Set. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785-794. DOI: https://doi.org/10.1145/2939672.2939785

Friedman, Jerome H., Trevor J. Hastie, and Robert Tibshirani. "Additive Logistic Regression: A Statistical View of Boosting." (1998).

Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 5 (October 2001), 1189-1232. DOI: https://doi.org/10.1214/aos/1013203451

Jerome H. Friedman. 2002. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 4 (February 2002), 367-378. DOI: http://dx.doi.org/10.1016/S0167-9473(01)00065-2