The exposition is derived from the XGBoost paper. Let Dn=(Xi,Yi)ni=1 be the i.i.d training sample. We wish to find an F∈F that minimizes the loss L:Rn×Rn→R+ where F is the function space of regression trees. Let f be a tree so that f(x)=∑Ll=1wlI(x∈Rl) where partitions (Rl) and weights (wl) correspond to the number of leaves L. We approximate a gradient step by using the Taylor expansion L(y,Fk+f)≈L(y,Fk)+gTf+12fTHf where g is the gradient vector and H is the Hessian matrix and solve for the optimial f which yields f=−H−1g. Now, we use this approximation to determine the weight computation and the splitting measure to construct the tree f. We fit each leaf with a constant weight and approximate H by a diagonal matrix diagi(hi). Let Il be the indices for the leaf l. Then the optimal weight in leaf l is w∗l=−∑i∈Ilgi∑i∈Ilhi and substituting for the optimal leaf loss yields L∗l(y,F+f)=L(y,Fk)−12(∑i∈Ilgi)2∑i∈Ilhi. Let I,IR,IL be the indices of a node and its left and right leaves respectively. We can ignore the constant term and approximate the change in optimal loss between the parent node and the two leaves by (∑i∈Igi)2∑i∈Ihi−(∑i∈ILgi)2∑i∈ILhi−(∑i∈IRgi)2∑i∈IRhi This may need to be modified to a weighted average depending on the loss. Let f0 be an initial approximation to y and η be the learning rate. We iteratively perform gradient descent in the space of regression trees by L(y,Fk+1(x))=L(y,Fk(x)+ηfk(x)). The algorithm is
For this implementation, we use the 2 class entropy loss with a sigmoid transformation σ:R→[0,1]: H(y,F)=n∑i=1yilog(σ(Fi))+(1−yi)log(1−σ(Fi)). The gradient and diagonal Hessian approximation are g=σ(F)−ydiag(hi)=diag(σ(Fi)(1−σ(Fi))).
We put this theory together in a naive implementation with Python and using some of the ideas from the structure of the XGBoost source. First, we implement a class for the loss function.
import numpy as np import matplotlib.pyplot as plt class BinaryEntropy(): @staticmethod def sigmoid(x): return np.exp(x) / (1 + np.exp(x)) # gradient @staticmethod def g(y_1, y_2): return BinaryEntropy.sigmoid(y_2) - y_1 # Hessian diagonal @staticmethod def H_diag(y_1, y_2): return BinaryEntropy.sigmoid(y_2) * (1 - BinaryEntropy.sigmoid(y_2)) @staticmethod def compute(y, p_pred): return -1/y.shape[0] * (np.sum(y * np.log(p_pred) + (1 - y) * (np.log(1 - p_pred)))) @staticmethod def approx(y, y_pred): eps = 10**(-16) return -1/2 * 1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred))**2 @staticmethod def compute_weight(y, y_pred): eps = 10**(-16) return -1/max(np.sum(BinaryEntropy.H_diag(y, y_pred)), eps) * np.sum(BinaryEntropy.g(y, y_pred)) @staticmethod def predTransform(w): return BinaryEntropy.sigmoid(w)Next, we modify the previous regression tree code to utilize the new loss class.
class GBTree(): @staticmethod def split_data(X, y, feature_index, feature_value): return { "I_left": np.where(X[:, feature_index] <= feature_value)[0], "I_right": np.where(X[:, feature_index] > feature_value)[0], } @staticmethod def greedy_best_split(X, y, y_pred, loss): best_feature_index = 0 best_split_value = 0 best_dloss = 0 best_split = { "I_left": np.array([]), "I_right": np.array([]), } n_features = X.shape[1] parent_loss = loss.approx(y, y_pred) for feature_index in range(0, n_features): split_values = np.unique(X[:, feature_index]) for split_value in split_values: split = GBTree.split_data(X, y, feature_index, split_value) # If there is a split if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0: # Compute loss change and update if the change in loss is the best so far dloss = parent_loss - loss.approx(y[split["I_left"]], y_pred[split["I_left"]]) - loss.approx(y[split["I_right"]], y_pred[split["I_right"]]) if dloss >= best_dloss: best_feature_index = feature_index best_split_value = split_value best_split = split best_dloss = dloss return best_dloss, best_feature_index, best_split_value, best_split @staticmethod def fit_tree(X, y, y_pred, loss, depth = 1, max_depth = 100, tolerance = 10**(-3)): node = {} # If we can split, find the best split by greedy algorithm if y.shape[0] >= 2: dloss, feature_index, split_value, split = GBTree.greedy_best_split(X, y, y_pred, loss) node["dloss"] = dloss node["feature_index"] = feature_index node["split_value"] = split_value # If there is a split and the stopping criterion is not met, branch 2 leaves if split["I_left"].shape[0] > 0 and split["I_right"].shape[0] > 0 and dloss >= tolerance and depth < max_depth: node["left"] = GBTree.fit_tree(X[split["I_left"]], y[split["I_left"]], y_pred[split["I_left"]], loss, depth = depth + 1, max_depth = max_depth, tolerance = tolerance) node["right"] = GBTree.fit_tree(X[split["I_right"]], y[split["I_right"]], y_pred[split["I_right"]], loss, depth = depth + 1, max_depth = max_depth, tolerance = tolerance) # Terminal node else: node["w"] = loss.compute_weight(y, y_pred) node["left"] = None node["right"] = None # Terminal node else: node["w"] = loss.compute_weight(y, y_pred) node["left"] = None node["right"] = None return node @staticmethod def predict_one(node, x): if node["left"] == None or node["right"] == None: return node["w"] else: if x[node["feature_index"]] <= node["split_value"]: return GBTree.predict_one(node["left"], x) else: return GBTree.predict_one(node["right"], x) @staticmethod def predict(node, loss, X): n_samples = X.shape[0] predictions = np.zeros(n_samples) for i in range(0, n_samples): predictions[i] = GBTree.predict_one(node, X[i]) return predictions @staticmethod def print_tree(node, depth = 0): if node["left"] == None or node["right"] == None: print(f'{depth * " "}weight: {node["w"]}') else: print(f'{depth * " "}X{node["feature_index"]} <= {node["split_value"]}') GBTree.print_tree(node["left"], depth + 1) GBTree.print_tree(node["right"], depth + 1)Finally, we construct the gradient boosting class.
class GradientBoost(): def __init__(self, loss, learning_rate = 1, max_depth = 10, tolerance = 10**(-3), max_iter = 10): self.trees = [] self.loss = loss self.learning_rate = learning_rate self.max_depth = max_depth self.tolerance = tolerance self.max_iter = max_iter def train(self, X, y): y_pred = np.zeros(y.shape[0]) for i in range(0, self.max_iter): tree = GBTree.fit_tree(X, y, y_pred, self.loss, max_depth = self.max_depth, tolerance = self.tolerance) self.trees.append(tree) y_pred = self.learning_rate * GBTree.predict(tree, self.loss, X) print("Training Loss:", self.loss.compute(y, self.predict(X))) def predict(self, X): n_samples = X.shape[0] y_pred = np.zeros(n_samples) for i in range(0, len(self.trees)): y_pred += self.learning_rate * GBTree.predict(self.trees[i], self.loss, X) return self.loss.predTransform(y_pred)We use the UCI optical handwritten digits dataset. Only the test set is used and the classification is reduced to only the 1's digit. Finally, we split this dataset into a training and test set. Here is a plot of some digits for reference.
test = np.loadtxt("data/optdigits_test.txt", delimiter = ",") X = test[:, 0:64] y = test[:, 64] # Convert to binary classification y[y != 1] = 0 # Train/test split n_samples = X.shape[0] n_TRAIN = int(.75 * n_samples) I = np.arange(0, n_samples) TRAIN = np.random.choice(I, n_TRAIN, replace = False) TEST = np.setdiff1d(I, TRAIN) X_train = X[TRAIN, :] y_train = y[TRAIN] X_test = X[TEST, :] y_test = y[TEST]We train the gradient boosting classifier with the binary entropy loss class and report the prediction accuracy on the training and test sets respectively.
gb = GradientBoost(loss = BinaryEntropy()) gb.train(X_train, y_train) print("Train accuracy:", 1/X_train.shape[0] * np.sum((np.round(gb.predict(X_train)) == y_train).astype(int))) print("Test accuracy", 1/X_test.shape[0] * np.sum((np.round(gb.predict(X_test)) == y_test).astype(int)))
Training Loss: 0.1269280110429725 Training Loss: 0.04256623711865767 Training Loss: 0.011534388597255378 Training Loss: 0.003263137623967758 Training Loss: 0.0009067682950024066 Training Loss: 0.0002528152405255996 Training Loss: 7.038866445512709e-05 Training Loss: 1.9602627915503846e-05 Training Loss: 5.458568180889396e-06 Training Loss: 1.5200290017234747e-06 Train accuracy: 1.0 Test accuracy 0.9866666666666667
References.
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+DigitsTianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785-794. DOI: https://doi.org/10.1145/2939672.2939785
Friedman, Jerome H., Trevor J. Hastie and Robert Tibshirani. “Additive Logistic Regression : a Statistical View of Boosting.” (1998).
Jerome H. Friedman. 2002. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 4 (February 2002), 367-378. DOI=http://dx.doi.org/10.1016/S0167-9473(01)00065-2
Tweets by austindavbrown