Austin David Brown

K-fold Cross-Validation in Python


K-fold cross-validation has many different formulations and is mainly used for 2 things:
  1. Comparing learning algorithms
  2. Hyper-parameter tuning for a single learning algorithm
The goal is to approximate the prediction error of a learning algorithm. The main idea is to construct a K-partition of the data so that a method's prediction is evaluated on a single partition but trained on all remaining partitions. Typically, $K$ is chosen to be $5$ or $10$, and there is no general theoretical best $K$ to choose. Let $A_1, \ldots, A_M$ be learning algorithms and $(X_i, Y_i)$ be an i.i.d. sample of the data. Cross-validation can be used to compare different algorithms, such as when $A_1$ is a neural network and $A_2$ is a gradient boosted tree. Alternatively, cross-validation can be used to tune hyper-parameters, such as $A_{\lambda_1}, \ldots, A_{\lambda_M}$ corresponding to the $\lambda$ tuning parameter in Lasso regression. The hyper-parameters can be evaluated over a grid or randomly sampled. If there are multiple tuning parameters, the grid grows exponentially in the dimension, so random sampling is used. The algorithm corresponding to a single partition of the data is the following.

Algorithm: K-fold Cross-validation

Given learning algorithms $A_1, \ldots, A_M$ and $K$ as the number of folds:
  1. Partition the data into $K$ approximately equal parts $\mathcal{P}_1, \ldots, \mathcal{P}_K$.
  2. (In parallel or not) for $k \in 1, \ldots, K$:
       (In parallel or not) for $i \in 1, \ldots, M$:
         $f_{A_i} \lvert_{\mathcal{P}_{-k}} \leftarrow$ train method $A_i$ on all of the partitions except the $k$th.
         $L(y, f_{A_i} \lvert_{\mathcal{P}_{-k}}(X)) \rvert_{\mathcal{P}_{k}} \leftarrow$ compute the loss over the $k$th partition.
  3. For $i \in 1, \ldots, M$:
       $L_{CV}(y, f_{A_i}(X)) \leftarrow \frac{1}{K} \sum_{k = 1}^K L(y, f_{A_i} \lvert_{\mathcal{P}_{-k}}(X)) \rvert_{\mathcal{P}_{k}}$
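To make the procedure concrete, here is a minimal Python sketch of the algorithm above. The `algorithms` list, the `loss` function, and the `fit`/`predict` interface are hypothetical placeholders and not the implementation used later in this post; any collection of Scikit-learn style estimators would fit this pattern.

import numpy as np

def k_fold_cv(X, y, algorithms, loss, K = 5):
    # Permute the indices and split them into K approximately equal partitions.
    partitions = np.array_split(np.random.permutation(X.shape[0]), K)
    cv_losses = np.zeros((K, len(algorithms)))
    for k, partition in enumerate(partitions):
        # Train on all partitions except the kth and evaluate on the kth.
        train = np.delete(np.arange(0, X.shape[0]), partition)
        for i, A in enumerate(algorithms):
            f = A.fit(X[train], y[train])
            cv_losses[k, i] = loss(y[partition], f.predict(X[partition]))
    # Average each algorithm's loss over the K folds.
    return np.mean(cv_losses, axis = 0)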

This algorithm is parallel by construction, and its computational complexity depends on the complexity of the training algorithms. K-fold cross-validation can also be repeated multiple times, which is called repeated K-fold cross-validation: the partitions $\mathcal{P}_1, \ldots, \mathcal{P}_K$ are resampled by permuting the dataset, cross-validation is run on each permutation, and the cross-validation errors are averaged.
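Under the same assumptions as the hypothetical k_fold_cv sketch above, repeated K-fold cross-validation is only a short extension: each call re-permutes the data before partitioning, so we simply average over repetitions.

def repeated_k_fold_cv(X, y, algorithms, loss, K = 5, repeats = 10):
    # k_fold_cv permutes the data on every call, so each repetition uses new partitions.
    return np.mean([k_fold_cv(X, y, algorithms, loss, K) for _ in range(repeats)], axis = 0)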

Let us modify a nice argument by Larry Wasserman, who always posts excellent lecture notes. Let $(Y_i, X_i)_{i = 1}^{nK}$ be an i.i.d. sample, so that each of the $K$ partitions has $n$ points. Let $A_1, \ldots, A_M$ be the learning algorithms, and let $f_{A_m} \lvert_{\mathcal{D}}$ represent method $A_m$ fit to a sample $\mathcal{D}$ of size $n(K - 1)$, the size of each training set $\mathcal{P}_{-k}$. Our goal is to estimate $L(y, f_{A^*} \lvert_{\mathcal{D}}(x))$, where $A^*$ is the learning algorithm with the minimum loss or prediction error. Now, the cross-validation estimate is unbiased, since the partition used to compute the prediction error is held out of training:
\[
E \left( \frac{1}{K} \sum_{k = 1}^K \frac{1}{n} \sum_{i \in \mathcal{P}_k} L(y_i, f_{A} \lvert_{\mathcal{P}_{-k}}(x_i)) \right) = E \, L(y, f_{A} \lvert_{\mathcal{D}}(x)).
\]
Suppose $\left| L(y, f_{A_m} \lvert_{\mathcal{D}}(x)) \right| \le R$. Now, maximizing the negative loss is the same as minimizing the loss, and for me it is easier to use Hoeffding's inequality with the maximum. Using a union bound along with Hoeffding's inequality, we have
\[
P\left( \max_{1 \le m \le M} \left| \frac{1}{K} \sum_{k = 1}^K \frac{1}{n} \sum_{i \in \mathcal{P}_k} \left( -L(y_i, f_{A_m} \lvert_{\mathcal{P}_{-k}}(x_i)) \right) - \left( -E \, L(y, f_{A_m} \lvert_{\mathcal{D}}(x)) \right) \right| \ge \epsilon \right) \le 2 K M \exp\left( \frac{-n \epsilon^2}{R^2} \right).
\]
Choose $\epsilon = \sqrt{\frac{R^2}{n} \log\left( \frac{2KM}{\alpha} \right)}$. Using the fact that the difference of maximums is bounded by the maximum of the differences, with probability at least $1 - \alpha$ we have
\[
\min_{1 \le m \le M} L_{CV}(y, f_{A_m}(X)) - \min_{1 \le m \le M} L(y, f_{A_m}(X)) \le \sqrt{\frac{R^2}{n} \log\left( \frac{2KM}{\alpha} \right)}.
\]
Thus, the minimum cross-validation error over all of the learning algorithms is close to the minimum prediction error over all of the learning algorithms with high probability. Of course, this bound can be improved.
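To get a sense of the scale of this bound, we can plug in some illustrative values (chosen only for the sake of example, not taken from any dataset), say a loss bounded by $R = 1$, folds of size $n = 1000$, $K = 5$ folds, $M = 10$ algorithms, and $\alpha = 0.05$:

import numpy as np

# Illustrative values only: bounded loss R, points per fold n, K folds, M algorithms.
R, n, K, M, alpha = 1.0, 1000, 5, 10, 0.05
epsilon = np.sqrt(R**2 / n * np.log(2 * K * M / alpha))
print(epsilon)  # roughly 0.087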

We now implement a naive parallel K-fold cross-validation algorithm in Python to tune the "C" parameter of a support vector machine. We use Scikit-learn for the SVM algorithm and the test set from the UCI optical handwritten digits dataset. First, we load the dataset and plot some of the digits.
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
from sklearn import svm

# Load the test set; the first 64 columns are the 8x8 pixel values
# and the last column is the digit label.
test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8, 8)), cmap = "Greys", vmin = 0, vmax = 16)
fig.tight_layout()
plt.show()

We implement a parallel 5-fold cross-validation in Python. We parallelize over the folds by writing a function that performs the cross-validation work for a single fold $k$.
def cvK(X, y, tuning_params, partitions, k):
  n_tuning_params = tuning_params.shape[0]

  # The kth partition is the test set; the remaining partitions form the training set.
  partition = partitions[k]
  TRAIN = np.delete(np.arange(0, X.shape[0]), partition)
  TEST = partition
  X_train = X[TRAIN, :]
  y_train = y[TRAIN]

  X_test = X[TEST, :]
  y_test = y[TEST]

  # Fit a linear SVM for each tuning parameter and record the accuracy on the held-out fold.
  accuracies = np.zeros(n_tuning_params)
  for i in range(0, n_tuning_params):
    svc = svm.SVC(C = tuning_params[i], kernel = "linear")
    accuracies[i] = svc.fit(X_train, y_train).score(X_test, y_test)
  return accuracies

We then run the per-fold function in parallel over the $K$ folds and report the best tuning parameter chosen by the cross-validation algorithm.
K = 5
tuning_params = np.logspace(-6, -1, 10)
# Permute the indices and split them into K approximately equal partitions.
partitions = np.array_split(np.random.permutation(X.shape[0]), K)

# Run the cross-validation for each fold in parallel.
pool = Pool()
args = [(X, y, tuning_params, partitions, k) for k in range(0, K)]
Accuracies = np.array(pool.starmap(cvK, args))
pool.close()

# Average the accuracies over the folds and choose the best tuning parameter.
CV_accuracy = np.mean(Accuracies, axis = 0)
best_tuning_param = tuning_params[np.argmax(CV_accuracy)]
print(best_tuning_param)
0.0021544346900318843
which is the best tuning parameter chosen using 5-fold cross-validation.
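A natural follow-up, not part of the snippet above, is to refit the SVM on the full dataset with the selected value of C before using it for prediction; a minimal sketch:

# Refit on all of the data with the tuning parameter chosen by cross-validation.
final_svc = svm.SVC(C = best_tuning_param, kernel = "linear").fit(X, y)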

References.

Larry Wasserman. Lecture 16 notes, Statistics 705, Carnegie Mellon University. http://www.stat.cmu.edu/~larry/=stat705/Lecture16.pdf

Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables". Journal of the American Statistical Association. 58 (301): 13–30. doi:10.1080/01621459.1963.10500830. JSTOR 2282952. MR 0144363.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.