- Comparing learning algorithms
- Hyper-parameter tuning for a single learning algorithm
This algorithm is parallel by construction, and its computational complexity depends on the complexity of the training algorithm. K-fold cross-validation can also be repeated multiple times, which is called repeated K-fold cross-validation: typically, the partitions $\mathcal{P}_1, \ldots, \mathcal{P}_K$ are resampled by permuting the dataset, cross-validation is run on each permutation, and the resulting cross-validation errors are averaged.
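As a rough illustration, here is a minimal sketch of repeated K-fold cross-validation. It assumes a user-supplied function `cv_error(X, y, partitions)` (hypothetical, not defined in this post) that returns the K-fold cross-validation error for one partition of the row indices:

```python
import numpy as np

def repeated_cv_error(X, y, K, n_repeats, cv_error):
    # cv_error(X, y, partitions) is assumed to return the K-fold
    # cross-validation error for one list of K held-out index sets.
    errors = np.zeros(n_repeats)
    for r in range(n_repeats):
        # Permute the dataset and form a fresh partition on each repetition
        partitions = np.array_split(np.random.permutation(X.shape[0]), K)
        errors[r] = cv_error(X, y, partitions)
    # Average the cross-validation errors over the repetitions
    return np.mean(errors)
```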
Let us modify a nice argument by Larry Wasserman, who always posts excellent lecture notes. Let $(Y_i, X_i)_{i = 1}^{nK}$ be an i.i.d. sample, so that each partition $\mathcal{P}_k$ contains $n$ observations. Let $A_1, \ldots, A_M$ be the learning algorithms, and let $f_{A_m} \lvert_{\mathcal{D}}$ denote the predictor produced by fitting algorithm $A_m$ to a sample $\mathcal{D}$. Our goal is to estimate $L(y, f_{A^*} \lvert_{\mathcal{D}}(x))$, where $A^*$ is the learning algorithm with the minimum prediction error. The cross-validation estimate is unbiased, since each partition is held out of the data used to fit the predictor:
\[
E \left( \frac{1}{K} \sum_{k = 1}^K \frac{1}{n} \sum_{i \in \mathcal{P}_k} L\big(y_i, f_{A} \lvert_{\mathcal{D} \setminus \mathcal{P}_k}(x_i)\big) \right) = E\, L\big(y, f_{A} \lvert_{\mathcal{D}}(x)\big).
\]

Suppose $\left| L(y, f_{A_m} \lvert_{\mathcal{D}}(x)) \right| \le R$ for every algorithm. Maximizing the negative loss is the same as minimizing the loss, and for me it is easier to use Hoeffding's inequality with the maximum, so we work with the negative loss. Using a union bound along with Hoeffding's inequality, we have
\[
P\left( \max_{1 \le m \le M} \left| \frac{1}{K} \sum_{k = 1}^K \frac{1}{n} \sum_{i \in \mathcal{P}_k} \big(-L(y_i, f_{A_m} \lvert_{\mathcal{D} \setminus \mathcal{P}_k}(x_i))\big) - \big(-E\, L(y, f_{A_m} \lvert_{\mathcal{D}}(x))\big) \right| \ge \epsilon \right) \le 2 K M \exp\left(\frac{-n \epsilon^2}{R^2}\right).
\]
Choose $\epsilon = \sqrt{\frac{R^2}{n} \log\left( \frac{2KM}{\alpha} \right)}$ so that the right-hand side equals $\alpha$. Using the fact that the difference of maxima is at most the maximum of the differences, with probability at least $1 - \alpha$ we have
\[
\min_{1 \le m \le M} L_{CV}(y, f_{A_m}(X)) - \min_{1 \le m \le M} L(y, f_{A_m}(X)) \le \sqrt{\frac{R^2}{n} \log\left( \frac{2KM}{\alpha} \right)}.
\]
Thus the minimum cross-validation error over all of the learning algorithms is close to the minimum prediction error over all of the learning algorithms with high probability. Of course, this bound can be improved.
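To get a sense of the size of this bound, here is a worked example with illustrative values that are not from the original argument: take $R = 1$, $n = 1000$, $K = 5$, $M = 10$, and $\alpha = 0.05$. Then
\[
\epsilon = \sqrt{\frac{R^2}{n} \log\left( \frac{2KM}{\alpha} \right)} = \sqrt{\frac{1}{1000} \log\left( \frac{100}{0.05} \right)} = \sqrt{\frac{\log 2000}{1000}} \approx 0.09,
\]
so with probability at least $0.95$ the minimum cross-validation error is within roughly $0.09$ of the minimum prediction error.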
We implement a naive parallel K-fold cross-validation algorithm in Python to tune the "C" parameter of a support vector machine. We use Scikit-learn for the SVM algorithm and the test set from the UCI optical handwritten digits dataset. We load the dataset and plot some of the digits.
```python
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
from sklearn import svm

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8, 8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()
```
We implement a parallel 5-fold cross-validation in Python. We parallelize over the folds by putting the work for each fold $k$ into the following function.
```python
def cvK(X, y, tuning_params, partitions, k):
    n_tuning_params = tuning_params.shape[0]
    partition = partitions[k]

    # Hold out the k-th partition as the test set and train on the remaining indices
    TRAIN = np.delete(np.arange(0, X.shape[0]), partition)
    TEST = partition
    X_train = X[TRAIN, :]
    y_train = y[TRAIN]
    X_test = X[TEST, :]
    y_test = y[TEST]

    # Evaluate each tuning parameter on this fold
    accuracies = np.zeros(n_tuning_params)
    for i in range(0, n_tuning_params):
        svc = svm.SVC(C = tuning_params[i], kernel = "linear")
        accuracies[i] = svc.fit(X_train, y_train).score(X_test, y_test)
    return accuracies
```

We then run each $k$ part of the cross-validation function in parallel and report the best tuning parameter from the cross-validation algorithm.
```python
K = 5
tuning_params = np.logspace(-6, -1, 10)

# Randomly partition the row indices into K folds
partitions = np.array_split(np.random.permutation([i for i in range(0, X.shape[0])]), K)

# Evaluate the folds in parallel
pool = Pool()
args = [(X, y, tuning_params, partitions, k) for k in range(0, K)]
Accuracies = np.array(pool.starmap(cvK, args))
pool.close()

# Average the fold accuracies and choose the tuning parameter with the best CV accuracy
CV_accuracy = np.mean(Accuracies, axis = 0)
best_tuning_param = tuning_params[np.argmax(CV_accuracy)]
print(best_tuning_param)
```
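One practical caveat not noted in the original post: when this script is run with the spawn start method (the default on Windows, and on macOS in recent Python versions), `multiprocessing` re-imports the module in each worker, so the `Pool` creation should sit under an `if __name__ == "__main__":` guard, for example:

```python
if __name__ == "__main__":
    pool = Pool()
    args = [(X, y, tuning_params, partitions, k) for k in range(0, K)]
    Accuracies = np.array(pool.starmap(cvK, args))
    pool.close()
```

On Linux, where the default start method is fork, the code runs as written.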
The printed output is `0.0021544346900318843`, which is the best tuning parameter chosen using 5-fold cross-validation.
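With the tuning parameter selected, a natural next step is to refit the model on all of the data. This is a minimal sketch, not part of the original post, reusing the `svm` import and the variables defined above:

```python
# Refit the linear SVM on the full dataset using the selected tuning parameter
final_svc = svm.SVC(C = best_tuning_param, kernel = "linear")
final_svc.fit(X, y)

# In-sample accuracy of the refit model; the cross-validation accuracy above
# is the better estimate of out-of-sample performance
print(final_svc.score(X, y))
```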
References.
Larry Wasserman. Lecture notes for Stat 705, Lecture 16. http://www.stat.cmu.edu/~larry/=stat705/Lecture16.pdf

Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables". Journal of the American Statistical Association. 58 (301): 13–30. doi:10.1080/01621459.1963.10500830. JSTOR 2282952. MR 0144363.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.