Training the KNN classifier requires only storing the training data, which is $O(1)$ computation (though $O(nd)$ memory for $n$ training points in $d$ dimensions). This is why it is called a memory-based classifier. A naive prediction computes the distance to each of the $n$ training points at $O(d)$ per distance, then sorts the distances with quicksort, which is $O(n \log n)$ on average ($O(n^2)$ in the worst case), giving $O(nd + n \log n)$ per query. This is quite slow.
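As an aside, a full sort is not actually needed to find the $k$ nearest points: np.argpartition selects the $k$ smallest distances in $O(n)$ average time. A minimal sketch on made-up distances:

import numpy as np

rng = np.random.default_rng(0)
distances = rng.random(10000)  # made-up distances for illustration
k = 5

# Full sort: O(n log n) on average.
by_sort = np.argsort(distances)[:k]
# Partial selection: O(n) on average; the k indices come back unordered.
by_partition = np.argpartition(distances, k)[:k]

print(np.array_equal(np.sort(by_sort), np.sort(by_partition)))  # True: same k neighbors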
Another issue is that for KNN to perform well, it needs to memorize a large amount of training data, and the amount of data needed to cover the input space grows exponentially with the dimension, so KNN suffers in high dimensions (the curse of dimensionality). A human analogy would be taking an exam by memorizing the lecture notes and then answering each question with the closest thing memorized from the notes. Below is a quick numerical illustration of this effect; we then implement the KNN class.
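A related symptom is distance concentration: for points drawn uniformly in $[0, 1]^d$, the gap between the nearest and farthest neighbor shrinks relative to the nearest distance as $d$ grows, so "nearest" becomes less meaningful. A rough sketch (the sample size and dimensions here are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))  # 1000 points in [0, 1]^d
    q = rng.random(d)          # one query point
    dist = np.sqrt(np.sum((X - q) ** 2, axis=1))
    # Relative contrast between farthest and nearest neighbor shrinks with d.
    print(d, (dist.max() - dist.min()) / dist.min())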
import numpy as np
import matplotlib.pyplot as plt

class KNN():
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # "Training" just stores the data.
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        pred = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            # Euclidean distance from test point i to every training point.
            distances = np.zeros(self.X_train.shape[0])
            for j in range(self.X_train.shape[0]):
                distances[j] = np.sqrt(np.sum(np.power(X[i, :] - self.X_train[j, :], 2)))
            # Labels of the k nearest training points.
            k_nearest = self.y_train[np.argsort(distances)[0:self.k]]
            # Majority vote: the most common label among the k nearest.
            k_nearest_set, I = np.unique(k_nearest, return_inverse=True)
            pred[i] = k_nearest_set[np.argmax(np.bincount(I))]
        return pred
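As a quick sanity check of the class on made-up toy points, a query's prediction should match the label of its surrounding cluster:

# Two well-separated 2-D clusters with labels 0 and 1 (toy data).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = KNN(k=3)
clf.fit(X_toy, y_toy)
print(clf.predict(np.array([[0.1, 0.1], [5.0, 4.9]])))  # expect [0. 1.]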
We will use just the test set from the UCI optical handwritten digits dataset.
test = np.loadtxt("data/optdigits_test.txt", delimiter=",")
X = test[:, 0:64]
y = test[:, 64]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8, 8)), cmap="Greys", vmin=0, vmax=16)
plt.show()
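If the UCI file is not available locally, scikit-learn bundles the same 8x8 digits; a sketch, assuming scikit-learn is installed:

from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data    # shape (1797, 64), pixel values 0..16
y = digits.target  # shape (1797,), digit labels 0..9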
We do a train/test split.
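The split below draws training indices at random, so accuracies vary from run to run; seeding NumPy's global generator beforehand makes the split reproducible (the seed value here is an arbitrary choice):

np.random.seed(0)  # any fixed seed makes np.random.choice below deterministic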
# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace=False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]

We fit k-nearest neighbors for $k \in \{1, 2, 5\}$ and report the test accuracies.
# Fit
K = [1, 2, 5]
accuracies = np.array([])
for k in K:
    classifier = KNN(k=k)
    classifier.fit(X_train, y_train)
    accuracy = 1 / X_test.shape[0] * np.sum(classifier.predict(X_test) == y_test)
    accuracies = np.append(accuracies, accuracy)
    print("Test accuracy for k =", k, ":", accuracy)
Test accuracy for k = 1 : 0.9844444444444445
Test accuracy for k = 2 : 0.9822222222222222
Test accuracy for k = 5 : 0.9822222222222222

We plot the test accuracies.
fig = plt.figure(figsize=(8, 6))
plt.plot(K, accuracies, marker='o')
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()
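The double Python loop in predict is slow for larger datasets. A sketch of a vectorized alternative that computes all pairwise distances in one broadcasted operation, assuming memory for the intermediate (n_test, n_train, d) array is available (roughly 300 MB for this dataset); the helper name is my own, not part of the class above:

def predict_vectorized(knn, X):
    # Pairwise Euclidean distances via broadcasting: shape (n_test, n_train).
    diffs = X[:, np.newaxis, :] - knn.X_train[np.newaxis, :, :]
    distances = np.sqrt(np.sum(diffs ** 2, axis=2))
    # Indices of the k nearest training points per test point.
    nearest = np.argsort(distances, axis=1)[:, :knn.k]
    k_labels = knn.y_train[nearest]  # shape (n_test, k)
    # Majority vote per row; the heavy distance work above is vectorized.
    pred = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        values, counts = np.unique(k_labels[i], return_counts=True)
        pred[i] = values[np.argmax(counts)]
    return pred

# Usage: pred = predict_vectorized(classifier, X_test)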
References.
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.
Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.