Austin David Brown

K-nearest Neighbors in Python


$ \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $ The K-nearest neighbors classifier (KNN) is a memory-based classifier; it will be clear what this means later. Let $(X_i, y_i)$ be i.i.d. with distribution $p(x, y) dm$, where $X_i \in \mathbb{R}^d$ are the features and $y_i \in \{c_1, \ldots, c_m \}$ are the labels. Choose $k \in \mathbb{N}$ and let $\mathcal{D}_n = (X_i, y_i)_{i = 1}^n$ be the training data. For a point $x$ in the feature space, let $N_k(x, \mathcal{D}_n)$ be the indices of the $k$ training points nearest to $x$ under a metric, typically chosen as the $l^2$ norm. We can estimate $p(x, y)$ with \[ p_n(x, y = c_i | \mathcal{D}_n, k) = \frac{1}{n} \sum_{j \in N_k(x, \mathcal{D}_n)} I(y_j = c_i) \] which yields the posterior probability estimate \[ P(y = c_i | x, \mathcal{D}_n, k) = \frac{1}{k} \sum_{j \in N_k(x, \mathcal{D}_n)} I(y_j = c_i). \] The classification rule selects the class attaining \[ \norm{P(y | x, \mathcal{D}_n, k)}_\infty = \max_{c_i} P(y = c_i | x, \mathcal{D}_n, k), \] that is, the majority vote among the $k$ nearest neighbors. For $k = 1$, this partitions the feature space into the Voronoi diagram of the training points, and in general it allows for a very complicated decision boundary, more complicated even than that of decision trees. Some theory can be established here as well: asymptotically, the error rate of the 1-nearest neighbor classifier is bounded above by twice the error rate of the Bayes classifier.
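
To make the posterior estimate concrete, here is a minimal sketch on a toy 2-dimensional dataset (the points, labels, and query below are made up purely for illustration): with $k = 3$, the estimated posterior for each class is simply the fraction of votes among the 3 nearest training points, and the prediction is the majority vote.
import numpy as np

# Toy 2-dimensional training data (made up purely for illustration).
X_train_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.2, 0.1]])
y_train_toy = np.array([0, 0, 1, 1, 0])

x = np.array([0.15, 0.15])   # query point
k = 3

# Euclidean distances from x to each training point.
distances = np.linalg.norm(X_train_toy - x, axis=1)
# Indices of the k nearest training points: N_k(x, D_n).
neighbors = np.argsort(distances)[0:k]

# Estimated posterior P(y = c | x, D_n, k): fraction of votes among the k nearest labels.
for c in np.unique(y_train_toy):
    print("P(y =", c, "| x) =", np.mean(y_train_toy[neighbors] == c))

# Classification rule: the class with the largest estimated posterior (majority vote).
print("prediction:", np.bincount(y_train_toy[neighbors]).argmax())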

Training the KNN classifier only requires storing the training data, which is $O(1)$ computation; this is why it is a memory-based classifier. A naive algorithm for a single prediction computes $n$ distances at $O(d)$ each and then sorts them, which costs $O(nd)$ for the distances plus $O(n \log n)$ on average for the sort (quicksort is $O(n^2)$ in the worst case). This is quite slow for large $n$.
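
As a sketch of how the per-query cost can be reduced in practice, the distances can be computed in a vectorized way and np.argpartition can be used to select the $k$ smallest distances in $O(n)$ time instead of fully sorting them. The helper knn_predict_vectorized below is hypothetical and is not part of the class implemented later; it only illustrates the idea.
import numpy as np

def knn_predict_vectorized(X_train, y_train, X_query, k):
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k smallest distances per row; argpartition avoids a full sort.
    nearest = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    # Majority vote among the k nearest labels.
    preds = np.empty(X_query.shape[0], dtype=y_train.dtype)
    for i, idx in enumerate(nearest):
        labels, counts = np.unique(y_train[idx], return_counts=True)
        preds[i] = labels[counts.argmax()]
    return preds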

One issue is that, in order for KNN to perform well, it needs to memorize a large amount of training data, and the amount of data needed to cover the feature space grows exponentially in the dimension, so KNN suffers in high dimensions. A human analogy would be taking an exam by memorizing the lecture notes and then answering each question with the closest thing memorized from the notes. We implement the KNN class below.
import numpy as np
import matplotlib.pyplot as plt

class KNN():
  def __init__(self, k):
    self.k = k

  def fit(self, X, y):
    # "Training" only stores the data; KNN is memory-based.
    self.X_train = X
    self.y_train = y

  def predict(self, X):
    pred = np.zeros(X.shape[0])
    for i in range(0, X.shape[0]):
      # Euclidean distance from X[i] to every training point
      distances = np.zeros(self.X_train.shape[0])
      for j in range(0, self.X_train.shape[0]):
        distances[j] = np.sqrt(np.sum(np.power(X[i, :] - self.X_train[j, :], 2)))
      # Labels of the k nearest training points
      k_nearest = self.y_train[np.argsort(distances)[0:self.k]]
      # Majority vote: the most common label among the k nearest neighbors
      k_nearest_set, I = np.unique(k_nearest, return_inverse=True)
      mode = k_nearest_set[np.argmax(np.bincount(I))]
      pred[i] = mode
    return pred


We will use just the test set from the UCI optical handwritten digits dataset.
test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8, 8)), cmap = "Greys", vmin = 0, vmax = 16)
fig.tight_layout()
plt.show()

We do a train/test split.
# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]
We fit the k-nearest neighbors classifier for $k \in \{1, 2, 5\}$ and report the test accuracies.
# Fit
K = [1, 2, 5]
accuracies = np.array([])
for k in K:
  classifier = KNN(k = k)
  classifier.fit(X_train, y_train)
  accuracy = 1/X_test.shape[0] * np.sum(classifier.predict(X_test) == y_test)
  accuracies = np.append(accuracies, accuracy)
  print("Test accuracy for k =", k, ":", accuracy)
Test accuracy for k = 1 : 0.9844444444444445
Test accuracy for k = 2 : 0.9822222222222222
Test accuracy for k = 5 : 0.9822222222222222
We plot the test accuracies.
fig = plt.figure(figsize=(8, 6))
plt.plot(K, accuracies, marker = 'o')
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()
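
As an optional sanity check (assuming scikit-learn is installed, which this post does not otherwise require), the same split can be run through sklearn.neighbors.KNeighborsClassifier; the accuracies should be comparable to those of our implementation.
# Optional sanity check with scikit-learn (assumes scikit-learn is installed).
from sklearn.neighbors import KNeighborsClassifier

for k in K:
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train, y_train)
    print("sklearn test accuracy for k =", k, ":", clf.score(X_test, y_test))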

References.

http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.

Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.