Austin David Brown

PCA in Python


Principal Component Analysis (PCA) is an unsupervised learning, or lossy compression, method. Let $X = (X_i)_{i=1}^n$ with $X_i \in \mathbb{R}^p$ be the data, where $n$ is the number of observations and $p$ is the number of features. The sample covariance matrix of the features is \[ \frac{1}{n - 1} (X - \bar X)^T (X - \bar X), \] where the mean $\bar X$ is taken over the rows (observations). This matrix contains all of the variance and covariance information of the features. It is symmetric and positive semi-definite, so by the Spectral Theorem it has real, non-negative eigenvalues and an orthonormal basis of eigenvectors. The eigenvectors belonging to the largest eigenvalues are the directions along which the data vary the most: we can select the $K$ largest eigenvalues and project the data onto the corresponding eigenvectors, which reduces the dimension whenever $K < p$. By construction this is a lossy compression method, and because it is based on Euclidean distance it only captures linear relations. Since we compare the magnitudes of the eigenvalues, it would be unfair if the features were on different scales, so we first standardize $X$; the sample covariance matrix of the standardized data $\tilde X$ is then simply $\frac{1}{n-1} \tilde X^T \tilde X$.
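Concretely, writing $U_K \in \mathbb{R}^{p \times K}$ for the matrix whose columns are the eigenvectors of the $K$ largest eigenvalues (notation introduced here for clarity), the compressed representation is \[ Z = \tilde X U_K \in \mathbb{R}^{n \times K}, \] and an approximate reconstruction in the original coordinates is obtained from $Z U_K^T$ by undoing the standardization.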

We will use PCA on the test set of the UCI optical recognition of handwritten digits dataset. There are 64 features along with a class label for each observation. Each feature is an integer pixel value between 0 and 16 from a preprocessed 8x8 image.
import numpy as np
import matplotlib.pyplot as plt

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Plot digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8,8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()
We standardize with respect to the features.
# Center and scale each feature; the small constant guards against
# division by zero for features with zero variance.
mu = np.mean(X, axis = 0)
sigma = np.std(X, axis = 0)
sX = 1/(sigma + 10**(-10)) * (X - mu)
To compute the eigenvalues we rely on numerical linear algebra. NumPy's np.linalg.eig calls LAPACK, which uses a variant of the QR algorithm; for a dense $p \times p$ matrix this costs on the order of $p^3$ operations. For very large matrices, iterative methods such as Lanczos or randomized algorithms are typically used instead to compute only the leading eigenvalues and eigenvectors.
# The covariance matrix of the features! not the observations
E = 1/(sX.shape[0] - 1) * sX.T @ sX
spectrum, U = np.linalg.eig(E)
print("Eigenvectors:", spectrum)
print("Eigenvalues:", U)
Eigenvalues: 
[7.34477606 5.83549054 5.15396118 3.96623597 2.96634519 2.57204442
 2.40600941 2.06867355 1.82993314 1.78951739 1.69784615 1.57287888
 1.38870781 1.35933609 1.32152536 1.16829176 1.08368677 0.99977861
 0.97438293 0.90891242 0.82271926 0.77631014 0.71155675 0.64552365
 0.59527399 0.5765018  0.52673155 0.5106363  0.48686381 0.45560107
 0.44285155 0.42230086 0.3991063  0.39110111 0.36094517 0.34860306
 0.3195963  0.29406627 0.27692285 0.05037444 0.06328961 0.258273
 0.24783029 0.2423566  0.07635394 0.08246812 0.09018543 0.09840876
 0.10250434 0.11188655 0.11932898 0.12426371 0.13321081 0.14311427
 0.217582   0.15818474 0.16875236 0.20799593 0.17612894 0.2000909
 0.18983516 0.         0.         0.        ]

 Eigenvectors: 
 [[ 0.          0.          0.         ...  1.          0.
   0.        ]
 [ 0.18223392 -0.04702701  0.02358821 ...  0.          0.
   0.        ]
 [ 0.285868   -0.0595648  -0.05679875 ...  0.          0.
   0.        ]
 ...
 [ 0.103198    0.24261778 -0.02227952 ...  0.          0.
   0.        ]
 [ 0.1198106   0.16508926  0.10036559 ...  0.          0.
   0.        ]
 [ 0.07149362  0.07132924  0.09244589 ...  0.          0.
   0.        ]]
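Since the covariance matrix is symmetric, np.linalg.eigh, which calls LAPACK's symmetric eigensolver, is generally the better fit here: it guarantees real eigenvalues and returns them in ascending order. A minimal sketch reusing E from above (spectrum_h and U_h are just illustrative names):
# eigh exploits symmetry; eigenvalues come back sorted in ascending order
spectrum_h, U_h = np.linalg.eigh(E)
spectrum_h = spectrum_h[::-1]  # flip to descending order
U_h = U_h[:, ::-1]             # reorder the eigenvectors to match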
We can compute the explained variability (the proportion of the total variance) for each eigenvalue
spectrum_corr = spectrum / np.sum(spectrum)
print("Explained variability:", spectrum_corr)
Explained variability: 
[0.12033916 0.09561054 0.08444415 0.06498408 0.04860155 0.0421412
 0.03942083 0.03389381 0.02998221 0.02932003 0.02781805 0.02577055
 0.02275303 0.0222718  0.02165229 0.01914167 0.01775547 0.01638069
 0.0159646  0.01489191 0.0134797  0.01271931 0.01165837 0.01057647
 0.00975316 0.00944559 0.00863014 0.00836643 0.00797693 0.00746471
 0.00725582 0.00691911 0.00653909 0.00640793 0.00591384 0.00571162
 0.00523637 0.00481808 0.00453719 0.00082535 0.00103696 0.00423163
 0.00406053 0.00397085 0.00125101 0.00135118 0.00147763 0.00161236
 0.00167946 0.00183318 0.00195512 0.00203598 0.00218257 0.00234483
 0.00356493 0.00259175 0.00276489 0.00340787 0.00288575 0.00327835
 0.00311032 0.         0.         0.        ]
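One common way to pick $K$, which we do not use below, is to look at the cumulative explained variance. A minimal sketch, assuming we want roughly 90% of the total variance (the threshold is arbitrary):
# Sort the explained-variance ratios in descending order and accumulate them
sorted_ratios = np.sort(spectrum_corr)[::-1]
cumulative = np.cumsum(sorted_ratios)
K_90 = int(np.argmax(cumulative >= 0.90)) + 1
print("Components needed for 90% of the variance:", K_90)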
We take the 2 largest eigenvalues and project the standardized data onto the corresponding eigenvectors
K = 2
I = np.argsort(-spectrum)[0:K]
Z = sX @ U[:, I]
We can look at the corresponding eigenvalues
spectrum[I]
array([7.34477606, 5.83549054])
PCA can be used for visualization. Since we kept only 2 eigenvectors, we can plot the projected data in the plane, colored by digit label.
plt.style.use("default")
fig = plt.figure(figsize=(8, 6))
plt.tight_layout()
plt.scatter(Z[:, 0], Z[:, 1], s = 5, c = y)
plt.xlabel('Z1')
plt.ylabel('Z2')
plt.colorbar()
plt.show()
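Since PCA is lossy, we can also map $Z$ back to the 64-dimensional pixel space and compare against the original digits. A minimal sketch, reusing mu, sigma, U, and I from above (X_hat is an illustrative name, not from the original post):
# Reconstruct from the K = 2 components and undo the standardization
X_hat = (Z @ U[:, I].T) * (sigma + 10**(-10)) + mu

fig = plt.figure(figsize=(8, 3))
for i in range(0, 5):
    ax = fig.add_subplot(1, 5, i + 1)
    ax.imshow(X_hat[i].reshape((8, 8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()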
We now look at PCA for prediction using logistic regression. First we split the dataset into training and test sets
# Train test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]
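The split above is not seeded, so the exact accuracies below will vary from run to run. A reproducible alternative is scikit-learn's train_test_split; a minimal sketch, not the split used for the numbers below:
# Seeded 75/25 split with scikit-learn
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size = 0.75, random_state = 0)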
We will use scikit-learn's implementation of logistic regression. First we fit using all 64 features.
# Using all of the features
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = "lbfgs", max_iter = 1000, multi_class = "multinomial")
lr.fit(X_train, y_train)
print("Train accuracy:", 1/X_train.shape[0] * np.sum(lr.predict(X_train) == y_train))
print("Test accuracy", 1/X_test.shape[0] * np.sum(lr.predict(X_test) == y_test))
Train accuracy: 1.0
Test accuracy 0.9622222222222222
We use PCA with $K = 10$.
# Using PCA
K = 10
sX_train = 1/(np.std(X_train, axis = 0) + 10**(-10)) * (X_train - np.mean(X_train, axis = 0))
E_train = 1/(sX_train.shape[0] - 1) * sX_train.T @ sX_train
spectrum_train, U_train = np.linalg.eig(E_train)
I = np.argsort(-spectrum_train)[0:K]
Z_train = sX_train @ U_train[:, I]
sX_test = 1/(np.std(X_test, axis = 0) + 10**(-10)) * (X_test - np.mean(X_test, axis = 0))
Z_test = sX_test @ U_train[:, I]

lr = LogisticRegression(solver = "lbfgs", max_iter = 1000, multi_class = "multinomial")
lr.fit(Z_train, y_train)
print("Train accuracy for K =", K, ":", 1/Z_train.shape[0] * np.sum(lr.predict(Z_train) == y_train))
print("Test accuracy for K =", K, ":", 1/Z_test.shape[0] * np.sum(lr.predict(Z_test) == y_test))
Train accuracy for K = 10 : 0.9057164068299925
Test accuracy for K = 10 : 0.8777777777777778
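For comparison, the same $K = 10$ experiment can be written with scikit-learn's StandardScaler, PCA, and Pipeline, which fit the scaling and the principal directions on the training data only and reuse them to transform the test data. A minimal sketch, not the code that produced the numbers above:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# The scaler and PCA are fit on the training data and reused on the test data
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components = 10)),
    ("lr", LogisticRegression(solver = "lbfgs", max_iter = 1000, multi_class = "multinomial")),
])
pipe.fit(X_train, y_train)
print("Pipeline test accuracy for K = 10:", pipe.score(X_test, y_test))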
We reduced the dimension significantly, but at the cost of some prediction accuracy.

References.

https://github.com/ethen8181/machine-learning

http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.

Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.