For a single trial, this is when $Y$ takes a value from one of $K$ classes with probability vector $p$ and density \[ P(Y = y) = \prod_{k = 1}^K p_k^{y_k}. \] Let $(y_i, x_i)$, $i = 1, \ldots, n$, be i.i.d. data where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1 \}^K$ with $K$ the number of classes. We wish to estimate $P_Y$ by minimizing the relative entropy \[ D(P_Y \,\|\, Q(x)) \] over approximations $Q$, which in practice we do by minimizing an empirical approximation to the cross-entropy $H(P_Y, Q(x))$, \[ H_n(Q(x)) = -\frac{1}{n} \sum_{i = 1}^n \sum_{k = 1}^K y^k_i \log(q_k(x_i)). \] We only work with the cross-entropy term because the entropy of $P_Y$, which we wish to estimate, does not depend on $Q$, so minimizing the cross-entropy over $Q$ is equivalent to minimizing the relative entropy.
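To make the equivalence explicit, writing $p_k$ for the probabilities under $P_Y$ and $q_k$ for those under $Q(x)$, the relative entropy decomposes into the cross-entropy minus a term that does not involve $Q$: \[ D(P_Y \,\|\, Q) = \sum_{k = 1}^K p_k \log\frac{p_k}{q_k} = \underbrace{-\sum_{k = 1}^K p_k \log q_k}_{H(P_Y,\, Q)} - \underbrace{\left(-\sum_{k = 1}^K p_k \log p_k\right)}_{H(P_Y)}, \] so any minimizer of the cross-entropy in $Q$ is also a minimizer of $D(P_Y \,\|\, Q)$.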
We derive the algorithmic differentiation for the multinomial logistic regression. We use the softmax sigmoid $\sigma : \mathbb{R}^K \to [0, 1]^K$ defined by \[ \sigma(f) = \left( \sigma_k(f) \right)_{k = 1}^K = \left( \frac{\exp(f_k)}{\sum_{j = 1}^K \exp(f_j)} \right)_{k = 1}^K \] to map into the $K$-probability simplex. The derivative matrix of $\sigma$ is given by \[ D_j \sigma_i(f) = \sigma_i(f) (1 - \sigma_i(f)) \text{ for $i = j$} \] and \[ D_j \sigma_i(f) = -\sigma_i(f) \sigma_j(f) \text{ for $i \not= j$}. \] We use an affine function for each class $k$, \[ f_k(x) = w_k^T x + b_k, \] where $w_k \in \mathbb{R}^d$ and $b_k \in \mathbb{R}$. The derivatives (not the gradients) are \begin{align*} D_{w_k} f_k(x) &= x^T \\ D_{b_k} f_k(x) &= 1. \end{align*} We then choose $Q(x) = \sigma(f(x))$. Using that $1^T y^i = 1$ and that only $f_l$ depends on $w_l$ and $b_l$, the derivative of the cross-entropy approximation is \begin{align*} D_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n \left( -y^i_l \, D_{g} \log(g) \, D_{f_l} \sigma_l(f(x^i)) - \sum_{k \not= l} y^i_k \, D_{g} \log(g) \, D_{f_l} \sigma_k(f(x^i)) \right) D_{w_l} f_l(x^i) \\ &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) \, {x^i}^T \\ D_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l). \end{align*} Here $D_{g} \log(g) = 1/g$ is evaluated at the corresponding softmax output, and substituting the softmax derivatives together with $1^T y^i = 1$ collapses the inner sum to $\sigma_l(f(x^i)) - y^i_l$. The gradient is \begin{align*} \nabla_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n x^i (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} X^T (\sigma_l(f(X)) - Y_l) \\ \nabla_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} 1_n^T (\sigma_l(f(X)) - Y_l). \end{align*} We can collect the weights and biases into the matrix $W \in M_{d \times K}(\mathbb{R})$ and the vector $b \in \mathbb{R}^K$ so that \begin{align*} dW &= \frac{1}{n} X^T (\sigma(f(X)) - Y) \\ db &= \frac{1}{n} 1_n^T (\sigma(f(X)) - Y) \end{align*} collect the gradients for all classes. We train the multinomial logistic regression with gradient descent. Let $\eta$ be the learning rate and $W^0, b^0$ the initial values. Then the gradient descent update for all of the weights and biases is \begin{align*} W^{k + 1} &= W^{k} - \eta \, dW^k \\ b^{k + 1} &= b^{k} - \eta \, db^k \end{align*} at each iteration $k$.
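As a sanity check on the derivation, the analytic gradient $dW$ can be compared against a finite-difference approximation of $H_n$. Below is a minimal sketch under assumed small random data; the helper names softmax and cross_entropy and the specific shapes are illustrative, not part of the text above.

import numpy as np

def softmax(A):
    # Row-wise softmax of an (n, K) matrix of scores.
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, P):
    # H_n: average negative log-likelihood of the one-hot labels Y under P.
    return -np.mean(np.sum(Y * np.log(P), axis=1))

rng = np.random.default_rng(0)
n, d, K = 20, 5, 3
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]   # one-hot labels
W = rng.normal(size=(d, K))
b = rng.normal(size=(1, K))

# Analytic gradient: dW = (1/n) X^T (sigma(f(X)) - Y)
P = softmax(X @ W + b)
dW = X.T @ (P - Y) / n

# Finite-difference check of one entry of dW
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
num = (cross_entropy(Y, softmax(X @ W_plus + b))
       - cross_entropy(Y, softmax(X @ W_minus + b))) / (2 * eps)
print(dW[0, 0], num)   # the two values should agree to several decimals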
To predict with multinomial regression, the softmax sigmoid gives us $K$ probability values, one per class, at each sample $x_i$, and we predict the class with the largest probability \[ \hat{y}_i = \operatorname*{arg\,max}_{k} \sigma_k(f(x_i)). \]
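For a whole batch of samples this is a single argmax along the class axis. A minimal sketch, assuming P holds the softmax outputs with one row per sample:

import numpy as np

# P: (n_samples, K) softmax probabilities; two illustrative rows
P = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])
y_hat = np.argmax(P, axis = 1)   # predicted class index per sample
print(y_hat)                     # [1 0]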
Generally the label data is collected as class labels $y_i \in \{ 0, \ldots, K - 1 \}$, so we need to convert each label into a vector $y_i \in \{ 0, 1 \}^K$ such that $1^T y_i = 1$. This is usually called one-hot encoding, converting the vector $(y_i)$ into the matrix $Y$. The data is then the pair of matrices $(Y, X)$, where the rows of each matrix correspond to the samples.
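A minimal sketch of this conversion, assuming integer labels in $\{0, \ldots, K-1\}$ (the same indexing trick appears in the train method below):

import numpy as np

y = np.array([2, 0, 1, 2])           # integer class labels
K = np.unique(y).shape[0]            # number of classes
n = y.shape[0]

Y = np.zeros((n, K))
Y[np.arange(n), y.astype(int)] = 1   # set a single 1 per row
print(Y)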
Also, in this setup the parameterization of the multinomial probabilities is not one-to-one: the probabilities sum to 1, so the $K$ sets of weights and biases are redundant. We could fix this by modeling only $K - 1$ classes and using $1 - 1_{K - 1}^T p$ as the $K$th probability.
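One way to see the redundancy is that the softmax is invariant to adding the same constant to every score, so distinct parameters can produce the same probabilities: \[ \sigma_k(f + c\,1_K) = \frac{\exp(f_k + c)}{\sum_{j = 1}^K \exp(f_j + c)} = \frac{e^{c} \exp(f_k)}{e^{c} \sum_{j = 1}^K \exp(f_j)} = \sigma_k(f) \quad \text{for any } c \in \mathbb{R}. \]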
We implement the multinomial regression in Python.
import numpy as np
import matplotlib.pyplot as plt

class MultRegression():
    @staticmethod
    def softmax(M):
        exps = np.exp(M)
        S_exps = (exps @ np.ones(M.shape[1]))[:, np.newaxis]
        return 1/S_exps * exps

    @staticmethod
    def entropy(Y, P):
        n_samples = P.shape[0]
        n_classes = P.shape[1]
        return -1/n_samples * np.ones(n_samples).T @ (Y * np.log(P) @ np.ones(n_classes))

    def train(self, X, y, max_iter = 501, learning_rate = .1):
        n_samples = X.shape[0]
        n_features = X.shape[1]
        n_classes = np.unique(y).shape[0]

        # Convert to a multinomial vector
        Y = np.zeros((n_samples, n_classes))
        Y[np.arange(n_samples), np.array(y.T, dtype = int)] = 1

        self.W = np.zeros((n_features, n_classes))
        self.b = np.zeros((1, n_classes))

        for i in range(0, max_iter):
            # Forward pass
            A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
            O = MultRegression.softmax(A_o)
            if i % 100 == 0:
                print(f"Iteration: {i}, Training Loss: {MultRegression.entropy(Y, O)}")

            # Backward pass
            dW = 1/n_samples * X.T @ (O - Y)
            db = 1/n_samples * np.ones((n_samples, 1)).T @ (O - Y)

            # Gradient step
            self.W = self.W - learning_rate * dW
            self.b = self.b - learning_rate * db

    def predict(self, X):
        n_samples = X.shape[0]
        n_classes = self.W.shape[1]
        A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
        O = MultRegression.softmax(A_o)
        return np.argmax(O, axis = 1)
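The softmax above exponentiates the raw scores directly, which can overflow for large score values. A common numerically stable variant (a sketch, not part of the class above) subtracts the row-wise maximum before exponentiating; the output is unchanged because the softmax is invariant to adding a constant to each row:

import numpy as np

def softmax_stable(M):
    # Subtract the row-wise max before exponentiating; the softmax is
    # unchanged because the added constant cancels in the ratio.
    shifted = M - np.max(M, axis = 1, keepdims = True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis = 1, keepdims = True)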
We load the test set from the UCI optical handwritten digits dataset, do a train/test split, and plot some of the images.

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8,8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()
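Since the split draws a random subset without fixing a seed, the reported accuracies will vary slightly between runs. A small self-contained sketch of one way to make the split repeatable (an addition, not in the original code; X_demo stands in for the loaded feature matrix):

import numpy as np

np.random.seed(0)                          # fix the RNG so the split is repeatable
X_demo = np.arange(40).reshape(10, 4)      # stand-in for the loaded feature matrix
n_samples = X_demo.shape[0]
I = np.arange(n_samples)
TRAIN = np.random.choice(I, int(.75 * n_samples), replace = False)
TEST = np.setdiff1d(I, TRAIN)
print(TRAIN, TEST)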
We train and report the training and test accuracies.
mlr = MultRegression()
mlr.train(X_train, y_train)

print("Train accuracy:", 1/X_train.shape[0] * np.sum((mlr.predict(X_train) == y_train).astype(int)))
print("Test accuracy:", 1/X_test.shape[0] * np.sum((mlr.predict(X_test) == y_test).astype(int)))
Iteration: 0, Training Loss: 2.3025850929940437
Iteration: 100, Training Loss: 0.0570321213260572
Iteration: 200, Training Loss: 0.023883837390567066
Iteration: 300, Training Loss: 0.014569671068230965
Iteration: 400, Training Loss: 0.010310262882962419
Iteration: 500, Training Loss: 0.008206230605775272
Train accuracy: 1.0
Test accuracy: 0.9688888888888889
References.
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.
Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
https://github.com/zotroneneis/machine_learning_basics/blob/master/softmax_regression.ipynb