For a single trial, $Y$ takes a value from one of $K$ classes with probabilities $p_1, \ldots, p_K$, encoded as $y \in \{0, 1\}^K$ with $1^T y = 1$, and has density \[ P(Y = y) = \prod_{k = 1}^K p_k^{y_k}. \] Let $(y_i, x_i)$ be i.i.d. data where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1 \}^K$ and $K$ is the number of classes. We wish to estimate $P_Y$ with an approximation $Q(x)$ by minimizing the relative entropy \[ D(P_Y \,\|\, Q(x)). \] Since the relative entropy is the cross-entropy of $Q(x)$ with respect to $P_Y$ minus the entropy of $P_Y$, and the entropy of $P_Y$ does not depend on $Q$, minimizing the cross-entropy is equivalent. We do not know $P_Y$, so we minimize the empirical cross-entropy \[ H_n(Q(x)) = -\frac{1}{n} \sum_{i = 1}^n \sum_{k = 1}^K y^k_i \log(q_k(x_i)). \]
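As a small numerical illustration of $H_n$ (a minimal numpy sketch with made-up labels and probabilities; the array names are only for this example):

import numpy as np

# One-hot labels y_i and approximate probabilities q(x_i) for n = 3 samples, K = 2 classes
Y = np.array([[1, 0],
              [0, 1],
              [1, 0]])
Q = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

# H_n(Q) = -1/n sum_i sum_k y_i^k log(q_k(x_i)); only the true class contributes per sample
H_n = -np.mean(np.sum(Y * np.log(Q), axis = 1))
print(H_n)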
We derive the algorithmic differentiation for multinomial logistic regression. We use the softmax sigmoid $\sigma : \mathbb{R}^K \to [0, 1]^K$ defined by \[ \sigma(f) = \left( \sigma_k(f) \right)_{k = 1}^K = \left( \frac{\exp(f_k)}{\sum_{j = 1}^K \exp(f_j)} \right)_{k = 1}^K \] to map into the $K$-probability simplex. The derivative matrix of $\sigma$ has entries \[ D_j \sigma_i(f) = \sigma_i(f) (1 - \sigma_i(f)) \text{ for $i = j$} \] and \[ D_j \sigma_i(f) = -\sigma_i(f) \sigma_j(f) \text{ for $i \not= j$}. \] We use an affine function for each class $k$, \[ f_k(x) = w_k^T x + b_k, \] where $w_k \in \mathbb{R}^d$ and $b_k \in \mathbb{R}$. The derivatives (not the gradients) are \begin{align*} D_{w_k} f_k(x) &= x^T \\ D_{b_k} f_k(x) &= 1. \end{align*} We then choose $Q(x) = \sigma(f(x))$. Using the chain rule and that $1^T y^i = 1$, the derivatives of the cross-entropy approximation are \begin{align*} D_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n \left( -y^i_l \, D_{g} \log(g) \, D_{f_l} \sigma_l(f) \, D_{w_l} f_l(x^i) - \sum_{k \not= l} y^i_k \, D_{g} \log(g) \, D_{f_l} \sigma_k(f) \, D_{w_l} f_l(x^i) \right) \\ &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) \, {x^i}^T, \\ D_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l). \end{align*} The gradients are \begin{align*} \nabla_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n x^i (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} X^T (\sigma_l(f(X)) - Y_l), \\ \nabla_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} 1_n^T (\sigma_l(f(X)) - Y_l). \end{align*} We can collect the weights and biases into the matrices $W \in M_{d \times K}(\mathbb{R})$ and $b \in \mathbb{R}^K$ so that \begin{align*} dW &= \frac{1}{n} X^T (\sigma(f(X)) - Y) \\ db &= \frac{1}{n} 1_n^T (\sigma(f(X)) - Y) \end{align*} collect the gradients for every class. We train multinomial logistic regression with gradient descent. Let $\eta$ be the learning rate with initial values $W^0, b^0$. Then the gradient descent update for all of the weights and biases at each iteration $k$ is \begin{align*} W^{k + 1} &= W^{k} - \eta \, dW^k \\ b^{k + 1} &= b^{k} - \eta \, db^k. \end{align*}
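As a quick sanity check of the matrix form of the gradients (a minimal sketch with made-up data, not part of the derivation; the names softmax, H_n, and the finite-difference step are only for this check):

import numpy as np

rng = np.random.default_rng(0)
n, d, K = 5, 3, 4
X = rng.normal(size = (n, d))
Y = np.eye(K)[rng.integers(0, K, size = n)]  # one-hot labels
W = rng.normal(size = (d, K))
b = rng.normal(size = (1, K))

def softmax(M):
    # Shifting by the row max does not change the value and avoids overflow
    E = np.exp(M - M.max(axis = 1, keepdims = True))
    return E / E.sum(axis = 1, keepdims = True)

def H_n(W, b):
    P = softmax(X @ W + b)
    return -np.mean(np.sum(Y * np.log(P), axis = 1))

# Closed-form gradients dW, db from above
P = softmax(X @ W + b)
dW = 1/n * X.T @ (P - Y)
db = 1/n * np.ones((1, n)) @ (P - Y)

# Compare one entry of dW against a forward finite difference
eps = 1e-6
W_eps = W.copy()
W_eps[0, 0] += eps
print(dW[0, 0], (H_n(W_eps, b) - H_n(W, b)) / eps)  # the two values should agree closely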
To predict with multinomial regression, the softmax sigmoid gives us $K$ probability values for each sample $x_i$, one per class, and we predict the class with the largest probability, \[ \arg\max_k \sigma_k(f(x_i)), \] whose value is $\| \sigma(f(x_i)) \|_{\infty}$.
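For example, with made-up probabilities for one sample (a toy sketch):

import numpy as np

probs = np.array([0.1, 0.7, 0.2])  # sigma(f(x_i)) for K = 3 classes
print(np.max(probs))               # the largest probability
print(np.argmax(probs))            # the predicted class index, here 1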
Generally the label data is collected as class indices $y_i \in \{ 0, \ldots, K - 1 \}$, so we need to convert each label into a vector $y_i \in \{ 0, 1 \}^K$ such that $1^T y_i = 1$. This is usually called one-hot encoding, converting the vector $(y_i)$ into the matrix $Y$. The data is then the pair of matrices $(Y, X)$ where the rows of each matrix correspond to the samples.
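A minimal numpy sketch of this conversion (the variable names are only for this example; the same indexing trick is used in the train method below):

import numpy as np

y = np.array([2, 0, 1, 2])              # class indices for n = 4 samples, K = 3
K = np.unique(y).shape[0]
Y = np.zeros((y.shape[0], K))
Y[np.arange(y.shape[0]), y] = 1         # one-hot matrix: each row sums to 1
print(Y)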
Also, in this setup the parameterization of the multinomial distribution is not one-to-one, since the probabilities sum to 1. We could fix this by modeling only $K - 1$ classes and using $1 - 1_{K - 1}^T p$ as the $K$th probability.
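One way to see the redundancy numerically: adding the same constant to every $f_k$ leaves the softmax probabilities unchanged, and the $K$th probability is determined by the other $K - 1$ (a toy sketch, assuming numpy):

import numpy as np

def softmax(f):
    e = np.exp(f - f.max())  # the shift does not change the value
    return e / e.sum()

f = np.array([1.0, -0.5, 2.0])
print(softmax(f))
print(softmax(f + 10.0))                          # same probabilities for shifted parameters
print(1 - softmax(f)[:-1].sum(), softmax(f)[-1])  # the K-th probability is 1 - 1^T p_{1:K-1}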
We implement the multinomial regression in Python.
import numpy as np
import matplotlib.pyplot as plt
class MultRegression():
    @staticmethod
    def softmax(M):
        # Row-wise softmax: exponentiate and normalize each row to sum to 1
        exps = np.exp(M)
        S_exps = (exps @ np.ones(M.shape[1]))[:, np.newaxis]
        return 1/S_exps * exps

    @staticmethod
    def entropy(Y, P):
        # Empirical cross-entropy H_n of the predicted probabilities P against the one-hot labels Y
        n_samples = P.shape[0]
        n_classes = P.shape[1]
        return -1/n_samples * np.ones(n_samples).T @ (Y * np.log(P) @ np.ones(n_classes))

    def train(self, X, y, max_iter = 501, learning_rate = .1):
        n_samples = X.shape[0]
        n_features = X.shape[1]
        n_classes = np.unique(y).shape[0]

        # Convert the labels to a one-hot matrix
        Y = np.zeros((n_samples, n_classes))
        Y[np.arange(n_samples), np.array(y.T, dtype = int)] = 1

        self.W = np.zeros((n_features, n_classes))
        self.b = np.zeros((1, n_classes))
        for i in range(0, max_iter):
            # Forward pass
            A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
            O = MultRegression.softmax(A_o)
            if i % 100 == 0:
                print(f"Iteration: {i}, Training Loss: {MultRegression.entropy(Y, O)}")

            # Backward pass
            dW = 1/n_samples * X.T @ (O - Y)
            db = 1/n_samples * np.ones((n_samples, 1)).T @ (O - Y)

            # Gradient step
            self.W = self.W - learning_rate * dW
            self.b = self.b - learning_rate * db

    def predict(self, X):
        n_samples = X.shape[0]
        n_classes = self.W.shape[1]
        A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
        O = MultRegression.softmax(A_o)
        return np.argmax(O, axis = 1)
We load the test set from the UCI optical handwritten digits dataset, do a train/test split, and plot some of the images.
test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]
# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]
# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8,8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()
We train and report the training and test accuracies.
mlr = MultRegression()
mlr.train(X_train, y_train)
print("Train accuracy:", 1/X_train.shape[0] * np.sum((mlr.predict(X_train) == y_train).astype(int)))
print("Test accuracy:", 1/X_test.shape[0] * np.sum((mlr.predict(X_test) == y_test).astype(int)))
Iteration: 0, Training Loss: 2.3025850929940437
Iteration: 100, Training Loss: 0.0570321213260572
Iteration: 200, Training Loss: 0.023883837390567066
Iteration: 300, Training Loss: 0.014569671068230965
Iteration: 400, Training Loss: 0.010310262882962419
Iteration: 500, Training Loss: 0.008206230605775272
Train accuracy: 1.0
Test accuracy: 0.9688888888888889
References.
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.
Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
https://github.com/zotroneneis/machine_learning_basics/blob/master/softmax_regression.ipynb