For a single trial, this is when $Y$ takes a value from one of $K$ classes with probability vector $p$ and density \[ P(Y = y) = \prod_{k = 1}^K p_k^{y_k}. \] Let $(y_i, x_i)$, $i = 1, \ldots, n$, be i.i.d. data where $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1 \}^K$ with $K$ the number of classes. We wish to estimate $P_Y$ by minimizing the relative entropy \[ D(P_Y \,\|\, Q(x)) \] over approximations $Q$, which in practice we do by minimizing an empirical approximation to the cross-entropy $H(P_Y, Q(x))$, \[ H_n(Q(x)) = -\frac{1}{n} \sum_{i = 1}^n \sum_{k = 1}^K y^k_i \log(q_k(x_i)). \] We only work with the cross-entropy term because the entropy of $P_Y$, which we wish to estimate, does not depend on $Q$, so minimizing the cross-entropy over $Q$ is equivalent to minimizing the relative entropy.
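To make the equivalence explicit, writing $p_k$ for the probabilities under $P_Y$ and $q_k$ for those under $Q(x)$, the relative entropy decomposes into the cross-entropy minus a term that does not involve $Q$: \[ D(P_Y \,\|\, Q) = \sum_{k = 1}^K p_k \log\frac{p_k}{q_k} = \underbrace{-\sum_{k = 1}^K p_k \log q_k}_{H(P_Y,\, Q)} - \underbrace{\left(-\sum_{k = 1}^K p_k \log p_k\right)}_{H(P_Y)}, \] so any minimizer of the cross-entropy in $Q$ is also a minimizer of $D(P_Y \,\|\, Q)$.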
We derive the algorithmic differentiation for the multinomial logistic regression. We use the softmax sigmoid $\sigma : \mathbb{R}^K \to [0, 1]^K$ defined by \[ \sigma(f) = \left( \sigma_k(f) \right)_{k = 1}^K = \left( \frac{\exp(f_k)}{\sum_{j = 1}^K \exp(f_j)} \right)_{k = 1}^K \] to map into the $K$-probability simplex. The derivative matrix of $\sigma$ is given by \[ D_j \sigma_i(f) = \sigma_i(f) (1 - \sigma_i(f)) \text{ for $i = j$} \] and \[ D_j \sigma_i(f) = -\sigma_i(f) \sigma_j(f) \text{ for $i \not= j$}. \] We use an affine function for each class $k$, \[ f_k(x) = w_k^T x + b_k, \] where $w_k \in \mathbb{R}^d$ and $b_k \in \mathbb{R}$. The derivatives (not the gradients) are \begin{align*} D_{w_k} f_k(x) &= x^T \\ D_{b_k} f_k(x) &= 1. \end{align*} We then choose $Q(x) = \sigma(f(x))$. Using that $1^T y^i = 1$ and that only $f_l$ depends on $w_l$ and $b_l$, the derivative of the cross-entropy approximation is \begin{align*} D_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n \left( -y^i_l \, D_{g} \log(g) \, D_{f_l} \sigma_l(f(x^i)) - \sum_{k \not= l} y^i_k \, D_{g} \log(g) \, D_{f_l} \sigma_k(f(x^i)) \right) D_{w_l} f_l(x^i) \\ &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) \, {x^i}^T \\ D_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l). \end{align*} Here $D_{g} \log(g) = 1/g$ is evaluated at the corresponding softmax output, and substituting the softmax derivatives together with $1^T y^i = 1$ collapses the inner sum to $\sigma_l(f(x^i)) - y^i_l$. The gradient is \begin{align*} \nabla_{w_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n x^i (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} X^T (\sigma_l(f(X)) - Y_l) \\ \nabla_{b_l} H_n(\sigma(f(x))) &= \frac{1}{n} \sum_{i = 1}^n (\sigma_l(f(x^i)) - y^i_l) = \frac{1}{n} 1_n^T (\sigma_l(f(X)) - Y_l). \end{align*} We can collect the weights and biases into the matrix $W \in M_{d \times K}(\mathbb{R})$ and the vector $b \in \mathbb{R}^K$ so that \begin{align*} dW &= \frac{1}{n} X^T (\sigma(f(X)) - Y) \\ db &= \frac{1}{n} 1_n^T (\sigma(f(X)) - Y) \end{align*} collect the gradients for all classes. We train the multinomial logistic regression with gradient descent. Let $\eta$ be the learning rate and $W^0, b^0$ the initial values. Then the gradient descent update for all of the weights and biases is \begin{align*} W^{k + 1} &= W^{k} - \eta \, dW^k \\ b^{k + 1} &= b^{k} - \eta \, db^k \end{align*} at each iteration $k$.
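As a sanity check on the derivation, the analytic gradient $dW$ can be compared against a finite-difference approximation of $H_n$. Below is a minimal sketch under assumed small random data; the helper names softmax and cross_entropy and the specific shapes are illustrative, not part of the text above.

import numpy as np

def softmax(A):
    # Row-wise softmax of an (n, K) matrix of scores.
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(Y, P):
    # H_n: average negative log-likelihood of the one-hot labels Y under P.
    return -np.mean(np.sum(Y * np.log(P), axis=1))

rng = np.random.default_rng(0)
n, d, K = 20, 5, 3
X = rng.normal(size=(n, d))
Y = np.eye(K)[rng.integers(0, K, size=n)]   # one-hot labels
W = rng.normal(size=(d, K))
b = rng.normal(size=(1, K))

# Analytic gradient: dW = (1/n) X^T (sigma(f(X)) - Y)
P = softmax(X @ W + b)
dW = X.T @ (P - Y) / n

# Finite-difference check of one entry of dW
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
num = (cross_entropy(Y, softmax(X @ W_plus + b))
       - cross_entropy(Y, softmax(X @ W_minus + b))) / (2 * eps)
print(dW[0, 0], num)   # the two values should agree to several decimals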
To predict with multinomial regression, the softmax sigmoid gives us $K$ probability values, one per class, at each sample $x_i$, and we predict the class with the largest probability \[ \hat{y}_i = \operatorname*{arg\,max}_{k} \sigma_k(f(x_i)). \]
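For a whole batch of samples this is a single argmax along the class axis. A minimal sketch, assuming P holds the softmax outputs with one row per sample:

import numpy as np

# P: (n_samples, K) softmax probabilities; two illustrative rows
P = np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])
y_hat = np.argmax(P, axis = 1)   # predicted class index per sample
print(y_hat)                     # [1 0]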
Generally the label data is collected as class labels $y_i \in \{ 0, \ldots, K - 1 \}$, so we need to convert each label into a vector $y_i \in \{ 0, 1 \}^K$ such that $1^T y_i = 1$. This is usually called one-hot encoding, converting the vector $(y_i)$ into the matrix $Y$. The data is then the pair of matrices $(Y, X)$, where the rows of each matrix correspond to the samples.
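A minimal sketch of this conversion, assuming integer labels in $\{0, \ldots, K-1\}$ (the same indexing trick appears in the train method below):

import numpy as np

y = np.array([2, 0, 1, 2])           # integer class labels
K = np.unique(y).shape[0]            # number of classes
n = y.shape[0]

Y = np.zeros((n, K))
Y[np.arange(n), y.astype(int)] = 1   # set a single 1 per row
print(Y)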
Also, in this setup the parameterization of the multinomial probabilities is not one-to-one: the probabilities sum to 1, so the $K$ sets of weights and biases are redundant. We could fix this by modeling only $K - 1$ classes and using $1 - 1_{K - 1}^T p$ as the $K$th probability.
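One way to see the redundancy is that the softmax is invariant to adding the same constant to every score, so distinct parameters can produce the same probabilities: \[ \sigma_k(f + c\,1_K) = \frac{\exp(f_k + c)}{\sum_{j = 1}^K \exp(f_j + c)} = \frac{e^{c} \exp(f_k)}{e^{c} \sum_{j = 1}^K \exp(f_j)} = \sigma_k(f) \quad \text{for any } c \in \mathbb{R}. \]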
We implement the multinomial regression in Python.
import numpy as np
import matplotlib.pyplot as plt

class MultRegression():
    @staticmethod
    def softmax(M):
        exps = np.exp(M)
        S_exps = (exps @ np.ones(M.shape[1]))[:, np.newaxis]
        return 1/S_exps * exps

    @staticmethod
    def entropy(Y, P):
        n_samples = P.shape[0]
        n_classes = P.shape[1]
        return -1/n_samples * np.ones(n_samples).T @ (Y * np.log(P) @ np.ones(n_classes))

    def train(self, X, y, max_iter = 501, learning_rate = .1):
        n_samples = X.shape[0]
        n_features = X.shape[1]
        n_classes = np.unique(y).shape[0]

        # Convert to a multinomial vector
        Y = np.zeros((n_samples, n_classes))
        Y[np.arange(n_samples), np.array(y.T, dtype = int)] = 1

        self.W = np.zeros((n_features, n_classes))
        self.b = np.zeros((1, n_classes))

        for i in range(0, max_iter):
            # Forward pass
            A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
            O = MultRegression.softmax(A_o)
            if i % 100 == 0:
                print(f"Iteration: {i}, Training Loss: {MultRegression.entropy(Y, O)}")

            # Backward pass
            dW = 1/n_samples * X.T @ (O - Y)
            db = 1/n_samples * np.ones((n_samples, 1)).T @ (O - Y)

            # Gradient step
            self.W = self.W - learning_rate * dW
            self.b = self.b - learning_rate * db

    def predict(self, X):
        n_samples = X.shape[0]
        n_classes = self.W.shape[1]
        A_o = X @ self.W + self.b * np.ones((n_samples, n_classes))
        O = MultRegression.softmax(A_o)
        return np.argmax(O, axis = 1)
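The softmax above exponentiates the raw scores directly, which can overflow for large score values. A common numerically stable variant (a sketch, not part of the class above) subtracts the row-wise maximum before exponentiating; the output is unchanged because the softmax is invariant to adding a constant to each row:

import numpy as np

def softmax_stable(M):
    # Subtract the row-wise max before exponentiating; the softmax is
    # unchanged because the added constant cancels in the ratio.
    shifted = M - np.max(M, axis = 1, keepdims = True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis = 1, keepdims = True)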
We load the test set from the UCI optical handwritten digits dataset, do a train/test split, and plot some of the images.

test = np.loadtxt("data/optdigits_test.txt", delimiter = ",")
X = test[:, 0:64]
y = test[:, 64]

# Train/test split
n_samples = X.shape[0]
n_TRAIN = int(.75 * n_samples)
I = np.arange(0, n_samples)
TRAIN = np.random.choice(I, n_TRAIN, replace = False)
TEST = np.setdiff1d(I, TRAIN)
X_train = X[TRAIN, :]
y_train = y[TRAIN]
X_test = X[TEST, :]
y_test = y[TEST]

# Plot some of the digits
fig = plt.figure(figsize=(8, 6))
fig.tight_layout()
for i in range(0, 20):
    ax = fig.add_subplot(5, 5, i + 1)
    ax.imshow(X[i].reshape((8,8)), cmap = "Greys", vmin = 0, vmax = 16)
plt.show()
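Since the split draws a random subset without fixing a seed, the reported accuracies will vary slightly between runs. A small self-contained sketch of one way to make the split repeatable (an addition, not in the original code; X_demo stands in for the loaded feature matrix):

import numpy as np

np.random.seed(0)                          # fix the RNG so the split is repeatable
X_demo = np.arange(40).reshape(10, 4)      # stand-in for the loaded feature matrix
n_samples = X_demo.shape[0]
I = np.arange(n_samples)
TRAIN = np.random.choice(I, int(.75 * n_samples), replace = False)
TEST = np.setdiff1d(I, TRAIN)
print(TRAIN, TEST)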
We train and report the training and test accuracies.
mlr = MultRegression()
mlr.train(X_train, y_train)

print("Train accuracy:", 1/X_train.shape[0] * np.sum((mlr.predict(X_train) == y_train).astype(int)))
print("Test accuracy:", 1/X_test.shape[0] * np.sum((mlr.predict(X_test) == y_test).astype(int)))
Iteration: 0, Training Loss: 2.3025850929940437
Iteration: 100, Training Loss: 0.0570321213260572
Iteration: 200, Training Loss: 0.023883837390567066
Iteration: 300, Training Loss: 0.014569671068230965
Iteration: 400, Training Loss: 0.010310262882962419
Iteration: 500, Training Loss: 0.008206230605775272
Train accuracy: 1.0
Test accuracy: 0.9688888888888889
References.
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press.
Richard O. Duda, Peter E. Hart, and David G. Stork. 2000. Pattern Classification (2nd Edition). Wiley-Interscience, New York, NY, USA.
Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Simon Haykin. 2007. Neural Networks (3rd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
https://github.com/zotroneneis/machine_learning_basics/blob/master/softmax_regression.ipynb