Exercise - Multiclass Logistic Regression (Softmax) with PyTorch

Introduction

Teaching objectives of this notebook are:

  • Implementing a logistic regression model using PyTorch
  • Understanding how to use PyTorch's autograd feature by implementing gradient descent.

In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal.

Requirements

Knowledge

You should possess knowledge about:

  • Logistic regression
  • Softmax
  • Gradient descent
  • Chapters 5 and 6 of the Deep Learning Book
  • Chapter 5 of the book Pattern Recognition and Machine Learning by Christopher M. Bishop [BIS07]
  • Video 15.3 and following in the playlist Machine Learning

Python Modules

import numpy as np

import scipy.stats
from scipy.stats import norm

from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize

%matplotlib inline
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

Training Data

$ m $ training examples $ \mathcal D = \{(\vec x^{(1)}, y^{(1)}),(\vec x^{(2)},y^{(2)}), \dots ,(\vec x^{(m)},y^{(m)})\} $,

here with

  • two features: $ \vec x = (x_1, x_2)^T $
  • three classes: $ y \in \{ 0, 1, 2\} $
# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 100

# class 1
# covariance matrix and mean
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([0.5,0.5])
# number of data points
m1 = 100

# class 2
# covariance matrix and mean
cov2 = np.array([[2,0],[0,2]])
mean2 = np.array([8.,-5])
# number of data points
m2 = 100

# generate Gaussian distributed data points for each class,
# using the means and covariance matrices defined above
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
r2 = np.random.multivariate_normal(mean2, cov2, m2)

def plot_data(r0, r1, r2):
    plt.figure(figsize=(7.,7.))
    plt.scatter(r0[...,0], r0[...,1], c='r', marker='o', label="class 0")
    plt.scatter(r1[...,0], r1[...,1], c='y', marker='o', label="class 1")
    plt.scatter(r2[...,0], r2[...,1], c='b', marker='o', label="class 2")
    plt.xlabel("$x_0$")
    plt.ylabel("$x_1$")
    plt.legend()
plot_data(r0, r1, r2)
X = np.concatenate((r0, r1, r2), axis=0)
X.shape
y = np.concatenate((np.zeros(m0), np.ones(m1), 2 * np.ones(m2)))
y.shape
# shuffle the data
assert X.shape[0] == y.shape[0]
perm = np.random.permutation(np.arange(X.shape[0]))
X = X[perm]
y = y[perm]

Exercise - Implement the Model

Since we have discrete classes and not continuous values, we have to use logistic regression (as opposed to linear regression). Logistic regression implies the use of the logistic function. But as the number of classes exceeds two, we have to use its generalized form, the softmax function.

Task:

Implement softmax regression. This can be split into three subtasks:

  1. Implement the softmax function for prediction.
  2. Implement the computation of the cross-entropy loss.
  3. Implement vanilla gradient descent.

Softmax

The softmax-function is defined as:

$ \sigma(z_j) = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)} $

with:

  • $ z_j = \theta_{0,j} + \theta_{1,j} x_1 + \theta_{2,j} x_2 $, as we have two features
  • $ j $, the current class; we have three, so $ j \in \{0,1,2\} $
  • $ K $, the number of classes

So for every training example, we compute $ \sigma(z_j) $ for every class. The result is the probability that the current training example $ \vec x^{(i)} $ belongs to class $ j $. As always with probabilities, they should sum to $ 1.0 $.
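
As a quick numerical illustration (a minimal sketch in plain PyTorch with hypothetical scores, not part of the exercise solution), the softmax turns arbitrary scores into probabilities that sum to one:

# hypothetical scores z_j for a single example with K = 3 classes
z = torch.tensor([2.0, 1.0, 0.1])

# exponentiate and normalize
probs = torch.exp(z) / torch.exp(z).sum()
print(probs)                      # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())                # sums to 1.0

# the same result with the built-in
print(torch.softmax(z, dim=0))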

Visualizing the data flow:

softmax_regression_2_features_3_classes.svg

Task 1:

Implement Softmax Regression as an nn.Module. If you have done the notebook about linear regression before, you should already be familiar with torch.nn.Linear. Just pipe its output through torch.nn.Softmax.

Again: add torch.nn.Linear and torch.nn.Softmax as class members and use them in the forward method.

If you do not want to use PyTorch's built-in functions, you can of course implement the softmax function yourself ;-)
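
If these two modules are new to you, the following minimal sketch shows how torch.nn.Linear and torch.nn.Softmax behave on a dummy batch; it only demonstrates the building blocks and is not the solution class itself:

# stand-alone demonstration of the two building blocks
linear = nn.Linear(in_features=2, out_features=3)    # computes z = x W^T + b
softmax = nn.Softmax(dim=1)                           # normalizes over the class dimension

dummy = torch.randn(5, 2)             # 5 examples with two features each
probabilities = softmax(linear(dummy))
print(probabilities.shape)            # torch.Size([5, 3])
print(probabilities.sum(dim=1))       # every row sums to 1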

Hint:

In our case, with two features, the input data has the shape (m_examples, n_features):

    tensor([[-0.6617, -0.0426],
            [-1.3328,  0.5161],
            ....

The forward method should return the probabilities for the three classes, e.g.

    tensor([[ 0.1757,  0.3948,  0.4295],
            [ 0.0777,  0.3502,  0.5721], 
            ....
class SoftmaxRegression(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, num_features):

        super(SoftmaxRegression, self).__init__()

        ###############################
        ##### YOUR SOLUTION START #####
        ###############################

        raise NotImplementedError()

        ###############################
        ##### YOUR SOLUTION End   #####
        ###############################
    
    def forward(self, x):
        ###############################
        ##### YOUR SOLUTION START #####
        ###############################
        # should return the probabilities for the classes, e.g.
        # tensor([[ 0.1757,  0.3948,  0.4295],
        #         [ 0.0777,  0.3502,  0.5721], 
        #         ...

        raise NotImplementedError()

        ###############################
        ##### YOUR SOLUTION End   #####
        ###############################        
NUM_LABELS = 3
NUM_FEATURES = 2
model = SoftmaxRegression(NUM_LABELS, NUM_FEATURES)
### Should output something like:
###
### SoftmaxRegression(
###   (linear): Linear(in_features=2, out_features=3, bias=True)
###   (softmax): Softmax()
### )
print(model)

The output of the cell below should look something like this (the numbers can vary, as the parameters are randomly initialized):

Parameter containing:
tensor([[ 0.3643, -0.3121],
        [-0.1371,  0.3319],
        [-0.6657,  0.4241]], requires_grad=True)
Parameter containing:
tensor([-0.1455,  0.3597,  0.0983], requires_grad=True)

Task:

Take a look at the following graph, depicting our architecture, and try to answer:

  • Which variables in the graph correspond to which tensors in the print statements below?

softmax_regression_2_features_3_classes.svg

### Iterate through our trainable parameters
for param in model.parameters():
    print (param)
    
### If you have no idea, uncomment and execute the line below:
#model.state_dict()
# test if the probabilities for an example sum to 1:
data = torch.randn(4, 2) # 4 examples with two features each
pred = model(data)

# the probabilities for an example should sum to 1:
np.testing.assert_allclose(pred.detach().numpy().sum(axis=1), 1.)
print(pred)

Cross-Entropy

Task 2:

Implement the computation of the cross-entropy loss. Don't use any built-in PyTorch function for the cross-entropy.

Reminder:

\begin{equation} \begin{split} H(p, q) & = \sum_{k=1}^{K} p_k(x) \cdot \log \frac{1}{q_k(x)} \\ & = -\sum_{k=1}^{K} p_k(x) \cdot \log q_k(x) \\ \end{split} \end{equation}

with

  • $ K $, the number of classes
  • $ p(x) $, the true probabilities for the classes
  • Hint: We assume this is always 1.0 for the correct class and 0.0 for the other classes
  • $ q(x) $, the predictions of our net (softmax output)

Hint:

Return the cross-entropy average: $ J(\theta) = \frac{1}{m} \sum_{j=1}^m H\left(p(\vec x^{(j)}),q(\vec x^{(j)})\right) $
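
As a quick sanity check (a minimal NumPy sketch with hypothetical numbers; it does not use autograd and is therefore no substitute for the task), the cross-entropy of a single example whose true class is 1 reduces to the negative log of the predicted probability for that class:

p = np.array([0., 1., 0.])        # one-hot "true" distribution (correct class: 1)
q = np.array([0.2, 0.7, 0.1])     # hypothetical softmax output
h = -np.sum(p * np.log(q))        # only the term of the true class survives
print(h)                          # -log(0.7) ≈ 0.357

Because $ p(x) $ is one-hot, $ H(p, q) $ collapses to $ -\log q_y $ for the true class $ y $; your cross_entropy function should average this quantity over all examples in the batch.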

# method that returns the cross-entropy computed with pytorch
# so we can use the grad for gradient descent
def cross_entropy(predictions, targets):
    
    ###############################
    ##### YOUR SOLUTION START #####
    ###############################
    #
    # Task: cross-entropy average as pytorch tensor (scalar)

    raise NotImplementedError()

    ###############################
    ##### YOUR SOLUTION End   #####
    ###############################
targets = torch.tensor([0,2,1,0], dtype=torch.int64)
# fancy indexing: pick, for each of the 4 examples, the predicted probability of its true class
pred[np.arange(4), targets]

costs = cross_entropy(pred, targets).item()
print(costs)

# costs should be a float >= 0.0
assert costs >= 0.0

Gradient Descent

Task 3:

Train the model with gradient descent.

  • Convert the data to torch tensors.
  • Implement the gradient descent update rule.
  • Iteratively apply the update rule to minimize the loss.

    • Hint: Print the costs every ~100 epochs to get instant feedback about the training success

Reminder:

Equation for the update rule:

$ \theta_j' = \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j} J(\theta) $
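
If the autograd mechanics are new to you, the following toy sketch shows what one update step looks like for a single scalar parameter. It minimizes $ (\theta - 3)^2 $ rather than our model's loss, so it does not give away the solution; in the exercise the same pattern is applied to every tensor in model.parameters():

# toy example: minimize (theta - 3)^2 with vanilla gradient descent
theta = torch.tensor(0.0, requires_grad=True)
lr = 0.1

for epoch in range(100):
    loss = (theta - 3.0) ** 2     # forward pass
    loss.backward()               # computes d loss / d theta into theta.grad
    with torch.no_grad():         # the update itself must not be tracked by autograd
        theta -= lr * theta.grad
    theta.grad.zero_()            # reset the gradient for the next iteration

print(theta)                      # close to 3.0

For the data conversion, torch.tensor(X, dtype=torch.float32) for the features and torch.tensor(y, dtype=torch.int64) for the targets are the usual choices, since the targets are later used as class indices.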

###############################
##### YOUR SOLUTION START #####
###############################
#
# Task: Convert numpy arrays to tensors
#

###############################
##### YOUR SOLUTION End   #####
###############################
### If your implementation is correct, these tests should not throw an exception

print(X_tensor.shape) ### should be [300,2]
print(y_tensor.shape) ### should be [300]

assert X_tensor.shape[0] == 300
assert X_tensor.shape[1] == 2
assert y_tensor.shape[0] == 300
def update_step(model, loss_function, x_, y_, lr):
    
    ###############################
    ##### YOUR SOLUTION START #####
    ###############################

    raise NotImplementedError()

    ###############################
    ##### YOUR SOLUTION End   #####
    ###############################
def gradient_descent(data, targets, loss_function, model, lr = 0.5, nb_epochs = 1000):
    
    ###############################
    ##### YOUR SOLUTION START #####
    ###############################

    raise NotImplementedError()

    ###############################
    ##### YOUR SOLUTION End   #####
    ###############################
nb_epochs = 1000
# cost is a numpy array with the cost function value at each iteration.
# will be used below to print the progress during learning
cost = gradient_descent(X_tensor, y_tensor, loss_function=cross_entropy, model=model, lr = 0.5, nb_epochs = nb_epochs)

Plot

Cost-(Loss)-over-Iterations

Plot the costs per epoch. Just execute the cells. The output should look similar to the following:

plt.plot(range(nb_epochs), cost)
plt.xlabel('# of iterations')
plt.ylabel('cost')
plt.title('Learning Progress')

Decision-Boundary-After-Training

Plot the data with the decisions. Just execute the cells. The output should look similar to the following:

def plot_decision_boundary(model):    
    fig = plt.figure(figsize=(8,8))

    x_start = -5
    x_end = 15
    y_start = -10
    y_end = 10
    plt.xlim(x_start, x_end)
    plt.ylim(y_start, y_end)
    

    delta = 0.1
    a = np.arange(x_start, x_end+delta, delta)
    b = np.arange(y_start, y_end+delta, delta)
    A, B = np.meshgrid(a, b)

    x_ = np.dstack((A, B)).reshape(-1, 2)
    x = torch.tensor(x_, dtype=torch.float32) 
    pred = model(x)
    out = pred.detach().numpy()

    # reshape to one grid per class: shape (3, len(b), len(a))
    out = out.T.reshape(3, *A.shape)

    plt.pcolor(A, B, out[0], cmap="Blues", alpha=0.2)
    plt.pcolor(A, B, out[1], cmap="Oranges", alpha=0.2)
    plt.pcolor(A, B, out[2], cmap="Greens", alpha=0.2)

    # let's visualize the data:
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)

    plt.title("Data and class predictions in data space.")
plot_decision_boundary(model)

Using PyTorch Built-Ins

Task:

Now create a new model with untrained parameters and this time use PyTorch's built-ins (a minimal sketch follows the list below):

  • torch.nn.CrossEntropyLoss for the cost function.
  • torch.optim.SGD, torch.optim.Adam, or any other optimizer to update your model.
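
A minimal sketch of the typical pattern, assuming the tensors X_tensor and y_tensor from above; the names new_model, loss_fn, and optimizer are placeholders. Note that torch.nn.CrossEntropyLoss expects raw scores (logits) and applies the log-softmax internally, so the cleanest setup is a final layer without a softmax:

# sketch: linear layer producing logits, built-in loss and optimizer
new_model = nn.Linear(NUM_FEATURES, NUM_LABELS)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(new_model.parameters(), lr=0.5)

for epoch in range(1000):
    optimizer.zero_grad()                # clear gradients from the previous step
    logits = new_model(X_tensor)         # forward pass (raw scores, no softmax)
    loss = loss_fn(logits, y_tensor)     # targets must be int64 class indices
    loss.backward()                      # backpropagation
    optimizer.step()                     # parameter update

If you reuse the SoftmaxRegression class instead, it will still learn, but the loss then effectively applies a softmax twice; keeping the model's output as logits and applying a softmax only for plotting is the cleaner choice.
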
###############################
##### YOUR SOLUTION START #####
###############################
#
# Task: Create a new model and train it with the built-in cost function and an optimizer


###############################
##### YOUR SOLUTION End   #####
###############################
### The model you just trained should be named "model";
### alternatively, adjust the argument below to match your model's variable name.
plot_decision_boundary(model)

Literature

  • [BIS07] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise - Multiclass Logistic Regression (Softmax) with PyTorch
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.