HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Multiclass Logistic Regression (Softmax) with TensorFlow

Introduction

In this exercise notebook you will implement a multiclass logistic regression model using TensorFlow. To do so, one would normally use TensorFlow's predefined functions for the softmax prediction, the cross-entropy costs and an optimizer based on the gradient descent update algorithm.

Here you will not use any of them, but implement them yourself only using basic TensorFlow functions like tf.matmul, tf.transpose, etc. An exception is the tf.gradients function, which returns the gradient of a function with respect to a variable / list of variables. This gradient can then be used to define the update algorithm.

Besides consolidating your theoretical knowledge about gradient descent, knowing how to use TensorFlow's autograd feature can be very useful whenever you want to compute something via gradients that is not covered by the standard built-ins, e.g. defining your own cost function and update rule.
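For example, here is a minimal, self-contained sketch (TensorFlow 1.x) of what tf.gradients returns; the variable a and the toy loss are purely illustrative and not part of the exercise:

import tensorflow as tf

# Toy illustration of tf.gradients: for loss = a^2 the derivative is 2*a,
# so the gradient node evaluates to 6.0 at a = 3.0.
a = tf.Variable(3.0)
toy_loss = a * a
grad_a = tf.gradients(toy_loss, [a])[0]  # symbolic gradient node

with tf.Session() as toy_session:
    toy_session.run(a.initializer)
    print(toy_session.run(grad_a))  # -> 6.0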

In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. In this notebook, however, these cells will only detect a small portion of possible errors, e.g. your implemented function returning a wrong shape.

Requirements

Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

  • Logistic regression
  • Softmax function
  • Cross-entropy
  • Gradient descent
  • Basic TensorFlow dataflow (see below)

The following material can help you to acquire this knowledge:

Python Modules

# External Modules
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from numpy.testing import assert_almost_equal

if int(tf.__version__.split('.')[0]) != 1:
    raise Warning('ATTENTION: This notebook was designed with tensorflow version 1.xx in mind.\n\n Suggested tensorflow methods during the exercises might NOT function as described. We suggest installing the latest 1.xx version of tensorflow in a separate environment to solve this exercise.')

%matplotlib inline

tf.reset_default_graph()
sess = tf.InteractiveSession()

Exercise - Multiclass Logistic Regression (Softmax) with TensorFlow

Training Data

We are given $ m $ examples in our training data $ \mathcal D = \{(\vec x^{(1)}, y^{(1)}),(\vec x^{(2)},y^{(2)}), \dots, (\vec x^{(m)},y^{(m)})\} $, where $ \vec x^{(1)} $ denotes the first feature vector and $ y^{(1)} $ the corresponding class.

We will create our own training data by drawing samples from three different Gaussian distributions, which our model should learn to separate. To make things concrete, we will use:

  • two features $ \vec x = (x_1, x_2)^T $
  • three classes: $ y \in \{ 0, 1, 2\} $
  • 100 examples for each class
# class 0:
# covariance matrix and mean
cov0 = np.array([[5,-4],[-4,4]])
mean0 = np.array([2.,3])
# number of data points
m0 = 100

# class 1
# covariance matrix and mean
cov1 = np.array([[5,-3],[-3,3]])
mean1 = np.array([0.5,0.5])
m1 = 100

# class 2
# covariance matrix and mean
cov2 = np.array([[2,0],[0,2]])
mean2 = np.array([8.,-5])
m2 = 100

# generate m0 gaussian distributed data points with
# mean0 and cov0.
r0 = np.random.multivariate_normal(mean0, cov0, m0)
r1 = np.random.multivariate_normal(mean1, cov1, m1)
r2 = np.random.multivariate_normal(mean2, cov2, m2)

def plot_data(r0, r1, r2):
    plt.figure(figsize=(7.,7.))
    plt.scatter(r0[...,0], r0[...,1], c='r', marker='o', label="Class 0")
    plt.scatter(r1[...,0], r1[...,1], c='y', marker='o', label="Class 1")
    plt.scatter(r2[...,0], r2[...,1], c='b', marker='o', label="Class 2")
    plt.xlabel("$x_1$")
    plt.ylabel("$x_2$")
    plt.legend()
# Let's visualize our training data

plot_data(r0, r1, r2)
X = np.concatenate((r0, r1, r2), axis=0)
X.shape
y = np.concatenate((np.zeros(m0), np.ones(m1), 2 * np.ones(m2)))
y.shape
# shuffle the data
assert X.shape[0] == y.shape[0]
perm = np.random.permutation(np.arange(X.shape[0]))
#print(perm)
X = X[perm]
y = y[perm]

Implement the Model

Since we have discrete classes and not continuous values, we have to implement logistic regression (as opposed to linear regression). Logistic regression implies the use of the logistic function. But since the number of classes exceeds two, we have to use its generalized form, the softmax function.
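For intuition: with only two classes, the softmax output for class 1 reduces to the logistic (sigmoid) function applied to the difference of the net outputs, which is why the softmax can be seen as the multiclass generalization of the logistic function:

$ h_1 = \frac{\exp(z_1)}{\exp(z_0) + \exp(z_1)} = \frac{1}{1 + \exp(-(z_1 - z_0))} $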

Task:

Implement softmax regression. This can be split into three subtasks:

  1. Implement the softmax function for prediction.
  2. Implement the computation of the cross-entropy loss.
  3. Implement vanilla gradient descent.

Softmax

Task 1:

Implement the softmax prediction $ h_i $, defined for each class $ i $ as:

$ h_i = \frac{\exp(z_i)}{\sum_{k=1}^c\exp (z_k)} $

with $ c $ denoting the number of classes and $ z_i $ the net output for class $ i $, where the whole vector $ \vec z $ is defined as:

$ \vec z = W \vec{x} + \vec b $

Hint:

Remember that your functions should be able to handle multiple or even all $ \vec x $ vectors at once.

Evaluating softmax should look like:

    in> h.eval(feed_dict={x: X})

    out> array([[1.62411915e-08, 1.70372473e-03, 9.98296261e-01],
                [3.72431863e-08, 3.27572320e-03, 9.96724248e-01],
                [9.83378708e-01, 1.66097078e-02, 1.15793373e-05],
                .....
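As a purely numerical illustration of the formula (not the required TensorFlow implementation), a row-wise softmax could be computed in NumPy as sketched below; the helper name softmax_np is only illustrative, and subtracting the row maximum before exponentiating is a common trick for numerical stability:

# Illustration only: row-wise softmax for z with shape (n_examples, n_classes).
def softmax_np(z):
    z = z - z.max(axis=1, keepdims=True)  # improves numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

print(softmax_np(np.array([[1.0, 2.0, 3.0],
                           [0.0, 0.0, 0.0]])))  # each row sums to 1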
### First we define Variables for the weights W and bias b.
### From Docstring:
### "A variable maintains state in the graph across calls to run() ...
### ... constructor requires an initial value ..." 
NUM_LABELS = 3
NUM_FEATURES = 2
D_TYPE = tf.float32
I_TYPE = tf.int32
W = tf.Variable(tf.random_uniform([NUM_FEATURES, NUM_LABELS], dtype=D_TYPE))
b = tf.Variable(tf.zeros([NUM_LABELS], dtype=D_TYPE))
### And placeholders for the training data.
### From Docstring:
### "This tensor will produce an error if evaluated. Its value must
### be fed using the `feed_dict` optional argument to `Session.run()`

### Using None in the first dimension allows to feed a variable number
x = tf.placeholder(shape=[None, NUM_FEATURES], dtype=D_TYPE, name="features")
t = tf.placeholder(shape=[None], dtype=I_TYPE, name="targets")
### Variables must be initialized by running an `init` Op after having
### launched the graph.  We first have to add the `init` Op to the graph.

init_op= tf.global_variables_initializer()
sess.run(init_op)
### Implement this function

def net_output(x, W, b):
    """
    Calculates the net output z = x @ W + b
    (the batched form of z = W * x + b).
    
    :x: Input features.
    :x type: 2D-Tensor of type float32 with 
            shape (n_examples, n_features).
    :W: Weight matrix.
    :W type: 2D-Tensor of type float32 with 
            shape (n_features, n_classes).
    :b: Bias vector.
    :b type: 1D-Tensor of type float32 with 
            shape (n_classes,).
        
    :returns: The net output.
    :r type: 2D-Tensor of type float32
            with shape (n_examples, n_classes).
    """
    raise NotImplementedError()
### Implement this function

def softmax(z):
    """
    Returns the softmax-normalized predictions for z.
    
    :z: Net outputs.
    :z type: 2D-Tensor of type float32 with 
            shape (n_examples, n_classes).
    
    :returns: softmax prediction.
    :r type: Tensor with same type and shape as z.
    """
    raise NotImplementedError()
z = net_output(x, W, b)
h = softmax(z)

some_predictions = h.eval(feed_dict={x: X[0:2]})
print(some_predictions)

assert_almost_equal(some_predictions[0].sum(), 1.0)
assert_almost_equal(some_predictions[1].sum(), 1.0)

Cross-Entropy

Task 2:

Implement the computation of the cross-entropy loss. Don't use any built-in TensorFlow function for the cross-entropy.

Reminder:

\begin{equation} \begin{split} H(p, q) & = \sum_{i=1}^c p_i(x) \cdot \log \frac{1}{q_i(x)} \\ & = -\sum_{i=1}^c p_i(x) \cdot \log q_i(x) \\ \end{split} \end{equation}

with

  • the number of classes $ c $
  • the correct class distribution $ p(x) $
  • and the predictions of our net $ q(x) $ (softmax output)

Hint:

Return the cross-entropy average: $ J(W,b) = \frac{1}{m} \sum_{j=1}^m H\left(p(\vec x^{(j)}),q(\vec x^{(j)})\right) $
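For illustration only (not the required TensorFlow implementation), this average could be computed in NumPy roughly as follows; the helper name cross_entropy_np and the conversion of integer labels to one-hot rows via np.eye are just one possible way to express it:

# Illustration only: average cross-entropy over m examples,
# with integer targets converted to one-hot rows via np.eye.
def cross_entropy_np(targets, predictions):
    m, n_classes = predictions.shape
    p = np.eye(n_classes)[targets.astype(int)]  # one-hot targets, shape (m, n_classes)
    return -np.mean(np.sum(p * np.log(predictions), axis=1))

# two examples with targets 0 and 2 and fairly confident predictions
print(cross_entropy_np(np.array([0, 2]),
                       np.array([[0.9, 0.05, 0.05],
                                 [0.1, 0.1, 0.8]])))  # ~0.164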

### Implement this function

def cross_entropy(targets, predictions):
    """
    Computes the cross-entropy average.
    
    :targets: True classes as integer labels (one scalar per example).
    :targets type: tf.Tensor with shape (n_examples,).
    :predictions: Predictions as softmax output.
    :predictions type: tf.Tensor with shape (n_examples, n_classes).
    
    :returns: cross-entropy average.
    :r type: Tensor of type float32
    """
    raise NotImplementedError()
# t is the tensorflow placeholder for the targets (class labels)
cost = cross_entropy(t, h)

some_cost = cost.eval(feed_dict={x: X, t: y})
print(some_cost)

assert some_cost.dtype == np.float32

Gradient Descent

Task 3:

Implement gradient descent and train the model:

  • Implement the gradient descent update rule. Don't use any built-in TensorFlow optimizer!

    • Use tf.gradients for computing the gradients.
    • Use tf.assign for updating.
  • Iteratively apply the update rule to minimize the loss.
  • Train for 100 epochs
  • Use minibatches with size 50
  • Keep track of the costs after each epoch
  • Decide about an appropriate learning rate

Reminder:

Equation for the update rule:

$ \begin{align} W' & = W - \alpha \cdot \frac{\partial}{\partial W} J(W, b)\\\\ b' & = b - \alpha \cdot \frac{\partial}{\partial b} J(W, b) \end{align} $
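To illustrate the mechanics of combining tf.gradients with tf.assign, here is a minimal toy sketch, separate from the model above, that minimizes $ f(w) = (w - 5)^2 $ with exactly this update rule; the name w_toy and the learning rate 0.1 are purely illustrative:

# Toy illustration (TF 1.x): gradient descent on f(w) = (w - 5)^2
# built from tf.gradients and tf.assign, following w' = w - alpha * df/dw.
w_toy = tf.Variable(0.0)
f_toy = (w_toy - 5.0) ** 2
grad_w = tf.gradients(f_toy, [w_toy])[0]
update_w = tf.assign(w_toy, w_toy - 0.1 * grad_w)

with tf.Session() as toy_session:
    toy_session.run(w_toy.initializer)
    for _ in range(50):
        toy_session.run(update_w)
    print(toy_session.run(w_toy))  # close to 5.0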

### Complete this cell

nb_epochs = 100
minibatch_size = 50
learning_rate = 1337 ### Decide about an appropriate learning rate

cost_per_epoch = []

### Your code ...

Plot

Cost (Loss) over Epochs

Plot of the cost progress vs. epochs.

The output should look similar to the following:

plt.plot(range(len(cost_per_epoch)), cost_per_epoch)
plt.xlabel('epoch')
plt.ylabel('cost')
plt.title('Learning Progress')

Decision Boundary After Training

The following function plots the data together with the decision boundaries after the training. The model should be trained well enough to separate most (roughly ~95%) of the data correctly. Use the following code for plotting.

The output should look similar to the following:

def plot_decision_boundary(iteration=None, x_min=-10, x_max=14, y_min=-10, y_max=10):    
    fig = plt.figure(figsize=(8,8))
  
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)

    delta = 0.1
    a = np.arange(x_min, x_max+delta, delta)
    b = np.arange(y_min, y_max+delta, delta)
    A, B = np.meshgrid(a, b)

    x_ = np.dstack((A, B)).reshape(-1, 2)

    out = h.eval(feed_dict={x: x_})

    ns = list()
    ns.append(3)
    ns.extend(A.shape)
    out = out.T.reshape(ns)

    plt.pcolor(A, B, out[0], cmap="Blues", alpha=0.2)
    plt.pcolor(A, B, out[1], cmap=('Oranges'), alpha=0.2)
    plt.pcolor(A, B, out[2], cmap=('Greens'), alpha=0.2)
    # let's visualize the data:
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)

    plt.title("Decision boundaries in data space.")
plot_decision_boundary()

Decision Boundary Before Training

Now we reinitialize our model's variables to visualize how the decision boundaries might have looked before the training. Since we initialize our weights with tf.random_uniform, this will look different for every execution.

sess.run(init_op)
plot_decision_boundary()

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

HTW Berlin - Angewandte Informatik - Advanced Topics - Exercise - Multiclass Logistic Regression (Softmax) with tensorflow
by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.