Dropout (Differentiable Programming)

Introduction

This notebook walks you through an implementation of a regularization technique called dropout. The idea is that during training, each forward pass randomly selects units to 'drop out' of the network, i.e. temporarily removes them together with their connections. This forces the surviving units to learn without depending too heavily on the cooperation of other units and to produce useful features on their own.

Requirements

Knowledge

The dropout paper #SRI14 (see Literature) is a useful resource on the topic, though it's not required to read it entirely before tackling this notebook.

Python Modules

import numpy as np

from dp import NeuralNode, Node

from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

Data

This cell loads the breast cancer dataset bundled with sklearn, standardizes the features, and splits the data into training and test sets.

x, y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

Dropout

Left: Units and connections in a neural net without dropout | Right: Units and connections of the net with dropout applied

Figure 1 from 'Dropout: A Simple Way to Prevent Neural Networks from Overfitting' #SRI14

The authors of the Dropout paper propose that a good way to reduce overfitting is to average out the predictions of many separately trained networks - but this is too computationally expensive to do in practice.

Enter dropout: on the left of Figure 1, you see a network with all its units and their connections. On the right, the crossed-out units have been dropped from the network along with all their connections, creating a new, 'thinned' version of the net.

When we send training samples through the network in the forward pass, we randomly sample units to drop. So for each sample we effectively train a 'thinned' version of the net; this approximates training and averaging many different neural nets that share parameters.
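
To make this concrete, here is a small standalone NumPy illustration (not part of the exercise code that follows) of how a fresh binary mask thins a layer's activations in one forward pass; keep_prob denotes the probability that a unit survives.

# toy activations of one layer for a single training sample
activations = np.array([0.5, -1.2, 0.8, 2.0, -0.3])

# sample a fresh binary mask: 1 = the unit survives, 0 = it is dropped
keep_prob = 0.5
mask = np.random.binomial(n=1, p=keep_prob, size=activations.shape)

# the 'thinned' activations that are passed on to the next layer
print(activations * mask)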

The paper presents the following motivation (#SRI14, p. 1932 / p. 4 in the PDF):

"Similarly, each hidden unit in a neural network trained with dropout must learn to work with a randomly chosen sample of other units. This should make each hidden unit more robust and drive it towards creating useful features on its own without relying on other hidden units to correct its mistakes. "

The following exercises walk you through an implementation of a dropout layer for a neural net.

Exercises

Mask operator

Task:

Implement a mask operator for the Node autodiff class.

Note: Remember to implement the partial derivative of the mask operator since it's crucial for backprop. If a unit is killed through dropout, it doesn't contribute anything to the network. So the gradient that flows back into it should be 0.

def mask(self, mask : np.ndarray):
    raise NotImplementedError()
    return

Node.mask = mask
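
If you get stuck, here is one possible sketch, not the reference solution. It assumes that Node supports elementwise multiplication between two Nodes with NumPy-style broadcasting (the loss function further below uses * this way). Reusing multiplication sidesteps writing an explicit backward rule: the gradient of x * m with respect to x is m, which is 0 wherever the mask is 0, exactly as required.

# Possible sketch: mask by multiplying with a constant Node, so that
# dropped units automatically receive a zero gradient.
def mask(self, mask: np.ndarray):
    return self * Node(mask)

Node.mask = mask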

Verify your solution.

# mask numbers 1..10 and square
a = Node(np.arange(1,11)[None,:], 'A')
mask = np.array([0, 1] * 5)
b = a.mask(mask).square()


# check gradients
grads = b.grad(np.ones(b.shape))['A']
assert grads[0,2] == 0
assert grads[0,3] == 8
assert grads[0,4] == 0

We'll again use the autodiff class to create a model for the breast cancer dataset. The method linear adds a linear layer to the network; your task will be to implement a dropout layer.

class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30,20,'fc1')
        self.do1 = self.dropout(keep_prob=0.5)
        self.fc2 = self.linear(20,10,'fc2')
        self.do2 = self.dropout(keep_prob=0.5)
        self.fc3 = self.linear(10,1,'fc3')
    
    # define forward pass
    def forward(self,x,train=True):
        if not type(x) == Node:
            x = Node(x)
        # TODO: implement forward pass
        raise NotImplementedError()
        return out
        
    # define loss function
    def loss(self,x,y,train=True):
        out = self.forward(x,train)
        if not type(y) == Node:
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()
    
    # add a linear layer to the model
    def linear(self, fan_in,fan_out,name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in,fan_out)
        b_name, b_value = f'bias_{name}', np.random.randn(1,fan_out)
        self.params[W_name] = W_value
        self.params[b_name] = b_value
        
        def forward(x):
            return x.dot(Node(self.params[W_name], W_name)) + Node(self.params[b_name], b_name)

        return forward
    
    # TODO: add dropout method

Forward pass

Task:

Implement the forward pass, e.g.

x -> linear -> tanh -> dropout
  -> linear -> tanh -> dropout
  -> linear -> sigmoid

Note: Mind the train parameter, which indicates whether we're forwarding training or test data. Dropout is not applied to test data. A possible sketch follows below.
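
One possible sketch of the forward pass, attached to the class after the fact (equivalent to filling in the TODO inside the class definition). It assumes that Node provides tanh() and sigmoid() operators, analogous to the log() and square() operators used elsewhere in this notebook; the dropout layers are only applied when forwarding training data.

# A possible sketch, assuming Node.tanh() and Node.sigmoid() exist.
def forward(self, x, train=True):
    if not type(x) == Node:
        x = Node(x)
    out = self.fc1(x).tanh()
    if train:                    # dropout only during training
        out = self.do1(out)
    out = self.fc2(out).tanh()
    if train:
        out = self.do2(out)
    out = self.fc3(out).sigmoid()
    return out

Model.forward = forward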

Dropout layer

To apply dropout, we multiply the activations of a layer elementwise with a binary matrix of 0s and 1s (masking).

The hyperparameter $p$ controls the fraction of units to keep. To make things more explicit, this parameter is also called keep_prob.

Each dropout layer can have a different setting for the keep_prob parameter $\in [0, 1]$.

Task:

Implement the dropout layer.

def dropout(self,keep_prob=0.5):
    raise NotImplementedError()

Model.dropout = dropout
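
One way to fill in the stub, shown as a sketch rather than the reference solution: it reuses the mask operator from the first exercise and draws a fresh binary mask on every call, so each forward pass thins the network differently.

# Possible sketch (before the expected-value fix discussed below):
# draw a new binary mask per call and apply it to the input Node.
def dropout(self, keep_prob=0.5):
    def forward(x):
        m = np.random.binomial(n=1, p=keep_prob, size=x.shape)
        return x.mask(m)
    return forward

Model.dropout = dropout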

Verify your implementation.

The first dropout layer has a keep_prob of 1.0, so all activations should survive.

The second dropout layer has a keep_prob of 0.8, so roughly 80% of the activations should survive and about 20% should be dead.

data = Node(np.random.randint(1,10,size=(10,10)))
out0 = dropout(None,keep_prob=1.0)(data)
out1 = dropout(None,keep_prob=0.8)(data)

# all units should survive
assert np.all(out0.value == data.value) 

# roughly 80% of units should survive
np.testing.assert_almost_equal(np.count_nonzero(out1.value)/out1.value.size, 0.8, decimal=1)

Expected value

Say we have a dropout layer with a keep_prob of 0.8, so only about 80% of the inputs survive. The expected value of the output is then about 80% of that of the input: for an activation $x$ and mask $m \sim \text{Bernoulli}(keep\_prob)$, we have $E[m \cdot x] = keep\_prob \cdot x$.

At test time, however, we don't apply dropout, so there is a scaling mismatch: the units receive inputs with a greater expected value than the thinned activations they learned on during training.

To remedy this, the dropout layer applies the dropout mask and then multiplies the values by $\frac{1}{keep\_prob}$ to correct the expected value; equivalently, the mask itself can be multiplied by $\frac{1}{keep\_prob}$. This variant is known as inverted dropout.

Task:

Update your implementation to fix the expected value. Verify your implementation below.
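
A possible updated sketch, building on the dropout sketch above (again relying on the mask operator from the first exercise): the binary mask is scaled by 1/keep_prob, so the expected value of the output matches that of the input.

# Possible sketch of 'inverted dropout': scale the mask by 1/keep_prob
# so the expected value is preserved and no rescaling is needed at test time.
def dropout(self, keep_prob=0.5):
    def forward(x):
        m = np.random.binomial(n=1, p=keep_prob, size=x.shape) / keep_prob
        return x.mask(m)
    return forward

Model.dropout = dropout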

net = Model()
data = Node(np.random.randint(1,100,size=(100,100)))
out = net.dropout(keep_prob=0.8)(data)

# mean of input and output should be similar
np.testing.assert_almost_equal(data.value.mean(), out.value.mean(), decimal=0)

The cell below runs a training loop which you can use to check whether your model learns appropriately.

# training
net = Model()
lrate = 0.002
batch_size = 75
test_losses = []
steps = 100

for i in range(steps):
    minis = np.random.choice(np.arange(len(x_train)),size=batch_size, replace=False)
    x_mini = x_train[minis,:]
    y_mini = y_train[minis]
    loss = net.loss(x_mini,y_mini,train=True)
    grads = loss.grad(1)
    new_params = { k : net.params[k] - lrate * grads[k]
                 for k in grads.keys() }
    net.params.update(new_params)               
    test_losses.append(net.loss(x_test,y_test,train=False).value.item())

# testing
pred = np.round(net.forward(x_test,train=False).value.squeeze())
accuracy = np.mean(pred == y_test)
print(f'accuracy on test set: {accuracy:.3f}')
plt.plot(test_losses)
plt.ylabel('loss on test set')
plt.xlabel('iterations');

Literature

[SRI14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov: "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Journal of Machine Learning Research 15 (2014), pp. 1929-1958.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Dropout
by Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.