Batch Norm (Differentiable Programming)

Introduction
Requirements
- Knowledge
- Python Modules
Data
Batch normalization
Exercises
Literature
Licenses

Introduction

This notebook deals with batch normalization. You're likely familiar with feature scaling: transforming all features of the input data to roughly the same range before feeding them into the network.

Batch normalization takes this a step further and performs normalization on the activations at each layer.

Requirements

Knowledge

You don't have to study these resources before tackling this notebook, but they provide excellent coverage of the topic.

The original Batch Norm paper by S. Ioffe/C. Szegedy [IOF15]
The write-up Batch Norm layer by Leonardo Araujo dos Santos [ARA18]

Python Modules

import dp
from dp import NeuralNode,Node
import numpy as np
from sklearn import datasets,preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

Data

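This cell loads the breast cancer dataset provided by sklearn, standardizes each feature to zero mean and unit variance, and splits the data into a training set and a test set.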

x,y = datasets.load_breast_cancer(return_X_y=True)
x = preprocessing.scale(x)
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)

Batch normalization

Imagine a deep neural network attempting to learn. Before we feed our data into the network, we normalize each dimension to have a mean of 0 and a standard deviation of 1. To do so, we subtract the expected value (mean) $ E[x] $ and divide by the square root of the variance $ Var(x) $ of each dimension:

$ x^{norm} = \frac{x - E[x]}{\sqrt{Var(x)}} $

It's common to add a small number $ \epsilon $ to the variance to prevent division by zero. In Python code:

# data
foo = np.random.randn(1000,20)

# mean and variance per feature
mean = foo.mean(axis=0,keepdims=True)
var = foo.var(axis=0,keepdims=True)
epsilon = 1e-8

# normalization
foo_norm = (foo - mean)/np.sqrt(var + epsilon)
foo_norm.mean(), foo_norm.std()

So the first layer is happy and content: its input is always normalized. But what about the deeper layers? After the data has gone through multiple matrix multiplications in the network, its mean and standard deviation have likely shifted.

On top of that, in each iteration of learning, the parameters of the first layer change, so the succeeding layers try to learn on data with constantly shifting mean and variance.
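The shift described above can be made visible with a small NumPy experiment. This is only an illustrative sketch: the weight matrices `W1` and `W1_updated` are hypothetical random values, not part of the notebook's model, and the "update" is a random perturbation standing in for a gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)

# normalized input: 1000 samples, 20 features, mean 0 and std 1
x = rng.standard_normal((1000, 20))

# hypothetical first-layer weights (no batch norm involved)
W1 = rng.standard_normal((20, 20))
h1 = x @ W1

# simulate a parameter update of the first layer
W1_updated = W1 + 0.5 * rng.standard_normal(W1.shape)
h1_after = x @ W1_updated

# the input is normalized, but what the second layer sees is not,
# and its statistics move whenever W1 changes
print("input std:            ", x.std())
print("layer 1 output std:   ", h1.std())
print("std before update [:3]:", h1.std(axis=0)[:3])
print("std after update  [:3]:", h1_after.std(axis=0)[:3])
```

The second layer therefore receives inputs whose scale differs from the normalized data and drifts from step to step.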

To make things easier for the deeper layers, batch normalization is applied. Each layer calculates the mean and variance of each feature over the mini-batch of samples x. Each sample is then normalized as

x_norm = (x - batch_mean)/np.sqrt(batch_variance + epsilon)

Then the layer scales and shifts the normalized values with the parameters $ \gamma $ (gamma) and $ \beta $ (beta), which gives its output a customizable standard deviation ($ |\gamma| $) and mean ($ \beta $):

out = gamma * x_norm + beta

The parameters $ \gamma $ and $ \beta $ are learned during training. You update them as you would any other parameter, such as the weight of a linear layer, e.g. with SGD, Momentum or Adam.
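The two steps above can be sketched in plain NumPy. This is a minimal forward pass only, without automatic differentiation and without the `Node` class used in the exercises; the function name `batch_norm_forward` and the example values for `gamma` and `beta` are assumptions made for illustration.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, epsilon=1e-8):
    """Normalize each feature over the mini-batch, then scale and shift."""
    batch_mean = x.mean(axis=0, keepdims=True)
    batch_variance = x.var(axis=0, keepdims=True)
    x_norm = (x - batch_mean) / np.sqrt(batch_variance + epsilon)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)

# a mini-batch of 64 samples with 4 features, deliberately not normalized
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))

# example values; in a real network these would be learned
gamma = np.array([[1.0, 2.0, 0.5, 3.0]])
beta = np.array([[0.0, -1.0, 4.0, 10.0]])

out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0))  # close to beta
print(out.std(axis=0))   # close to gamma
```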

Exercises

Bias

Task:

A linear layer generally performs the forward pass $ x \cdot W + b $. Show that if we apply batch norm after a linear layer, we can omit the bias term $ b $.
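Before writing the proof, it may help to check the claim numerically. The snippet below is a sanity check, not a proof; the helper `normalize` and the random `x`, `W` and `b` are made up for this illustration.

```python
import numpy as np

def normalize(z, epsilon=1e-8):
    """Normalization step of batch norm (without gamma and beta)."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 5))   # mini-batch of 32 samples
W = rng.standard_normal((5, 3))    # weights of a linear layer
b = rng.standard_normal((1, 3))    # bias of a linear layer

with_bias = normalize(x @ W + b)
without_bias = normalize(x @ W)

# compare the normalized outputs with and without the bias term
same = np.allclose(with_bias, without_bias)
print(same)
```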

Implement batch norm

We'll use the Node class in dp.py for automatic differentiation and create a model for the breast cancer dataset. The architecture should look as follows

data(30 features) -> linear(30,20) -> batch norm -> tanh -> linear(20,10) -> batch norm -> tanh -> linear(10,1) -> sigmoid

Task:

Implement the method batch_norm to add a batch norm layer to the network.

class Model():
    # define layers of the model
    def __init__(self):
        self.params = dict()
        self.fc1 = self.linear(30,20,'fc1')
        self.bn1 = self.batch_norm(20,'bn1')
        self.fc2 = self.linear(20,10,'fc2')
        self.bn2 = self.batch_norm(10,'bn2')
        self.fc3 = self.linear(10,1,'fc3')
    
    # define forward pass
    def forward(self,x,train=True):
        if not type(x) == Node:
            x = Node(x)
        x = self.fc1(x)
        # pass the train flag on so batch norm can switch to moving averages
        x = self.bn1(x, train=train).tanh()
        x = self.fc2(x)
        x = self.bn2(x, train=train).tanh()
        x = self.fc3(x)
        out = x.sigmoid()
        return out
        
    # define loss function
    def loss(self,x,y):
        out = self.forward(x,train=True)
        if not type(y) == Node:
            y = Node(y)
        loss = -1 * (y * out.log() + (1 - y) * (1 - out).log())
        return loss.sum()
    
    # add a linear layer to the model
    def linear(self, fan_in,fan_out,name):
        W_name, W_value = f'weight_{name}', np.random.randn(fan_in,fan_out)
        self.params[W_name] = W_value
        
        def forward(x):
            return x.dot(Node(self.params[W_name], W_name))

        return forward

def batch_norm(self, fan_in, name):
    # TODO: add gamma and beta of this layer to self.params
    # TODO: define and return the forward pass, i.e. a function that
    #       applies batch norm to x
    raise NotImplementedError()
    
Model.batch_norm = batch_norm

Verify that your implementation works properly. For each feature, the output of the batch norm layer should have a mean of beta and a standard deviation of |gamma|.

net = Model()
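# check each key separately; `assert a, b in c` would only test the truthiness of `a`
assert 'gamma_bn1' in net.params and 'beta_bn1' in net.params
assert 'gamma_bn2' in net.params and 'beta_bn2' in net.params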

data = np.random.randn(100,10)
out = net.bn2(Node(data)).value
print(out.mean(axis=0))
print(out.std(axis=0))
assert np.allclose(out.mean(axis=0), net.params['beta_bn2'], atol=1e-5)
assert np.allclose(out.std(axis=0), np.abs(net.params['gamma_bn2']), atol=1e-5)

New data

Recall: During the training process, you calculate the mean and variance of each feature over the mini-batch of samples, then normalize each sample as

x_norm = (x - batch_mean)/np.sqrt(batch_variance + epsilon)

After training is completed, you may want to classify a single sample or a whole dataset, so there are no mini-batches.

To account for this, during the learning process you keep track of a moving average of the batch mean and of the batch variance. These moving averages are then used to normalize non-training data. If moving averages are new to you, you may want to check out this Notebook on optimizers.
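The idea can be sketched in plain NumPy, independent of the `Node` class. This sketch assumes an exponential moving average with a decay rate `momentum = 0.9`; both the averaging scheme and that value are assumptions for illustration, not something the notebook prescribes.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9   # assumed decay rate of the moving average
epsilon = 1e-8

# moving averages start at zero, one entry per feature
avg_mean = np.zeros((1, 3))
avg_variance = np.zeros((1, 3))

# "training": update the moving averages with the statistics of each mini-batch
for step in range(200):
    batch = rng.normal(loc=42.0, scale=10.0, size=(256, 3))
    batch_mean = batch.mean(axis=0, keepdims=True)
    batch_variance = batch.var(axis=0, keepdims=True)
    avg_mean = momentum * avg_mean + (1 - momentum) * batch_mean
    avg_variance = momentum * avg_variance + (1 - momentum) * batch_variance

# "inference": normalize a single sample with the stored averages
sample = np.array([[42.0, 52.0, 32.0]])
sample_norm = (sample - avg_mean) / np.sqrt(avg_variance + epsilon)

print(avg_mean)      # roughly 42 per feature
print(avg_variance)  # roughly 100 per feature
print(sample_norm)   # roughly [0, 1, -1]
```

Because the averages start at zero, the first few updates underestimate the true statistics; after enough mini-batches the influence of the initial value becomes negligible.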

Task:

Change your implementation of the batch_norm method to add avg_mean_{layer_name} and avg_variance_{layer_name} to the parameters of the model. For each mini-batch that the network sees during training, update these parameters as the moving average of the batch mean and batch variance, respectively.

Note: The forward function returned by the batch_norm method needs a parameter such as train to distinguish between train batches and test samples.

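def batch_norm(self, fan_in, name):
    # TODO: add gamma, beta, avg_mean and avg_variance of this layer to self.params
    raise NotImplementedError()

    def forward(x, train=True):
        # TODO: if train, normalize with the batch statistics and update the moving averages
        # TODO: otherwise, normalize with the stored moving averages
        raise NotImplementedError()
    
    return forward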
    
Model.batch_norm = batch_norm

Verify your implementation: This test feeds data with a mean of 42 and a standard deviation of 10 through the batch norm layer many times and checks the moving averages the layer has learned.

net = Model()
assert np.all(net.params['avg_mean_bn1'] == 0)
assert np.all(net.params['avg_variance_bn2'] == 0)
for i in range(100):
    data = Node(np.random.normal(loc=42,scale=10,size=((1000,20))))
    net.bn1(data)
np.testing.assert_allclose(42, net.params['avg_mean_bn1'], atol=1)
np.testing.assert_allclose(10**2, net.params['avg_variance_bn1'], atol=5)

Gradient descent

This cell creates a simple training loop to train and then test the model.

net = Model()

lrate = 0.01
batch_size = 50
steps = 100

# training
for i in range(steps):
    minis = np.random.choice(np.arange(len(x_train)),size=batch_size, replace=False)
    x_mini = x_train[minis,:]
    y_mini = y_train[minis]
    loss = net.loss(x_mini,y_mini)
    grads = loss.grad(1)
    new_params = { k : net.params[k] - lrate * grads[k]
                 for k in grads.keys() }
    net.params.update(new_params)
    
# testing
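# use train=False so that test data is meant to be normalized with the moving averages
pred = np.round(net.forward(x_test, train=False).value).squeeze()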
np.mean(pred == y_test)

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Batch Normalization
by Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
