Weight Initialization (Differentiable Programming)

Introduction

This notebook deals with parameter initialization in neural nets. Weights that start off too small or too large can cause gradients to vanish or explode, which is detrimental to the learning process. Xavier initialization aims to keep the variance of activations and gradients roughly constant across layers, so that signal keeps flowing in both the forward and the backward pass.

In this notebook, you'll compare different initialization techniques and study their effect on the network. Finally you'll implement a mechanism for custom weight initialization for the neural net library you've been building in this course.

Requirements

Knowledge

A recommended read on network initialization is the blog post "Initialization of deep networks" by Gustav Larsson [LAR15].

Prerequisites

This notebook uses the neural net framework you've been building in the 'Differentiable Programming' course, but you can also use the implementation provided in dp.py. If dp.py is located in the same folder as this notebook, you can access it as a module with import dp.

Python Modules

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, preprocessing
from dp import Model, Node, SGD, Adam

Variance

The variance of the product between two independent variables is:

$ {\rm Var}(XY) = E(X^2Y^2) − (E(XY))^2={\rm Var}(X){\rm Var}(Y)+{\rm Var}(X)(E(Y))^2+{\rm Var}(Y)(E(X))^2 $

Goodman, Leo A., "On the exact variance of products," Journal of the American Statistical Association, December 1960, 708–713.

With zero-mean variables, i.e. $ E(X) = E(Y) = 0 $, this simplifies to

$ {\rm Var}(XY) = {\rm Var}(X){\rm Var}(Y) $
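
As a quick numerical sanity check (a minimal sketch using the NumPy import from above; the sample size and distributions are arbitrary choices):

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1_000_000)    # zero mean, Var(X) = 4
y = rng.uniform(-1.0, 1.0, size=1_000_000)  # zero mean, Var(Y) = 1/3

print(np.var(x * y))          # ~ 4/3
print(np.var(x) * np.var(y))  # ~ 4/3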

Weight initialization by considering only the forward pass

  • The weight matrix $ W $ consists of $ m $
    • column vectors $ \vec w_i $ (the weights of hidden neuron $ i $; $ m $ hidden neurons for the layer in total),
    • where each element is drawn from an IID Gaussian with variance $ \text{var}(W) $.
  • The input vector of one example (also a hidden vector) is $ \vec x^T $ with $ n $ components and expected variance $ \text{var}(X) $.
  • For random initialization there is no correlation between the input and the weights.
  • Both should be approximately zero-mean (through initialization for $ W $ resp. data preprocessing for $ \vec x $).

Here $ n $ is also called the "fan in" of the layer and $ m $ its "fan out".

Now we want the variance to remain constant, i.e. the same variance for input and output in the linear regime. So the following expression should be 1:

$ \frac{\text{var}(\vec x^T \cdot \vec w_i)}{\text{var}(X)} = \frac{\text{var} (\sum_{j=1}^n x_j w_{ji})}{\text{var}(X)}= \frac{n {\ }\text{var}(X) \text{var}(W)}{\text{var}(X)} = n {\ }\text{var}(W) = 1 $

i.e.:

  • $ \text{var}(W) = 1/n $
    resp.
  • $ \text{std}(W) = 1/\sqrt n $
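
A quick numerical check of this rule (a sketch; the dimensions below are arbitrary): with $ \text{std}(W) = 1/\sqrt n $, the variance of the layer output matches the variance of the input.

rng = np.random.default_rng(1)
n, m = 500, 500                                     # fan in, fan out
x = rng.normal(0.0, 1.0, size=(10_000, n))          # var(X) = 1
W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))  # var(W) = 1/n

print(np.var(x))      # ~ 1.0
print(np.var(x @ W))  # ~ 1.0, the variance is preserved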

With ReLU units, only half of the units are in the active regime. So the variance of $ W $ must be twice as large to yield the same effect, i.e.:

  • $ \text{var}(W) = 2/n $
    resp.
  • $ \text{std}(W) = \sqrt{2/n} $

For training we do a forward pass and a backward pass. In the backward pass the error signal is "linearly" backpropagated through the same weights, so the analogous argument over the $ m $ outgoing connections yields $ \text{var}(W) = 1/m $.

Glorot et al. [GLO10] suggest taking the average of the forward and backward conditions for initialization, i.e.:

  • $ \text{var}(W) = 2/(n + m) $
    resp.
  • $ \text{std}(W) = \sqrt{2/(n+m)} $

For ReLUs:

  • $ \text{var}(W) = 4/(n + m) $
    resp.
  • $ \text{std}(W) = \sqrt{4/(n+m)} $
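
To summarize, here is a small sketch of helper functions computing these standard deviations (the function names are illustrative only and not part of the dp framework; fan_in corresponds to $ n $, fan_out to $ m $):

def std_forward(fan_in):
    return 1.0 / np.sqrt(fan_in)              # var(W) = 1/n

def std_forward_relu(fan_in):
    return np.sqrt(2.0 / fan_in)              # var(W) = 2/n

def std_glorot(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))  # var(W) = 2/(n+m)

def std_glorot_relu(fan_in, fan_out):
    return np.sqrt(4.0 / (fan_in + fan_out))  # var(W) = 4/(n+m)

print(std_glorot(500, 500))  # ~0.0447 for a layer with 500 inputs and 500 outputs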

Exercises

Forward pass

In this exercise we have some contrived train_data (1000 samples with 500 features). Implement the forward pass.

  • In each layer, the number of input and output features should remain the same.
  • The weights in each layer are drawn from the uniform distribution $ [-1..1] $.
  • Each layer uses the tanh activation function.
  • Return the activations across all layers

(We'll focus on the parameters xavier and gain in the next step)

train_data = np.random.randn(1000,500)
def feed_forward(x,num_layers,xavier=False,gain=np.sqrt(2)):
    raise NotImplementedError()
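
If you get stuck, here is one possible sketch (not the only valid solution). It uses the numpy import from above and returns the activations as a list of arrays, one per layer; the xavier and gain parameters are addressed in the "Xavier Initialization" step below and are ignored here.

def feed_forward(x, num_layers, xavier=False, gain=np.sqrt(2)):
    activations = []
    a = x
    n_features = x.shape[1]
    for _ in range(num_layers):
        # same number of input and output features in every layer,
        # weights drawn uniformly from [-1, 1]
        W = np.random.uniform(-1.0, 1.0, size=(n_features, n_features))
        a = np.tanh(a @ W)
        activations.append(a)
    return activations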

Visualise the activations

Feed your train_data through the forward pass with 10 layers. Then plot the distribution of the activation values of each layer in a histogram. What does this tell you about the saturation of the network?

# Some sample code that plots one histogram of activation values per layer

def plot(activations):
    plt.figure(figsize=(40,20))
    for i, a in enumerate(activations):
        plt.subplot(3, 4, i + 1)
        plt.hist(a.ravel(), bins=50)
        plt.title('layer {}'.format(i + 1))
    plt.show()

plot(feed_forward(train_data, num_layers=10))

Xavier Initialization

Update your implementation of feed_forward. If the parameter xavier is set to True, initialize all weights with Xavier initialization. Glorot et al. [GLO10] suggest the following normalized initialization:

$ W \sim U \left[ - \frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} , \frac{\sqrt6}{\sqrt{(fan\_in+fan\_out)}} \right] $

To put it into words: fan_in and fan_out are the numbers of input and output features of a layer. Weights are drawn from the uniform distribution between -sqrt(6/(fan_in+fan_out)) and sqrt(6/(fan_in+fan_out)), with the bound multiplied by a constant gain.
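
In code, the draw could look like the helper below (a sketch; the name xavier_uniform is illustrative and not part of the exercise interface):

def xavier_uniform(fan_in, fan_out, gain=1.0):
    # weights from U[-b, b] with b = gain * sqrt(6 / (fan_in + fan_out))
    bound = gain * np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-bound, bound, size=(fan_in, fan_out))

# Inside feed_forward, when xavier is True, one could then use:
# W = xavier_uniform(n_features, n_features, gain)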

Repeat the forward pass and plot the activations using Xavier initialization. Does this solve the problem of saturation?

plot(feed_forward(train_data,10,xavier=True))

Using the neural net framework

Now we turn towards implementing custom initializers in the neural net framework.

First, create a model for a classification problem. We'll use the breast cancer dataset. Define the following architecture:

  • First layer: Linear, 30 input features, 20 output features, tanh activation
  • Second layer: Linear, 20 input features, 10 output features, tanh activation
  • Third layer: Linear, 10 input features, 1 output feature, sigmoid activation
  • For the loss function, use cross-entropy.

Note: Your neural net implementation may not have a Tanh_Layer function to return a layer that performs a matrix multiplication followed by a tanh function. But equivalently, you can use a linear layer and apply the activation function in the forward pass, for example:

def __init__(self):
    self.hidden0 = self.Linear_Layer(...)

def forward(self, x):
    return self.hidden0(x).tanh()

x_train,y_train = datasets.load_breast_cancer(return_X_y=True)
x_train = preprocessing.scale(x_train)
#print(x_train)
#print(y_train)
print(x_train.shape)
class Net(Model):
    def __init__(self):
        super(Net,self).__init__()
        # create layers
        raise NotImplementedError()
        
    def loss(self,x,y):
        if not isinstance(y, Node):
            y = Node(y)
        # compute and return cross entropy loss, accumulated over all samples
        raise NotImplementedError()
        
    def forward(self, x):
        if not isinstance(x, Node):
            x = Node(x)
        # implement the forward pass
        # hidden_0 -> tanh -> hidden_1 -> tanh -> hidden_2 -> sigmoid
        raise NotImplementedError()

Implement Initializer

Initializer is an abstract class. Its method initialize iterates over all weights and biases in the network and sets their values.

Any subclass represents a specific initialization method, e.g. Xavier. A subclass implements the methods initial_weights(self, fan_in, fan_out) and initial_bias(self, fan_in, fan_out). The arguments fan_in and fan_out are the number of input and output features of the layer. The functions return initialized weights and bias suited for the layer, respectively.

class Initializer():
    def __init__(self):
        pass
        
    def initialize(self,net):
        for k,v in net.get_param().items():
            fan_in,fan_out = v.shape
            if 'weight' in k:
                W = self.initial_weights(fan_in,fan_out)
                np.copyto(v, W)
            elif 'bias' in k:
                b = self.initial_bias(fan_in, fan_out)
                np.copyto(v, b)
                
    def initial_weights(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')
    
    def initial_bias(self, fan_in, fan_out):
        raise NotImplementedError('Must be implemented by subclass')

Task: Implement a few different initializers (a possible sketch follows after the stub cell below).

  • LowInitializer: initializes all parameters close to 0
  • LargeInitializer: initializes parameters at large values, e.g. random numbers drawn from the uniform distribution [-100..100]
  • NormalInitializer: initializes parameters with values drawn from a normal distribution (as opposed to a uniform distribution)
  • XavierInitializer: initializes parameters using Xavier initialization

class LowInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
    
class LargeInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
    
class NormalInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
    
class XavierInitializer(Initializer):
    def initial_weights(self, fan_in, fan_out):
        pass
    def initial_bias(self, fan_in, fan_out):
        pass
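
For reference, here is a possible sketch of two of these initializers (the class names carry a "Sketch" suffix to mark them as illustrative; following Initializer.initialize above, both weights and biases are assumed to be filled with arrays of shape (fan_in, fan_out)):

class XavierInitializerSketch(Initializer):
    def initial_weights(self, fan_in, fan_out):
        # normalized (Xavier/Glorot) initialization
        bound = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-bound, bound, size=(fan_in, fan_out))
    def initial_bias(self, fan_in, fan_out):
        # biases are commonly initialized to zero
        return np.zeros((fan_in, fan_out))

class NormalInitializerSketch(Initializer):
    def initial_weights(self, fan_in, fan_out):
        # zero-mean Gaussian with std 1/sqrt(fan_in), cf. the forward-pass argument
        return np.random.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
    def initial_bias(self, fan_in, fan_out):
        return np.zeros((fan_in, fan_out))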

Run the training process several times, each time with a different initializer applied to the network. Compare how well the network learns.

net = Net()
#LowInitializer().initialize(net)
LargeInitializer().initialize(net)
#XavierInitializer().initialize(net)
#NormalInitializer().initialize(net)
optimiser = Adam(
    net,
    x_train=x_train,
    y_train=y_train,
    hyperparam = {"alpha": 0.01}
)
optimiser.train(steps=100,print_each=10);

Literature

  • [LAR15] Gustav Larsson: Initialization of deep networks. Blog post, 2015.
  • [GLO10] Xavier Glorot, Yoshua Bengio: Understanding the difficulty of training deep feedforward neural networks. Proceedings of AISTATS 2010.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Weight Initialization (Differentiable Programming)
by Benjamin Voigt, Diyar Oktay
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2019 Benjamin Voigt, Diyar Oktay

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.