Reasons for using Batch Normalization:
If we train on data where each feature follows a different distribution, the cost function is affected drastically. As you can see in the image, features such as the size of the cat and the color grading of the cat have different distributions, and each distribution is shown with a different color in the cost function plot.
This causes the cost function to become elongated, so the weights fail to update properly, leading to poor training or slow convergence.
Since each input affects the cost function differently, the change in the overall training weights is uneven as optimization moves through the cost surface.
Another reason for using batch normalization:
Covariate Shift:
If the distribution of the test data shifts away from that of the training data, then the weights that were learned during training no longer hold up, and performance degrades.
After Batch Normalization:
Here, batch normalization means that, since the data is divided into batches for training, each data point is normalized using the mean and standard deviation of the data points in its batch. Once every feature is normalized to mean 0 and standard deviation 1, training proceeds smoothly and our first problem is avoided.
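As a rough sketch of what that per-batch normalization looks like (a NumPy example with made-up names, not code from this post):

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each feature of a mini-batch to roughly zero mean and unit variance.

    x: array of shape (batch_size, num_features)
    eps: small constant that avoids division by zero
    """
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)

# A toy batch of 4 samples with 2 features on very different scales
batch = np.array([[150.0, 0.2],
                  [200.0, 0.5],
                  [120.0, 0.9],
                  [180.0, 0.4]])
normalized = normalize_batch(batch)
print(normalized.mean(axis=0))  # ~0 for each feature
print(normalized.std(axis=0))   # ~1 for each feature
```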
For test data, normalization relies on the statistics learned from the training data. One simple approach is to average the means and standard deviations across all training batches and apply them to the test data, so that it follows the same distribution as the training data and problem 2 is avoided.
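A minimal sketch of that idea (hypothetical names, NumPy): accumulate a running average of the batch statistics during training and reuse them at test time.

```python
import numpy as np

class RunningStats:
    """Keep an exponential moving average of batch mean and variance."""

    def __init__(self, num_features, momentum=0.9):
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def update(self, batch):
        # called once per training batch
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch.mean(axis=0)
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch.var(axis=0)

    def normalize_test(self, x, eps=1e-5):
        # at test time, use the statistics accumulated from the training batches
        return (x - self.running_mean) / np.sqrt(self.running_var + eps)
```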
Note:
This also reduces the effect of abrupt changes in the weights after each epoch. So batch normalization normalizes the input coming into each node based on the statistics of each batch, just as it does for the input data during training.
Procedure and formulation behind Batch Normalization:
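The standard per-batch computation (following the original batch normalization formulation; here m is the batch size and epsilon is a small constant for numerical stability) is:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}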
After this, we finally obtain the output value by applying the scale-and-shift formula below.
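In standard notation, with a scale factor gamma and a shift factor beta:

y_i = \gamma \hat{x}_i + \beta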
These scale and shift factors (gamma and beta) are learnable parameters during training; they let the network adjust the inputs to whatever distribution works best, rather than forcing a plain standard-normal distribution.
The same applies at test time, with the only difference being that we use the running mean and standard deviation learned during training.
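For instance, in a framework like PyTorch this switch is handled by the layer's train/eval modes (a small illustrative sketch, not part of the original post):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)  # gamma/beta are learnable; running stats are tracked

bn.train()                  # training mode: normalize with the current batch's statistics
x = torch.randn(8, 3)
_ = bn(x)                   # running_mean / running_var get updated here

bn.eval()                   # evaluation mode: normalize with the stored running statistics
test_x = torch.randn(2, 3)
_ = bn(test_x)              # uses bn.running_mean and bn.running_var learned during training
```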