Reasons for using Batch Normalization:
If we train on data where each feature follows a different distribution, the cost function is affected drastically. As you can see in the image, features such as the size of the cat and the color grading of the cat have different distributions, and each distribution is shown with a different color in the cost function plot.
This causes the cost function to become elongated, so the weights fail to update properly, leading to poor training or slow convergence.
Since each input affects the cost function differently, the change in the overall training weights is uneven as optimization moves through the cost surface.
Another reason for using batch normalization:
Covariate Shift:
If the distribution of the test data shifts away from that of the training data, then the weights that were learned during training no longer hold up, and performance degrades.
After Batch Normalization:
Here, batch normalization means that, since the data is divided into batches for training, each data point is normalized using the mean and standard deviation of the data points in its batch. Once every feature is normalized to mean 0 and standard deviation 1, training proceeds smoothly and our first problem is avoided.
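As a rough sketch of what that per-batch normalization looks like (a NumPy example with made-up names, not code from this post):

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize each feature of a mini-batch to roughly zero mean and unit variance.

    x: array of shape (batch_size, num_features)
    eps: small constant that avoids division by zero
    """
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    return (x - mean) / np.sqrt(var + eps)

# A toy batch of 4 samples with 2 features on very different scales
batch = np.array([[150.0, 0.2],
                  [200.0, 0.5],
                  [120.0, 0.9],
                  [180.0, 0.4]])
normalized = normalize_batch(batch)
print(normalized.mean(axis=0))  # ~0 for each feature
print(normalized.std(axis=0))   # ~1 for each feature
```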
For test data, normalization relies on the statistics learned from the training data. One simple approach is to average the means and standard deviations across all training batches and apply them to the test data, so that it follows the same distribution as the training data and problem 2 is avoided.
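A minimal sketch of that idea (hypothetical names, NumPy): accumulate a running average of the batch statistics during training and reuse them at test time.

```python
import numpy as np

class RunningStats:
    """Keep an exponential moving average of batch mean and variance."""

    def __init__(self, num_features, momentum=0.9):
        self.momentum = momentum
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def update(self, batch):
        # called once per training batch
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch.mean(axis=0)
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch.var(axis=0)

    def normalize_test(self, x, eps=1e-5):
        # at test time, use the statistics accumulated from the training batches
        return (x - self.running_mean) / np.sqrt(self.running_var + eps)
```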
Note:
This also reduces the effect of abrupt changes in the weights after each epoch. So batch normalization normalizes the input coming into each node based on the statistics of each batch, just as it does for the input data during training.
Procedure and formulation behind Batch Normalization:
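The standard per-batch computation (following the original batch normalization formulation; here m is the batch size and epsilon is a small constant for numerical stability) is:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i

\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}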
After this, we finally obtain the output value by applying the scale-and-shift formula below.
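In standard notation, with a scale factor gamma and a shift factor beta:

y_i = \gamma \hat{x}_i + \beta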
These scale and shift factors (gamma and beta) are learnable parameters during training; they let the network adjust the inputs to whatever distribution works best, rather than forcing a plain standard-normal distribution.
The same applies at test time, with the only difference being that we use the running mean and standard deviation learned during training.
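For instance, in a framework like PyTorch this switch is handled by the layer's train/eval modes (a small illustrative sketch, not part of the original post):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3)  # gamma/beta are learnable; running stats are tracked

bn.train()                  # training mode: normalize with the current batch's statistics
x = torch.randn(8, 3)
_ = bn(x)                   # running_mean / running_var get updated here

bn.eval()                   # evaluation mode: normalize with the stored running statistics
test_x = torch.randn(2, 3)
_ = bn(test_x)              # uses bn.running_mean and bn.running_var learned during training
```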