KEY POINTS:
NOTE: All the information on this page is sourced from this paper: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
These are my notes on the paper; I hope they give others a quick overview.
DATASET:
- The model is evaluated on both ILSVRC-2010 and ILSVRC-2012; ILSVRC-2010 is the only version for which test-set labels are available.
- ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256
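A small sketch of this down-sampling as the paper describes it (rescale the image so the shorter side is 256, then crop the central 256 × 256 patch); it assumes Pillow is available and the function name is my own:

```python
from PIL import Image

def downsample_to_256(path):
    """Rescale so the shorter side is 256, then take the central 256 x 256 crop."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)                     # shorter side becomes exactly 256
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2  # central crop
    return img.crop((left, top, left + 256, top + 256))
```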
- Each raw RGB image is centered by subtracting the mean. The centering can be expressed mathematically as follows:
Centered Image = Raw Image − Mean Value
For each channel (R, G, B), the mean value is calculated across all pixels in that channel, and then this mean value is subtracted from each pixel in that channel.
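A minimal NumPy sketch of this per-channel centering (the function name and the random example are illustrative; in practice the means would be computed over the whole training set rather than a single image):

```python
import numpy as np

def center_image(image):
    """Subtract the per-channel (R, G, B) mean from an H x W x 3 image."""
    image = image.astype(np.float32)
    channel_means = image.mean(axis=(0, 1), keepdims=True)  # one mean per channel
    return image - channel_means

raw = np.random.randint(0, 256, size=(256, 256, 3))  # stands in for a 256 x 256 RGB image
centered = center_image(raw)
print(centered.mean(axis=(0, 1)))                     # roughly zero for each channel
```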
ARCHITECTURE:
- It contains eight learned layers — five convolutional and three fully-connected
- Regarding training time with gradient descent, saturating nonlinearities (such as sigmoid and tanh) are much slower than the non-saturating nonlinearity f(x) = max(0, x), i.e. the ReLU.
Saturating nonlinearities, such as the sigmoid and hyperbolic tangent (tanh) functions, have a limited output range. As the input to these functions becomes very positive or very negative, the output saturates, meaning it approaches the upper or lower bounds of the function. In the saturated regions, the gradient (derivative) of the function becomes very small, leading to the vanishing gradient problem. This issue can make training slower and more challenging, especially in deep networks, as weight updates become very small and may not contribute significantly to learning.
Non-saturating nonlinearities, like the rectified linear unit (ReLU), do not saturate for positive inputs. ReLU sets negative inputs to zero and passes positive inputs unchanged. Since ReLU does not saturate in the positive region, it helps mitigate the vanishing gradient problem to some extent. Training with non-saturating nonlinearities can be faster because the gradients in the non-saturated regions allow for more effective weight updates during backpropagation.
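A quick NumPy sketch (illustrative only, not from the paper) showing why this matters: the gradients of sigmoid and tanh shrink toward zero as the input grows, while the ReLU gradient stays at 1 for any positive input:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)            # saturates: tends to 0 as |x| grows

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2    # saturates: tends to 0 as |x| grows

def relu_grad(x):
    return float(x > 0)             # stays at 1 for any positive input

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x={x:4.1f}  sigmoid'={sigmoid_grad(x):.5f}  "
          f"tanh'={tanh_grad(x):.5f}  relu'={relu_grad(x):.0f}")
```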
GPU:
Trained on two GPUs. The parallelization scheme puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2, whereas the kernels in layer 4 take input only from those kernel maps in layer 3 that reside on the same GPU. Note: this is not a very important detail.
LOCAL RESPONSE NORMALIZATION:
ReLUs have the desirable property that they do not require input normalization to prevent them
from saturating. If at least some training examples produce a positive input to a ReLU, learning will
happen in that neuron. However, we still find that the following local normalization scheme aids
generalization. Denoting by a^i_{x,y} the activity of a neuron computed by applying kernel i at position
(x, y) and then applying the ReLU nonlinearity, the response-normalized activity b^i_{x,y} is given by
the expression

b^i_{x,y} = a^i_{x,y} / ( k + α · Σ_{j = max(0, i−n/2)}^{min(N−1, i+n/2)} (a^j_{x,y})^2 )^β
where the sum runs over n “adjacent” kernel maps at the same spatial position, and N is the total
number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined
before training begins. This sort of response normalization implements a form of lateral inhibition
inspired by the type found in real neurons, creating competition for big activities amongst neuron
outputs computed using different kernels. The constants k, n, α, and β are hyper-parameters whose
values are determined using a validation set.
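A rough NumPy sketch that reads the formula above literally; the hyper-parameter values k = 2, n = 5, α = 10^-4, β = 0.75 are the ones reported in the paper, while the array shapes are just an example:

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """a has shape (N, H, W): N kernel maps of ReLU activities. Returns b, same shape."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)            # sum over n "adjacent" kernel maps
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

relu_maps = np.maximum(np.random.randn(96, 55, 55), 0)  # e.g. 96 kernel maps of size 55 x 55
normalized = local_response_norm(relu_maps)
```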
OVERLAPPING POOLING:
Overlapping pooling simply means using a stride s that is smaller than the size of the pooling window: if the pooling filter is z × z, we choose s < z so that neighbouring pooling regions overlap (the paper uses s = 2 and z = 3).
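A minimal sketch of overlapping max pooling with the paper's values s = 2 and z = 3 (the loop-based implementation is only for clarity):

```python
import numpy as np

def max_pool_2d(x, z=3, s=2):
    """Overlapping max pooling on a 2-D feature map: z x z windows moved with stride s < z."""
    h, w = x.shape
    out_h, out_w = (h - z) // s + 1, (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out

feature_map = np.random.randn(55, 55)        # e.g. one 55 x 55 map
print(max_pool_2d(feature_map).shape)        # (27, 27): neighbouring windows overlap by 1 pixel
```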
OVERFITTING:
Overfitting is reduced by using data augmentation techniques such as:
- image translations and horizontal reflections.
- The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set (a rough sketch of this is shown below).
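A rough sketch of this PCA colour augmentation (sometimes called "fancy PCA"); the noise standard deviation of 0.1 is the value given in the paper, and the placeholder pixel array stands in for the real training-set pixels:

```python
import numpy as np

pixels = np.random.rand(100_000, 3)            # placeholder for all training-set RGB values

cov = np.cov(pixels, rowvar=False)             # 3 x 3 covariance of RGB values
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues lambda_i, eigenvectors p_i (columns)

def pca_color_augment(image, rng=np.random.default_rng()):
    """Add the same random combination of principal components to every pixel of an image."""
    alphas = rng.normal(0.0, 0.1, size=3)      # alpha_i ~ N(0, 0.1), drawn once per image
    shift = eigvecs @ (alphas * eigvals)       # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + shift                       # broadcasts over H x W x 3
```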
The other technique they employed to combat overfitting is dropout with p = 0.5, applied in the first two fully-connected layers.
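A tiny sketch of dropout as described in the paper: during training each hidden unit is zeroed with probability 0.5, and at test time all units are kept but their outputs are multiplied by 0.5:

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random.default_rng(0)):
    if train:
        mask = (rng.random(x.shape) >= p).astype(x.dtype)
        return x * mask          # drop each unit with probability p
    return x * (1.0 - p)         # test time: keep every unit, scale outputs by 0.5
```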
LEARNING APPROACH:
The model is trained using SGD with momentum with a batch size of 128 examples, momentum of 0.9, and
weight decay of 0.0005. We found that this small amount
of weight decay was important for the model to learn. In
other words, weight decay here is not merely a regularizer:
it reduces the model's training error. The update rule for weight w was

v_{i+1} = 0.9 · v_i − 0.0005 · ε · w_i − ε · ⟨∂L/∂w | w_i⟩_{D_i}
w_{i+1} = w_i + v_{i+1}

where i is the iteration index, v is the momentum variable, ε is the learning rate, and ⟨∂L/∂w | w_i⟩_{D_i} is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
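A direct NumPy translation of this update rule (the gradient here is a random placeholder for the batch-averaged derivative):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update: v <- 0.9*v - 0.0005*lr*w - lr*grad; w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v

w = np.random.randn(10) * 0.01    # weights (see the initialization notes below)
v = np.zeros_like(w)              # momentum variable
grad = np.random.randn(10)        # placeholder for the averaged gradient dL/dw over the batch
w, v = sgd_momentum_step(w, v, grad, lr=0.01)
```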
Weight initialization:
The weights in each layer are drawn from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers,
as well as in the fully connected hidden layers, with the constant 1. This initialization accelerates
the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron
biases in the remaining layers with the constant 0.
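A hedged sketch of this initialization for one layer (the layer sizes here are illustrative, not the actual AlexNet dimensions):

```python
import numpy as np

def init_layer(fan_in, fan_out, bias_value, rng=np.random.default_rng(0)):
    """Zero-mean Gaussian weights with std 0.01; biases set to 1 or 0 depending on the layer."""
    W = rng.normal(0.0, 0.01, size=(fan_in, fan_out))
    b = np.full(fan_out, bias_value, dtype=np.float64)
    return W, b

W_a, b_a = init_layer(256, 128, bias_value=1.0)  # e.g. a layer whose ReLUs should start positive
W_b, b_b = init_layer(256, 128, bias_value=0.0)  # e.g. one of the remaining layers
```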
Learning rate:
They used an equal learning rate for all layers, which was adjusted manually throughout training.
The heuristic that they followed was to divide the learning rate by 10 when the validation error
rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times before termination.
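A tiny sketch of this heuristic written as code (the example validation errors and the exact trigger condition are my own illustration):

```python
def lr_schedule(val_errors, lr=0.01, factor=10.0, max_reductions=3):
    """Divide the learning rate by 10 whenever the validation error stops improving."""
    best = float("inf")
    reductions = 0
    schedule = []
    for err in val_errors:
        if err < best:
            best = err
        elif reductions < max_reductions:
            lr /= factor              # validation error did not improve: lower the rate
            reductions += 1
        schedule.append(lr)
    return schedule

print(lr_schedule([0.45, 0.40, 0.41, 0.36, 0.37, 0.37]))
# [0.01, 0.01, 0.001, 0.001, 0.0001, 1e-05]
```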
Conclusion:
A deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. Notably, the network's performance degrades if a single convolutional layer is removed. For example,
removing any of the middle layers results in a loss of about 2% for the top-1 performance of the
network. So depth really is important for achieving our results.