
VGG NET PAPER BREAKDOWN

 

KEY POINTS:

Introduction:

  • The main idea of this paper is to explore the effect of depth in convolutional neural networks (CNNs). Previous papers tried to improve accuracy by using smaller receptive windows and smaller strides in the first convolutional layer, or by training and testing densely over the whole image at multiple scales.

  • They use a small 3x3 filter size in the convolutional layers, which keeps the parameter count low enough to make stacking many more convolutional layers feasible.

  • They also state that the two best-performing models generalize well to other datasets, even though they were trained on the ImageNet dataset.

Architecture:

  • The input is a fixed-size 224x224 RGB image.
  • The only preprocessing is mean-centering: the mean RGB value computed on the training set is subtracted from each pixel.
  • Small 3x3 convolution filters are used throughout; some configurations also include 1x1 convolutions.
  • The convolution stride is 1, and the padding is 1 pixel for the 3x3 layers, so spatial resolution is preserved after convolution.
  • Spatial pooling is carried out by 5 max-pooling layers with a 2x2 window and a stride of 2.
  • 3 fully connected layers follow: the first two have 4096 channels each, and the last has 1000 (one per ImageNet class), followed by a soft-max.
  • All hidden layers use ReLU activations.
  • Local Response Normalization is not used (the one configuration that included it showed no accuracy improvement).
  • They tried different convolutional depths, from 11 to 19 weight layers (configurations A-E), as sketched below.
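The configuration table from the paper is not reproduced here, so the following is only a rough sketch of the main configurations, using a shorthand where each number is the output channel count of a 3x3 convolution and "M" marks a 2x2 max-pool (configuration C, also 16 layers, matches D except that the third convolution in each of the last three stages is 1x1 instead of 3x3):

```python
# Sketch of the VGG configurations: A = 11, B = 13, D = 16, E = 19 weight layers
# (counting the 3 fully connected layers). Numbers are output channels of 3x3
# conv layers; "M" marks a 2x2 max-pool with stride 2.
vgg_cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
```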


  • In general, a stack of two 3x3 filters has the same effective receptive field as one 5x5 filter, and a stack of three 3x3 filters has the same effective receptive field as one 7x7 filter. So why use three 3x3 layers instead of one 7x7 layer? The answer is that the stack has fewer parameters: with C input and output channels, three 3x3 layers have 3 x (3^2 x C^2) = 27C^2 parameters, whereas a single 7x7 layer has 7^2 x C^2 = 49C^2, which is about 81% more (see the quick check after this list). The stack also interleaves three ReLU non-linearities instead of one, which makes the decision function more discriminative.
  • The 1x1 convolutions are a way to add extra non-linearity without changing the receptive field: the convolution itself is a linear projection onto a space of the same dimensionality, but each one is followed by a ReLU rectification, which introduces the non-linearity.
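As a quick numerical check of the parameter counts quoted above (a minimal sketch; C is assumed to be the number of channels, taken to be the same for input and output, and biases are ignored):

```python
# Parameters of a stack of three 3x3 conv layers vs. a single 7x7 layer,
# assuming C input channels and C output channels everywhere, no biases.
def conv_params(kernel: int, channels: int, layers: int = 1) -> int:
    """Each layer has kernel * kernel * C * C weights."""
    return layers * kernel ** 2 * channels ** 2

C = 256
three_3x3 = conv_params(3, C, layers=3)   # 27 * C^2
one_7x7 = conv_params(7, C)               # 49 * C^2

print(three_3x3, one_7x7)
print(f"one 7x7 layer has {one_7x7 / three_3x3 - 1:.0%} more parameters")  # ~81%
```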

Training:

Image size and regularization:

The network is trained on fixed-size 224 x 224 crops taken from rescaled training images. The training scale S is defined as the smallest side of the isotropically rescaled image, and two strategies for choosing it are considered: single-scale training, where S is fixed at 256 or 384, and multi-scale training ("scale jittering"), where S is sampled randomly from the range [256, 512] for each training image. Single-scale training keeps the image scale consistent, while multi-scale training exposes the network to objects at a wide variety of sizes. For speed, the multi-scale models are fine-tuned from a single-scale model pre-trained with S = 384 rather than trained from scratch. A short code sketch of this cropping scheme follows.
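This is a minimal sketch of the rescale-and-crop scheme described above, written with Pillow; the function names and the choice of Pillow are illustrative assumptions, not taken from the paper:

```python
import random
from PIL import Image

S_MIN, S_MAX = 256, 512   # scale range for multi-scale ("scale jittering") training
CROP = 224                # fixed ConvNet input size

def rescale_smaller_side(img: Image.Image, s: int) -> Image.Image:
    """Isotropically rescale the image so its smaller side equals the training scale S."""
    w, h = img.size
    scale = s / min(w, h)
    return img.resize((round(w * scale), round(h * scale)))

def random_training_crop(img: Image.Image, multi_scale: bool = True) -> Image.Image:
    """Sample S (fixed for single-scale, random for multi-scale), then take a random 224x224 crop."""
    s = random.randint(S_MIN, S_MAX) if multi_scale else 384
    img = rescale_smaller_side(img, s)
    w, h = img.size
    left = random.randint(0, w - CROP)
    top = random.randint(0, h - CROP)
    return img.crop((left, top, left + CROP, top + CROP))
```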

Training details:

  • ConvNet training follows Krizhevsky et al. (2012) with mini-batch gradient descent (batch size: 256, momentum: 0.9).
  • Regularization includes weight decay (L2 penalty: 5 × 10^−4) and dropout (0.5 ratio) for the first two fully-connected layers.
  • The learning rate starts at 10^−2 and is decreased by a factor of 10 whenever the validation accuracy stops improving; it is reduced three times in total, and training stops after 370K iterations (74 epochs). (These settings are sketched in code after this list.)
  • Faster convergence is attributed to implicit regularization from greater depth and smaller filter sizes, plus pre-initialization of certain layers.
  • A shallower configuration (net A) is trained first from random weight initialization; it is shallow enough for this to be tractable and stable.
  • Its first four convolutional layers and the three fully connected layers are then used to initialize the deeper configurations, with the remaining intermediate layers initialized randomly.
  • For random initialization, weights sampled from a normal distribution (mean: 0, variance: 10^−2), biases initialized to zero.
  • Fixed-size 224×224 ConvNet input images obtained by random cropping rescaled training images; augmentations include horizontal flipping and random RGB color shift for training set diversity.
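The following is only an illustrative PyTorch sketch of the hyper-parameters listed above (the original work did not use PyTorch); the stand-in model is an assumption, not the full VGG network:

```python
import torch
import torch.nn as nn

# Stand-in for a VGG-style classifier head, just to make the sketch runnable;
# dropout (p = 0.5) follows the first two fully connected layers, as in the paper.
model = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),
)

# Random initialization: weights ~ N(0, variance 10^-2), i.e. std = 0.1; biases = 0.
def init_weights(m: nn.Module) -> None:
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.zeros_(m.bias)

model.apply(init_weights)

# Mini-batch SGD: momentum 0.9, weight decay (L2) 5e-4, initial learning rate 1e-2.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)

# Drop the learning rate by a factor of 10 whenever validation accuracy stops improving;
# call scheduler.step(val_accuracy) once per epoch inside the training loop.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)
```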

Testing:

During testing, the input image is isotropically rescaled so that its smallest side equals a test scale Q, which need not match the training scale S. The network is then applied densely to the entire rescaled image: the fully connected layers are converted to convolutional layers (the first one to a 7x7 convolution, the last two to 1x1 convolutions), so the resulting fully convolutional network can take an input of any size. This produces a class score map with one channel per class, which is spatially averaged to give a fixed-size vector of class scores. Test-time augmentation consists of horizontal flipping, with the scores of the original and flipped images averaged. Using the whole image avoids having to sample many crops, which makes evaluation efficient; comparisons in the paper show little loss of accuracy relative to multi-crop evaluation with 50 crops per scale. A rough code sketch of this dense-evaluation procedure follows.
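Here is a rough PyTorch sketch of the dense-evaluation idea (the helper names are assumptions; the 512-channel, 7x7 feature-map shape corresponds to the standard VGG feature extractor at a 224 input, and `features` stands for that convolutional part):

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, kernel: int, in_ch: int) -> nn.Conv2d:
    """Reinterpret a fully connected layer as a convolution with the same weights."""
    conv = nn.Conv2d(in_ch, fc.out_features, kernel_size=kernel)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_ch, kernel, kernel)
    conv.bias.data = fc.bias.data
    return conv

def dense_predict(features: nn.Module, fc1: nn.Linear, fc2: nn.Linear, fc3: nn.Linear,
                  image: torch.Tensor) -> torch.Tensor:
    """Apply the fully convolutional net to an uncropped (N, 3, H, W) image, H, W >= 224."""
    head = nn.Sequential(
        fc_to_conv(fc1, kernel=7, in_ch=512), nn.ReLU(inplace=True),
        fc_to_conv(fc2, kernel=1, in_ch=4096), nn.ReLU(inplace=True),
        fc_to_conv(fc3, kernel=1, in_ch=4096),
    )

    def scores(x: torch.Tensor) -> torch.Tensor:
        score_map = head(features(x))      # class score map: (N, 1000, h, w)
        return score_map.mean(dim=(2, 3))  # spatial average -> (N, 1000)

    # Average the scores of the original and the horizontally flipped image.
    return (scores(image) + scores(torch.flip(image, dims=[3]))) / 2
```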

