KEY POINTS:
Abstract:
They introduced a visualization technique that sheds light on the function of intermediate feature layers and the operation of the classifier. An ablation study was also performed to understand the contribution of each model layer. The proposed model outperformed AlexNet on the ImageNet dataset and showed efficacy across various other datasets.
Introduction:
The proposed visualization technique uses a multi-layered Deconvolutional Network (deconvnet), following (Zeiler et al., 2011), to project feature activations back to the input pixel space. A sensitivity analysis of the classifier output was also conducted by occluding portions of the input image, revealing which parts of the scene are significant for classification.
Approach:
The model itself mirrors a standard convnet.
Visualization:
To examine a convnet, a deconvnet is attached to each layer, providing a continuous path back to the image pixels. An input image is presented to the convnet and features are computed across the layers. To examine a given convnet activation, all other activations in that layer are zeroed and the feature maps are passed as input to the attached deconvnet layer. The signal is then successively unpooled, rectified, and filtered until it reaches the input pixel space.
Unpooling: Since max pooling is non-invertible, switch variables record the locations of the maxima during the forward pass, preserving the structure of the stimulus during unpooling.
Rectification: Reconstructed signals are passed through a ReLU non-linearity, ensuring valid (positive) feature reconstructions at each layer.
Filtering: Transposed versions of the convnet's filters are applied to the rectified maps to invert the filtering step. Because the model is trained discriminatively, the reconstructions implicitly show which parts of the input image are discriminative.
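The three operations compose into a single projection step. Below is a minimal sketch in PyTorch; the paper predates this framework, so the function and argument choices here are illustrative, not the authors' code:

import torch
import torch.nn.functional as F

def deconv_step(fmap, switches, conv_weight, pool_kernel=2, pool_stride=2):
    """Project one layer's activations back toward pixel space:
    unpool (using the switches recorded on the forward pass),
    rectify, then filter with the transposed convolution weights."""
    # Unpool: place each max value back at the location stored in `switches`.
    x = F.max_unpool2d(fmap, switches, kernel_size=pool_kernel, stride=pool_stride)
    # Rectify: keep only positive signals, as in the forward pass.
    x = F.relu(x)
    # Filter: apply the transposed version of the layer's learned filters.
    x = F.conv_transpose2d(x, conv_weight)
    return x

# On the forward pass the switches must be recorded, e.g.:
# fmap, switches = F.max_pool2d(act, 2, 2, return_indices=True)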
Training details:
The model was trained on the ImageNet 2012 training set. RGB images were preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean, and taking 10 different 224x224 sub-crops. Training used stochastic gradient descent with a mini-batch size of 128, a starting learning rate of 10^-2, and a momentum term of 0.9. Dropout in the fully connected layers (6 and 7) used a rate of 0.5. Weights were initialized to 10^-2 and biases were set to 0.
Visualizing the first layer filters during training revealed that a few filters dominate. This was addressed by renormalizing each convolutional filter whose RMS value exceeded a fixed radius of 10^-1 back to that radius. Multiple crops and flips of each training example were produced to boost the training set size. Training concluded after 70 epochs, taking around 12 days on a single GTX 580 GPU.
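A rough translation of this recipe into PyTorch (a sketch only; the placeholder model, the normal-distribution initialization, and the exact renormalization scheme are assumptions beyond what the summary states):

import torch
import torch.nn as nn

def init_weights(m):
    # Weights initialized around 1e-2, biases set to 0, per the training details.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, std=1e-2)
        nn.init.zeros_(m.bias)

def renormalize_filters(conv, radius=1e-1):
    # Rescale each filter whose RMS exceeds 0.1 back to that fixed radius.
    with torch.no_grad():
        w = conv.weight  # shape: (out_ch, in_ch, kH, kW)
        rms = w.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        w.mul_(scale)

model = nn.Sequential(nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU())  # placeholder stub
model.apply(init_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# Mini-batch size 128; nn.Dropout(p=0.5) in fully connected layers 6 and 7.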
Effect of transformations:
Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, where the response is quasi-linear for translation and scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry.
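One way to probe this observation empirically (a hypothetical sketch; model_to_layer stands in for any hook or truncated model that returns a given layer's activations):

import torch

def stability(model_to_layer, img, shift):
    """Compare a layer's features for an image and a translated copy of it."""
    shifted = torch.roll(img, shifts=shift, dims=-1)   # horizontal translation
    a, b = model_to_layer(img), model_to_layer(shifted)
    return (a - b).norm() / a.norm()                   # relative feature change

# Expect a large value when probing layer 1 and a much smaller,
# roughly linear-in-shift value when probing layer 7, per the text above.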
Feature Evolution:
The lower layers of the model converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to train the models until they fully converge.
Note:
The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) result in more distinctive features and fewer "dead" features.
Architecture Selection:
The first layer filters exhibit a combination of extremely high and low-frequency information, with limited coverage of mid-frequencies. Furthermore, the visualization of the second layer reveals aliasing artifacts attributed to the substantial stride of 4 utilized in the first layer convolutions. To address these issues, adjustments were made by (i) reducing the first layer filter size from 11x11 to 7x7 and (ii) modifying the convolution stride to 2 instead of 4. This revised architecture preserves significantly more information in the first and second layer features, as observed in the results. Importantly, these modifications also enhance the classification performance.
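In a modern framework this architectural change amounts to one line (illustrative PyTorch; the channel count follows the AlexNet-style layout, and padding choices are omitted):

import torch.nn as nn

# Original AlexNet-style first layer: large filters, large stride.
layer1_old = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# Revised first layer: smaller filters and stride preserve more
# mid-frequency information and reduce aliasing in layer 2.
layer1_new = nn.Conv2d(3, 96, kernel_size=7, stride=2)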
Occlusion Sensitivity:
The examples clearly show that the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded.
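A minimal sketch of such an occlusion sweep (patch size, stride, and grey fill value here are assumptions, not the paper's exact settings):

import torch

def occlusion_map(model, img, true_class, patch=64, stride=32, fill=0.5):
    """Slide a grey square over the image and record the probability of the
    true class at each position; low probability marks an important region."""
    _, H, W = img.shape  # assumes a (C, H, W) tensor
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = img.clone()
            occluded[:, y:y + patch, x:x + patch] = fill  # grey occluder
            logits = model(occluded.unsqueeze(0))
            heat[i, j] = logits.softmax(dim=-1)[0, true_class]
    return heat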
Correspondence Analysis:
To examine whether deep models implicitly establish correspondence between object parts despite lacking any explicit mechanism for it, the authors performed a correspondence analysis. Using five dog images with the same facial part masked in each, they measured how consistently the feature vectors at different layers changed. Lower scores for specific facial elements (e.g., the left eye) compared to random object regions, especially for layer 5 features, suggest that the model implicitly establishes some degree of correspondence, capturing consistent changes in feature representation when the same facial element is masked across different images.
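A sketch of one such consistency score (assuming, as the paper describes, that consistency is measured via Hamming distance between the signs of the feature changes; the function and tensor names are hypothetical):

import torch

def correspondence_score(feats_orig, feats_masked):
    """Given per-image feature vectors before/after masking the same part
    (e.g. the left eye) in several images, score how consistently the masking
    perturbs the features: sum of pairwise Hamming distances between the
    signs of the feature differences. Lower = more consistent = stronger
    implicit correspondence."""
    eps = torch.sign(feats_orig - feats_masked)  # (n_images, feature_dim)
    n = eps.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (eps[i] != eps[j]).float().mean()
    return total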
Varying model sizes:
Removing fully connected layers 6 and 7 caused only a slight increase in error; similarly, removing the middle convolution layers also caused only a slight increase in error. However, combining these two changes caused a significant rise in error, suggesting that the network must be sufficiently deep to achieve good results.
In contrast, increasing the number of fully connected layers yielded only a slight decrease in error, while increasing the size of the middle convolution layers gave a useful drop in error, though it resulted in overfitting.
NOTE:
Higher layers generally produce more discriminative features.