What is the identity block in ResNet

ResNet-V1 (2015)

ResNet won first place in the ILSVRC 2015 classification task.

ResNet mainly addresses the degradation problem of deep networks. The degradation problem is that as network depth increases, accuracy first saturates (which is perhaps unsurprising) and then degrades rapidly. This is not caused by overfitting, since the training accuracy also decreases.

Consider adding identity-mapping layers to a plain network to make a deeper version of it: the deeper model should not produce a higher training error than its shallower counterpart. The authors' idea is therefore to make the added layers learn an approximate identity mapping.

The authors address the degradation problem by introducing a deep residual learning framework, arguing that fitting a residual mapping is easier than fitting the original mapping.
If degradation is avoided, a deeper network should perform better, or at least no worse, than its shallower plain counterpart.

Suppose F(x) is the residual function and H(x) is the desired underlying mapping. Then F(x) = H(x) - x, so H(x) = F(x) + x.
Since H(x) is hard to fit directly, the stacked layers fit F(x) instead, and H(x) is recovered as F(x) + x.

This module therefore realizes the mapping H(x) = F(x) + x; when the weights drive F(x) toward zero, the block reduces to an identity mapping.
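As a minimal sketch of this idea (written in PyTorch for illustration; the class name and layer choices are mine, not the authors' original code), a basic residual block computes F(x) with two 3x3 convolutions and then adds the shortcut x:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal basic residual block: output = ReLU(F(x) + x).

    F(x) consists of two 3x3 conv + BN layers; the shortcut is the identity,
    so input and output must have the same shape.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.relu(self.bn1(self.conv1(x)))  # first half of F(x)
        f = self.bn2(self.conv2(f))             # second half of F(x)
        return self.relu(f + x)                 # H(x) = F(x) + x

# e.g. y = BasicResidualBlock(64)(torch.randn(1, 64, 56, 56))  # same shape in and out
```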

The following figure compares three models: VGG-19, a 34-layer plain network, and ResNet-34:

Above, left: the VGG-19 model (19.6 billion FLOPs), for reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs).

When the dimensions stay the same, the shortcuts are drawn as solid lines; when the feature-map size is halved and the number of channels increases, they are drawn as dotted lines. Both the convolution and the shortcut on a dotted line use a stride of 2 (downsampling).

Note that the number of channels before and after a dashed-line shortcut differs, so how is the addition computed?

The authors list three options, A, B, and C:

  • (A) Zero-pad the extra channels when the dimensions change; all shortcuts remain parameter-free.
  • (B) Use projection shortcuts (a 1x1 convolution that adjusts the number of channels) only when the dimensions change; all other shortcuts are identities.
  • (C) All shortcuts are projections.

The authors tested all three: option C performs slightly better than B, but to keep complexity and model size down, option C is not adopted (option B is used at the dimension-changing shortcuts, as sketched below).
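A minimal sketch of option B (again PyTorch-style; the helper name is hypothetical): when the shapes match, the shortcut is a plain identity, otherwise a strided 1x1 convolution matches both the spatial size and the channel count:

```python
import torch.nn as nn

def make_shortcut(in_channels, out_channels, stride):
    """Option B: identity shortcut when the shapes match,
    1x1 projection (with the block's stride) when they do not."""
    if stride == 1 and in_channels == out_channels:
        return nn.Identity()                    # solid-line shortcut, parameter-free
    return nn.Sequential(                       # dotted-line shortcut
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```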

ResNet also has a number of variants:

To keep training time manageable, the authors replace the building block in ResNet-50 and deeper networks with a bottleneck structure: a 1x1 convolution first reduces the number of channels, a 3x3 convolution operates on the narrowed tensor, and a second 1x1 convolution restores the channel dimension. Because the block narrows in the middle, it is called a bottleneck.
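A hedged sketch of such a bottleneck block (PyTorch-style, with my own parameter names; the channel numbers in the usage comment follow the 64/256 pattern of ResNet-50's first stage):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand, plus the shortcut."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1, shortcut=None):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),   # 1x1: reduce channels
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride,
                      padding=1, bias=False),                      # 3x3 on the narrow tensor
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),  # 1x1: restore channels
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = shortcut if shortcut is not None else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))

# e.g. the first block of conv2_x in ResNet-50 maps 64 -> 256 channels,
# so it needs a 1x1 projection shortcut (see the option-B sketch above):
# block = Bottleneck(64, 64, 256, shortcut=make_shortcut(64, 256, stride=1))
```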

Testing shows that ResNet solves the degradation problem:

Although the depth increases significantly, the computational complexity of the 152-layer ResNet (11.3 billion FLOPs) is still lower than that of VGG-16 (15.3 billion FLOPs) and VGG-19 (19.6 billion FLOPs).

ResNet-V2 (2016)

In this paper the authors analyze the propagation of signals behind residual blocks and design a new residual block structure. The original ResNet begins to overfit at around 200 layers, while the new network does not overfit even at over 1000 layers, showing that the new design generalizes better than the original ResNet.

The authors argue that the identity shortcut is the most direct path for propagating information. Any operation placed on the shortcut connection (scaling, gating, 1x1 convolutions, dropout) impedes this propagation and makes optimization harder.

The comparison between the old and new residual blocks is as follows:

In the figure above, (a) is the original residual block and (b) is the new one. In (b) there is no ReLU after the addition; instead, a "pre-activation" consisting of BN and ReLU is placed before each weight layer.
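A sketch of this pre-activation block in the same PyTorch style (an illustration, not the authors' code): BN and ReLU precede each convolution, and nothing at all follows the addition:

```python
import torch.nn as nn

class PreActBlock(nn.Module):
    """ResNet-V2 style block: BN + ReLU ("pre-activation") before each weight layer,
    and a completely clean identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        )

    def forward(self, x):
        # No ReLU (or any other operation) after the addition:
        # the identity path from input to output stays untouched.
        return x + self.f(x)
```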

Understanding it mathematically

In the forward pass, the output of any deeper residual unit L equals the output of any shallower unit l plus the sum of the intermediate residual functions (and ultimately x_0 plus all accumulated residuals). The network therefore has a residual form between any pair of units, not only between adjacent ones.

During backpropagation, the gradient likewise splits into two terms: one that passes through no weight layer at all and one that passes through the weight layers. The first term guarantees that gradient information flows directly back to every shallower unit, so the gradient does not vanish.
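In the notation of the ResNet V2 paper (x_l is the input of residual unit l, F the residual function with weights W_i, and E the loss), the two statements above read:

```latex
% Forward: any deeper unit L equals a shallower unit l plus accumulated residuals
x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)

% Backward: the gradient splits into a direct term and a term through the weight layers
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)
```

The additive 1 is the direct path: it is never multiplied by any weight-layer Jacobian, so the gradient of E reaches every shallower unit without vanishing.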

The authors show, both analytically and experimentally, that keeping the shortcut as a pure identity works best (any extra operation on it hurts) and that pre-activation is effective.

Test results

Comparison of the results of the individual networks:

It can be seen that the authors' improvements have a clear effect.

ResNeXt (2016)

ResNeXt was the runner-up in the ILSVRC 2016 classification task.

Research in visual recognition is transitioning from "feature engineering" to "network engineering". Researchers now focus on designing network architectures that learn better representations.

The success of VGG shows that a simple but effective strategy (stacking building blocks of the same structure) can be used to construct very deep networks. ResNet adopts the same strategy: its stacked blocks share the same topology. Simple design rules also reduce the number of hyperparameters to choose.

By carefully designing the topology of its modules, the Inception family also achieves high accuracy while keeping theoretical complexity low.

However, Inception models are cumbersome to implement and carry many hyperparameters: the size and number of filters for every transformation must be specified, and the modules must be customized stage by stage. With so many influencing factors, it becomes unclear how to adapt an Inception model to new datasets and tasks.

ResNeXt combines the VGG/ResNet strategy of repeating identical modules with the split-transform-merge strategy of Inception to build a deep network, as shown on the right of Figure 1. Such a design makes the scale of the transformations easy to adjust.

Compared with the classic residual module, the authors design a new basic building block as follows:

As shown on the right of the figure above, each ResNeXt module has two parameters:

  1. Cardinality: the number of parallel paths (branches) in a ResNeXt module; this is the parameter the paper focuses on.
  2. The width of each path; in the figure above the width is 4.

In the authors' notation, the module above is a 32x4d ResNeXt module, where 32 is the cardinality and 4 is the width.
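A hedged PyTorch-style sketch of such a 32x4d block, using a grouped 3x3 convolution, which the paper shows is equivalent to 32 parallel 4-channel branches (class and argument names are mine):

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck block: `cardinality` parallel paths of width `width`,
    implemented with a single grouped 3x3 convolution."""
    def __init__(self, in_channels, out_channels, cardinality=32, width=4, stride=1):
        super().__init__()
        mid = cardinality * width                       # 32 * 4 = 128 for the 32x4d block
        self.f = nn.Sequential(
            nn.Conv2d(in_channels, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # groups=cardinality splits the 3x3 conv into 32 independent 4-wide paths
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.shortcut = (nn.Identity()
                         if stride == 1 and in_channels == out_channels
                         else nn.Sequential(
                             nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                             nn.BatchNorm2d(out_channels)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))
```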

The following figure compares the structures of ResNet-50 and ResNeXt-50; the two networks have roughly the same number of parameters and FLOPs.
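A back-of-the-envelope check of that claim, following the block templates in the ResNeXt paper for the first-stage blocks (256-d input, ignoring BN parameters and biases):

```latex
% ResNet-50 bottleneck (256 -> 64 -> 64 -> 256):
256 \cdot 64 + 3 \cdot 3 \cdot 64 \cdot 64 + 64 \cdot 256 \approx 70\text{k}

% ResNeXt-50 (32x4d), cardinality C = 32, width d = 4:
C \cdot (256 \cdot d + 3 \cdot 3 \cdot d \cdot d + d \cdot 256)
  = 32 \cdot (1024 + 144 + 1024) \approx 70\text{k}
```

Both come to roughly 70k parameters per block.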

The authors examine the effect of increasing depth, width, and cardinality:

The results show that increasing cardinality is a more effective way to improve network capacity than increasing depth or width.

Table 3 shows that when the bottleneck width is small, the gains from increasing cardinality saturate, so the bottleneck width is generally chosen to be no smaller than 4d.

Compared with other strong networks, ResNeXt also makes clear gains.

As the following figure shows, ResNeXt-50 generally achieves better results than ResNet-101, which means the same accuracy can be reached with less computation.

SENet (2017)

(This section draws mainly on "Momenta's In-Depth Explanation of the ImageNet 2017 SENet Architecture".)

SENet won the image classification task of the final ImageNet competition (2017) by a clear margin.

Much prior work improves network performance by operating along the spatial dimension, so it is natural to ask whether performance can also be improved along other dimensions, for example the relationships between feature channels. The SENet authors build on this idea and propose Squeeze-and-Excitation Networks (SENet).

How does squeeze-and-excitation work? The importance of each feature channel is learned automatically; according to this importance, useful channels are boosted while channels that are less useful for the current task are suppressed.

The following figure shows the SE block embedded in ResNet and Inception modules:

Global average pooling is used here as the squeeze operation. Two fully connected layers then form a bottleneck that models the correlation between channels and outputs one weight per input channel: the channel dimension is first reduced to 1/16 of the input, passed through a ReLU, and then restored to the original dimension by the second fully connected layer. Compared with a single fully connected layer, this design 1) adds non-linearity, so it can better model the complex correlations between channels, and 2) greatly reduces the number of parameters and computations. A sigmoid gate then produces a normalized weight between 0 and 1 for each channel, and a final scale operation multiplies each channel of the original features by its weight.
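A hedged PyTorch-style sketch of such an SE block (squeeze = global average pooling, excitation = two fully connected layers with reduction ratio 16, then a sigmoid and a channel-wise rescale; the class name is mine):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: learn one weight per channel and rescale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),  # squeeze channel dim to 1/16
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),  # restore channel dim
            nn.Sigmoid(),                                            # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (b, c)
        w = self.fc(s).view(b, c, 1, 1)   # excitation: one weight per channel
        return x * w                      # scale: reweight each channel

# Usage: apply to the output of a residual branch, e.g. y = SEBlock(256)(features)
```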

Most mainstream networks today are built by repeatedly stacking units similar to the two shown above, so the SE block can be embedded in almost any current architecture. Embedding it into the building block of an existing network yields the corresponding SENet, such as SE-BN-Inception, SE-ResNet, SE-ResNeXt, SE-Inception-ResNet-v2, and so on.

The table above shows that the SE module can bring a gain in performance to the network regardless of the depth of the network. It's worth noting that SE-ResNet-50 can achieve the same accuracy as ResNet-101, and SE-ResNet-101 far outperforms the deeper ResNet-152.

References

Intensive Reading of Deep Learning Papers (5) ResNet V1

ResNet V1 paper: Deep Residual Learning for Image Recognition

Intensive Reading of Deep Learning Papers (6) ResNet V2

ResNet V2 paper: Identity Mappings in Deep Residual Networks

Intensive Reading of Deep Learning Papers (7) DenseNet

DenseNet paper: Densely Connected Convolutional Networks

[DL Architectures: the ResNet Family] 003 ResNeXt

Detailed Explanation of the ResNeXt Algorithm

ResNeXt paper: Aggregated Residual Transformations for Deep Neural Networks

SENet paper: Squeeze-and-Excitation Networks

GitHub: https://github.com/hujie-frank/SENet

Momenta's In-Depth Explanation of the ImageNet 2017 Winning Architecture SENet

SE-Net: ImageNet 2017 Champion