Residual neural network

A Residual Block in a deep Residual Network. Here the Residual Connection skips two layers.

A residual neural network (also referred to as a residual network or ResNet)^[1] is a seminal deep learning model in which the weight layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition and won that year's ImageNet Large Scale Visual Recognition Challenge (ILSVRC).^[2]^[3]

ResNet behaves like a highway network whose gates are opened through strongly positive bias weights.^[4] This enables deep learning models with tens or hundreds of layers to train easily and to reach better accuracy as depth increases. Identity skip connections, often referred to as "residual connections", are also used in the original LSTM network,^[5] transformer models (e.g., BERT and GPT models such as ChatGPT), the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.

Formulation

Background

In 2012, AlexNet^[6] was developed for ILSVRC (which it won), becoming a highly influential model in the development of deep learning for computer vision. AlexNet is an eight-layer convolutional neural network (CNN), and although CNNs had been around since at least LeNet in the 1990s,^[7] AlexNet helped pioneer their use in real-world applications through the use of GPUs to perform parallel computations and ReLU ("Rectified Linear Unit") layers to speed up gradient descent.

In 2014, VGGNet was developed by the Visual Geometry Group (VGG) at the University of Oxford, improving upon the AlexNet architecture. Whereas AlexNet used only a few convolutional layers with sometimes large kernels (up to $11\times 11$), VGGNet pioneered the use of deeper CNNs (i.e., ones with more layers) by using many smaller kernels ($3\times 3$) stacked together.

However, stacking too many layers led to a steep reduction in training accuracy,^[8] known as the "degradation" problem.^[1] In theory, adding layers to deepen a network should not result in a higher training loss, but this is exactly what happened with VGGNet.^[1] If the extra layers could be set to identity mappings, the deeper network would represent the same function as its shallower counterpart. It is hypothesized that the optimizer is unable to approach identity mappings with the parameterized layers. Making the identity mapping easy to represent is the main idea behind residual learning, explained further below.

Residual learning

In a multi-layer neural network model, consider a subnetwork with a certain number of stacked layers (e.g., 2 or 3). Denote the underlying function performed by this subnetwork as ${\textstyle H(x)}$ , where ${\textstyle x}$ is the input to the subnetwork. Residual learning re-parameterizes this subnetwork and lets the parameter layers represent a "residual function" ${\textstyle F(x):=H(x)-x}$ . The output ${\textstyle y}$ of this subnetwork is then represented as:

$$y=F(x)+x$$

The operation " ${\textstyle +\ x}$ " is implemented via a "skip connection" that performs an identity mapping, connecting the input of the subnetwork with its output. This connection is referred to as a "residual connection" in later work. The function ${\textstyle F(x)}$ is often represented by matrix multiplications interleaved with activation functions and normalization operations (e.g., batch normalization or layer normalization). As a whole, one of these subnetworks is referred to as a "residual block".^[1] A deep residual network is constructed by simply stacking these blocks together.
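As a minimal sketch of the formula above, the following Python code builds one residual block from plain lists. The two-layer form of ${\textstyle F}$ (a linear map, a ReLU, and another linear map) and all weight values are assumptions chosen for illustration, not the configuration of any published model.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Return y = F(x) + x, where F(x) = W2 * relu(W1 * x)."""
    f = linear(w2, relu(linear(w1, x)))       # residual function F(x)
    return [fi + xi for fi, xi in zip(f, x)]  # skip connection adds x back


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block reduces to the identity mapping.
y_identity = residual_block(x, zeros, zeros)

# With non-zero weights the block outputs the input plus a correction term.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w2 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
y = residual_block(x, w1, w2)  # [1.5, -2.0, 4.5]
```

The zero-weight case shows why an identity mapping is easy to represent in this parameterization: the layers only have to drive ${\textstyle F(x)}$ towards zero.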

Importantly, the underlying principle of residual blocks is also the principle of the original LSTM cell,^[5] a recurrent neural network that predicts an output at time $t+1$ as ${\textstyle y_{t+1}=F(x_{t})+x_{t}}$ , which becomes ${\textstyle y=F(x)+x}$ during backpropagation through time.^[9]

Signal propagation

The introduction of identity mappings facilitates signal propagation in both forward and backward paths, as described below.^[10]

Forward propagation

If the output of the ${\textstyle \ell }$ -th residual block is the input to the ${\textstyle (\ell +1)}$ -th residual block (assuming no activation function between blocks), then the ${\textstyle (\ell +1)}$ -th input is:

$$x_{\ell +1}=F(x_{\ell })+x_{\ell }$$

Applying this formulation recursively, e.g.,

$$x_{\ell +2}=F(x_{\ell +1})+x_{\ell +1}=F(x_{\ell +1})+F(x_{\ell })+x_{\ell }$$

yields the general relationship:

$$x_{L}=x_{\ell }+\sum _{i=\ell }^{L-1}F(x_{i})$$

where ${\textstyle L}$ is the index of a residual block and ${\textstyle \ell }$ is the index of some earlier block. This formulation suggests that there is always a signal that is directly sent from a shallower block ${\textstyle \ell }$ to a deeper block ${\textstyle L}$ .
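The general relationship can be checked numerically. The sketch below uses scalar inputs and a hypothetical per-block function `F(x, i)`. Any function would serve, since the identity only relies on the additive structure of the blocks.

```python
import math


def F(x, i):
    """Hypothetical residual function of block i."""
    return 0.1 * (i + 1) * math.tanh(x)


def forward(x_start, start, stop):
    """Apply blocks start..stop-1; return the final value and every block input."""
    inputs = []
    x = x_start
    for i in range(start, stop):
        inputs.append(x)  # x_i, the input to block i
        x = F(x, i) + x   # x_{i+1} = F(x_i) + x_i
    return x, inputs


ell, L = 2, 7
x_ell = 0.8
x_L, block_inputs = forward(x_ell, ell, L)

# Unrolled form: x_L = x_ell + sum_{i=ell}^{L-1} F(x_i)
unrolled = x_ell + sum(F(x_i, i) for i, x_i in zip(range(ell, L), block_inputs))

# The two values agree up to floating-point rounding.
assert math.isclose(x_L, unrolled, rel_tol=1e-12)
```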

Backward propagation

The residual learning formulation provides the added benefit of addressing the vanishing gradient problem to some extent. However, the vanishing gradient issue is not the root cause of the degradation problem, because vanishing gradients are already tackled through the use of normalization layers. To observe the effect of residual blocks on backpropagation, consider the partial derivative of a loss function ${\textstyle {\mathcal {E}}}$ with respect to some residual block input ${\textstyle x_{\ell }}$ . Using the equation above from forward propagation for a later residual block $L>\ell$ :^[10]

$$\begin{aligned}
\frac{\partial \mathcal{E}}{\partial x_{\ell}} &= \frac{\partial \mathcal{E}}{\partial x_{L}}\,\frac{\partial x_{L}}{\partial x_{\ell}}\\
&= \frac{\partial \mathcal{E}}{\partial x_{L}}\left(1+\frac{\partial}{\partial x_{\ell}}\sum_{i=\ell}^{L-1}F(x_{i})\right)\\
&= \frac{\partial \mathcal{E}}{\partial x_{L}}+\frac{\partial \mathcal{E}}{\partial x_{L}}\,\frac{\partial}{\partial x_{\ell}}\sum_{i=\ell}^{L-1}F(x_{i})
\end{aligned}$$

This formulation suggests that the gradient at a shallower layer, ${\textstyle {\frac {\partial {\mathcal {E}}}{\partial x_{\ell }}}}$ , always contains the term ${\textstyle {\frac {\partial {\mathcal {E}}}{\partial x_{L}}}}$ from the deeper block, added directly. Even if the gradients of the ${\textstyle F(x_{i})}$ terms are small, the total gradient ${\textstyle {\frac {\partial {\mathcal {E}}}{\partial x_{\ell }}}}$ resists vanishing thanks to the added term ${\textstyle {\frac {\partial {\mathcal {E}}}{\partial x_{L}}}}$ .
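The effect can be illustrated with a scalar toy model. In the sketch below, every block uses the same hypothetical function $F(x)=w\tanh(x)$ with a deliberately small weight $w$; both the function and the values of `W` and `DEPTH` are assumptions made for illustration. For scalar blocks the chain rule gives $\partial x_{L}/\partial x_{\ell }=\prod _{i}(1+F'(x_{i}))$ with skip connections and $\prod _{i}F'(x_{i})$ without them.

```python
import math

W = 0.01    # hypothetical small weight, so each F has a tiny derivative
DEPTH = 20  # number of stacked blocks


def F(x):
    return W * math.tanh(x)


def dF(x):
    """Derivative of F with respect to its input."""
    return W * (1.0 - math.tanh(x) ** 2)


def input_gradient(x, residual):
    """Return d x_L / d x_ell by the chain rule over DEPTH blocks."""
    grad = 1.0
    for _ in range(DEPTH):
        if residual:
            grad *= 1.0 + dF(x)  # the "1" comes from the identity path
            x = F(x) + x
        else:
            grad *= dF(x)        # plain stack: only the small factors remain
            x = F(x)
    return grad


grad_residual = input_gradient(0.5, residual=True)
grad_plain = input_gradient(0.5, residual=False)
print(f"residual: {grad_residual:.4f}  plain: {grad_plain:.3e}")
```

In this toy setting the residual gradient stays at or above 1, while the plain-stack gradient is a product of twenty factors no larger than 0.01 and is therefore vanishingly small.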

Variants of residual blocks

Two variants of convolutional Residual Blocks.^[1] **Left**: a *Basic Block* that has two 3x3 convolutional layers. **Right**: a *Bottleneck Block* that has a 1x1 convolutional layer for dimension reduction (e.g., 1/4), a 3x3 convolutional layer, and another 1x1 convolutional layer for dimension restoration.

Basic block

A Basic Block is the simplest building block studied in the original ResNet.^[1] This block consists of two sequential 3x3 convolutional layers and a residual connection. The input and output dimensions of both layers are equal.

Bottleneck block

A Bottleneck Block^[1] consists of three sequential convolutional layers and a residual connection. The first layer in this block is a 1x1 convolution for dimension reduction, e.g., to 1/4 of the input dimension; the second layer performs a 3x3 convolution; the last layer is another 1x1 convolution for dimension restoration. The models of ResNet-50, ResNet-101, and ResNet-152 in ^[1] are all based on Bottleneck Blocks.
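To make the effect of the dimension reduction concrete, the following sketch counts convolution weights for both block types. The channel width of 256 is an illustrative assumption, and biases and normalization parameters are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def basic_block_weights(channels):
    # Two 3x3 convolutions that keep the channel count unchanged.
    return 2 * conv_weights(3, channels, channels)


def bottleneck_block_weights(channels, reduction=4):
    mid = channels // reduction
    return (
        conv_weights(1, channels, mid)    # 1x1 reduction
        + conv_weights(3, mid, mid)       # 3x3 on the reduced width
        + conv_weights(1, mid, channels)  # 1x1 restoration
    )


channels = 256  # illustrative width
basic = basic_block_weights(channels)            # 1,179,648
bottleneck = bottleneck_block_weights(channels)  # 69,632
```

At the same input and output width, the bottleneck design needs far fewer weights per block under these assumptions, which is one reason it suits the deeper models.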

Pre-activation block

The Pre-activation Residual Block^[10] applies the activation functions (e.g., non-linearity and normalization) before applying the residual function ${\textstyle F}$ . Formally, the computation of a Pre-activation Residual Block can be written as:

$$x_{\ell +1}=F(\phi (x_{\ell }))+x_{\ell }$$

where ${\textstyle \phi }$ can be any non-linear activation (e.g., ReLU) or normalization (e.g., LayerNorm) operation. This design reduces the number of non-identity mappings between Residual Blocks, and it was used to train models with 200 to over 1000 layers.^[10]
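The difference between the two orderings can be seen in a scalar sketch. Here ${\textstyle \phi }$ is taken to be ReLU and ${\textstyle F}$ is a hypothetical single-weight function; both are assumptions made for illustration.

```python
def relu(x):
    return max(0.0, x)


def F(x, w=0.5):
    """Hypothetical residual function: a single scalar weight."""
    return w * x


def post_activation_block(x):
    """Original ordering: the non-linearity is applied after the addition."""
    return relu(F(x) + x)


def pre_activation_block(x):
    """Pre-activation ordering: x_{l+1} = F(phi(x_l)) + x_l, with phi = ReLU."""
    return F(relu(x)) + x


post = pre = -2.0
for _ in range(3):
    post = post_activation_block(post)
    pre = pre_activation_block(pre)

# post == 0.0: the ReLU on the main path discards the negative input.
# pre == -2.0: the identity path carries the input through unchanged.
```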

Since GPT-2, Transformer Blocks have predominantly been implemented as Pre-activation Blocks. This is often referred to as "pre-normalization" in the literature on Transformer models.^[11]

Transformer block

A Transformer Block is a stack of two Residual Blocks. Each Residual Block has a Residual Connection.

The first Residual Block is a Multi-Head Attention Block, which performs (self-)attention computation followed by a linear projection.

The second Residual Block is a feed-forward Multi-Layer Perceptron (MLP) Block. This block is analogous to an "inverse" bottleneck block: it has a linear projection layer (which is equivalent to a 1x1 convolution in the context of Convolutional Neural Networks) that increases the dimension, and another linear projection that reduces the dimension.
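The two-block structure can be sketched as follows. The sublayers are passed in as callables, and the stand-in functions used to run the example are hypothetical placeholders, not real attention or MLP implementations. The pre-normalization ordering is also an assumption of this sketch.

```python
def transformer_block(x, attention, mlp, norm):
    """Apply two pre-normalization residual blocks to a vector x."""
    # First residual block: (self-)attention sublayer plus skip connection.
    a = attention(norm(x))
    x = [xi + ai for xi, ai in zip(x, a)]
    # Second residual block: feed-forward MLP sublayer plus skip connection.
    m = mlp(norm(x))
    return [xi + mi for xi, mi in zip(x, m)]


# Hypothetical stand-ins for the real sublayers, used only to run the sketch.
def identity_norm(v):
    return list(v)


def zero_sublayer(v):
    return [0.0 for _ in v]


def halve(v):
    return [0.5 * a for a in v]


x = [1.0, 2.0, 3.0]
unchanged = transformer_block(x, zero_sublayer, zero_sublayer, identity_norm)
updated = transformer_block(x, halve, halve, identity_norm)  # [2.25, 4.5, 6.75]
```

When both sublayers output zero, the block passes its input through unchanged, which is the same identity behaviour as in a convolutional residual block.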

A Transformer Block has a depth of 4 layers (linear projections). The GPT-3 model has 96 Transformer Blocks (in the literature on Transformers, a Transformer Block is often referred to as a "Transformer Layer"). This model therefore has a depth of about 400 projection layers: 96 × 4 = 384 layers in Transformer Blocks, plus a few extra layers for input embedding and output prediction.

Very deep Transformer models cannot be successfully trained without Residual Connections.^[12]

Related work

In 1961, Frank Rosenblatt described a three-layer multilayer perceptron (MLP) model with skip connections.^[13] The model was referred to as a "cross-coupled system", and the skip connections were forms of cross-coupled connections.

In two books published in 1994^[14] and 1996,^[15] "skip-layer" connections were presented in feed-forward MLP models: "The general definition [of MLP] allows more than one hidden layer, and it also allows 'skip-layer' connections from input to output" (p. 261 in ^[14], p. 144 in ^[15]), "... which allows the non-linear units to perturb a linear functional form" (p. 262 in ^[14]). This description suggests that the non-linear MLP performs like a residual function (perturbation) added to a linear function.

Sepp Hochreiter analyzed the vanishing gradient problem in 1991 and identified it as the reason why deep learning did not work well.^[16] To overcome this problem, long short-term memory (LSTM) recurrent neural networks^[5] had skip connections or residual connections with a weight of 1.0 in every LSTM cell (called the constant error carrousel) to compute ${\textstyle y_{t+1}=F(x_{t})+x_{t}}$ . During backpropagation through time, this becomes the above-mentioned residual formula ${\textstyle y=F(x)+x}$ for feedforward neural networks. This enables training very deep recurrent neural networks with a very long time span ${\textstyle t}$ . A later LSTM version published in 2000^[17] modulates the identity LSTM connections by so-called forget gates, so that their weights are not fixed to 1.0 but can be learned. In experiments, the forget gates were initialized with positive bias weights^[17] and were thus opened, which addresses the vanishing gradient problem.

The highway network of May 2015^[4]^[18] applies these principles to feedforward neural networks. It was reported to be "the first very deep feedforward network with hundreds of layers".^[19] It is like an LSTM with forget gates unfolded in time,^[17] while the later Residual Nets have no equivalent of forget gates and are like the unfolded original LSTM.^[5] If the skip connections in Highway Networks are "without gates", or if their gates are kept open (activation 1.0) through strong positive bias weights, they become the identity skip connections in Residual Networks.
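The relationship between open gates and identity skip connections can be sketched with a scalar layer. The code below uses the form $y=H(x)\,T(x)+x\,C(x)$ with independent sigmoid gates $T$ and $C$; the independence of the two gates, the transform function `H`, and the gate weights are all assumptions of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def H(x):
    """Hypothetical transform function of the layer."""
    return math.tanh(x)


def highway_layer(x, transform_bias, carry_bias):
    """y = H(x) * T(x) + x * C(x) with independent sigmoid gates T and C."""
    t = sigmoid(0.1 * x + transform_bias)  # transform gate
    c = sigmoid(0.1 * x + carry_bias)      # carry gate
    return H(x) * t + x * c


def residual_layer(x):
    """Ungated counterpart: y = H(x) + x."""
    return H(x) + x


x = 0.7
# Strongly positive biases push both gates towards 1.0 (fully open).
y_open = highway_layer(x, transform_bias=20.0, carry_bias=20.0)
y_res = residual_layer(x)
```

With both biases set to 20, the gate activations differ from 1.0 by only a few parts per billion, so `y_open` and `y_res` agree to well within `1e-6`.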

The original Highway Network paper^[4] not only introduced the basic principle for very deep feedforward networks, but also included experimental results with 20-, 50-, and 100-layer networks, and mentioned ongoing experiments with up to 900 layers. Networks with 50 or 100 layers had lower training error than their plain network counterparts, but no lower training error than their 20-layer counterpart (on the MNIST dataset, Figure 1 in ^[4]). No improvement in test accuracy was reported with networks deeper than 19 layers (on the CIFAR-10 dataset; Table 1 in ^[4]). The ResNet paper,^[10] however, provided strong experimental evidence of the benefits of going deeper than 20 layers. It argued that the identity mapping without modulation is crucial and mentioned that modulation in the skip connection can still lead to vanishing signals in forward and backward propagation (Section 3 in ^[10]). This is also why the forget gates of the 2000 LSTM^[17] were initially opened through positive bias weights: as long as the gates are open, it behaves like the 1997 LSTM. Similarly, a Highway Net whose gates are opened through strongly positive bias weights behaves like a ResNet. The skip connections used in modern neural networks (e.g., Transformers) are predominantly identity mappings.

DenseNets (2016)^[20] were designed as deep neural networks that attempt to connect each layer to every other layer. DenseNets approach this goal by using identity mappings as skip connections. Unlike ResNets, DenseNets merge the layer output with skip connections by concatenation, not addition.
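The difference between the two merge operations can be sketched with plain lists. The function `layer` is a hypothetical stand-in that maps any input to two new features.

```python
def layer(v):
    """Hypothetical layer that maps any input to two new features."""
    s = sum(v)
    return [0.5 * s, -0.5 * s]


def resnet_merge(x, out):
    """ResNet: element-wise addition, so the width stays the same."""
    return [a + b for a, b in zip(x, out)]


def densenet_merge(x, out):
    """DenseNet: concatenation, so the width grows with every layer."""
    return x + out  # list concatenation


x = [1.0, 3.0]
added = resnet_merge(x, layer(x))           # [3.0, 1.0], width 2
concatenated = densenet_merge(x, layer(x))  # [1.0, 3.0, 2.0, -2.0], width 4
twice = densenet_merge(concatenated, layer(concatenated))  # width 6
```

With concatenation, every later layer still sees the untouched features of all earlier layers, at the cost of an input width that grows with depth.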

Neural networks with Stochastic Depth^[21] build on the Residual Network architecture. This training procedure randomly drops a subset of layers and lets the signal propagate through the identity skip connections instead. Also known as "DropPath", it is an effective regularization method for training large and deep models, such as the Vision Transformer (ViT).
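A minimal scalar sketch of the procedure follows. The residual function `F`, the survival probability of 0.8, and the scaling of the residual branch at inference time are assumptions chosen for illustration.

```python
import math
import random


def F(x):
    """Hypothetical residual function of a block."""
    return 0.5 * x


def stochastic_depth_block(x, survival_prob, training, rng):
    """Residual block whose residual branch is randomly dropped in training."""
    if training:
        if rng.random() < survival_prob:
            return F(x) + x  # block kept
        return x             # block dropped: only the identity path remains
    # At inference every block is active; the branch is scaled by its
    # survival probability to match its expected contribution in training.
    return survival_prob * F(x) + x


rng = random.Random(0)
x = 2.0
train_outputs = {stochastic_depth_block(x, 0.8, True, rng) for _ in range(200)}
test_output = stochastic_depth_block(x, 0.8, False, rng)

# During training each call returns either x (dropped) or F(x) + x (kept).
assert train_outputs <= {2.0, 3.0}
assert math.isclose(test_output, 2.8)
```

Dropping a block is only well defined because of the skip connection: without it, removing a layer would cut the signal path entirely.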

Biological relation

The original Residual Network paper made no claim of being inspired by biological systems, but later research has related Residual Networks to biologically plausible algorithms.^[22]^[23]

A study published in Science in 2023^[24] disclosed the complete connectome of an insect brain (that of a fruit fly larva). The study discovered "multilayer shortcuts" that resemble the skip connections in artificial neural networks, including ResNets.

References

[1] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). "Deep Residual Learning for Image Recognition". arXiv:1512.03385.
[2] "ILSVRC2015 Results". image-net.org.
[3] Deng, Jia; Dong, Wei; Socher, Richard; Li, Li-Jia; Li, Kai; Fei-Fei, Li (2009). "ImageNet: A large-scale hierarchical image database". CVPR.
[4] Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (3 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
[5] Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
[6] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (24 May 2017). "ImageNet classification with deep convolutional neural networks". Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386.
[7] LeCun, Yann; Bottou, Léon; Bengio, Yoshua; Haffner, Patrick (November 1998). "Gradient-based learning applied to document recognition". Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791.
[8] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
[9] Szegedy, Christian; Ioffe, Sergey; Vanhoucke, Vincent; Alemi, Alex (2016). "Inception-v4, Inception-ResNet and the impact of residual connections on learning". arXiv:1602.07261 [cs.CV].
[10] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Identity Mappings in Deep Residual Networks". arXiv:1603.05027 [cs.CV].
[11] Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners" (PDF). Archived (PDF) from the original on 6 February 2021. Retrieved 19 December 2020.
[12] Dong, Yihe; Cordonnier, Jean-Baptiste; Loukas, Andreas (2021). "Attention is not all you need: pure attention loses rank doubly exponentially with depth". arXiv:2103.03404 [cs.LG].
[13] Rosenblatt, Frank (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (PDF).
[14] Venables, W. N.; Ripley, Brian D. (1994). Modern Applied Statistics with S-Plus. Springer. ISBN 9783540943501.
[15] Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press. doi:10.1017/CBO9780511812651. ISBN 978-0-521-46086-6.
[16] Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science, advisor: J. Schmidhuber.
[17] Gers, Felix A.; Schmidhuber, Jürgen; Cummins, Fred (2000). "Learning to Forget: Continual Prediction with LSTM". Neural Computation. 12 (10): 2451–2471. CiteSeerX 10.1.1.55.5709. doi:10.1162/089976600300015015. PMID 11032042. S2CID 11598600.
[18] Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (22 July 2015). "Training Very Deep Networks". arXiv:1507.06228 [cs.LG].
[19] Schmidhuber, Jürgen (2015). "Microsoft Wins ImageNet 2015 through Highway Net (or Feedforward LSTM) without Gates".
[20] Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Weinberger, Kilian (2016). "Densely Connected Convolutional Networks". arXiv:1608.06993.
[21] Huang, Gao; Sun, Yu; Liu, Zhuang; Weinberger, Kilian (2016). "Deep Networks with Stochastic Depth". arXiv:1603.09382.
[22] Liao, Qianli; Poggio, Tomaso (2016). "Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex". arXiv:1604.03640.
[23] Xiao, Will; Chen, Honglin; Liao, Qianli; Poggio, Tomaso (2018). "Biologically-Plausible Learning Algorithms Can Scale to Large Datasets". arXiv:1811.03567.
[24] Winding, Michael; Pedigo, Benjamin; Barnes, Christopher; Patsolic, Heather; Park, Youngser; Kazimiers, Tom; Fushiki, Akira; Andrade, Ingrid; Khandelwal, Avinash; Valdes-Aleman, Javier; Li, Feng; Randel, Nadine; Barsotti, Elizabeth; Correia, Ana; Fetter, Richard; Hartenstein, Volker; Priebe, Carey; Vogelstein, Joshua; Cardona, Albert; Zlatic, Marta (10 Mar 2023). "The connectome of an insect brain". Science. 379 (6636): eadd9330. bioRxiv 10.1101/2022.11.28.516756v1. doi:10.1126/science.add9330. PMC 7614541. PMID 36893230. S2CID 254070919.
