Aman's AI Journal • Primers • Ilya Sutskever's Top 30

2026.02.13
Web · by 이호민
#AI #Deep Learning #LLM #Machine Learning #RNN

Key Points

  • This paper introduces Deep Residual Networks (ResNets), an architecture designed to ease the training of significantly deeper neural networks.
  • ResNets achieve this by using residual blocks, which let layers learn a residual mapping instead of the full underlying function.
  • This innovation facilitates training and improves accuracy with increased network depth, setting a new benchmark for image recognition.

This paper introduces Deep Residual Networks (ResNets), a novel architecture designed to facilitate the training of substantially deeper neural networks than previously feasible. The core problem addressed is the "degradation" phenomenon observed in very deep plain networks, where increasing depth leads to higher training error, and consequently higher test error. Crucially, this degradation is not caused by overfitting but by the difficulty of optimizing very deep networks.

The central innovation of ResNets is the concept of "residual learning" through identity shortcut connections. Instead of aiming to directly learn an underlying mapping $\mathcal{H}(\mathbf{x})$ from input $\mathbf{x}$ to output $\mathbf{y}$, a residual block is designed to learn a residual mapping $\mathcal{F}(\mathbf{x}) = \mathcal{H}(\mathbf{x}) - \mathbf{x}$. The output of such a block is then $\mathbf{y} = \mathcal{F}(\mathbf{x}, \{\mathbf{W}_i\}) + \mathbf{x}$. Here, $\mathbf{x}$ represents the input to the block, $\mathcal{F}$ denotes the stacked non-linear layers (e.g., convolutional layers followed by ReLU activations and Batch Normalization), and $\{\mathbf{W}_i\}$ are the weights of these layers. The term $\mathbf{x}$ is added back to the output of $\mathcal{F}$ via an identity shortcut connection, which bypasses one or more layers.

The rationale behind residual learning is that if an optimal function is an identity mapping (i.e., $\mathcal{H}(\mathbf{x}) = \mathbf{x}$), it is easier to learn the residual $\mathcal{F}(\mathbf{x}) = \mathbf{0}$ than to approximate the identity mapping itself using a stack of non-linear layers. If the optimal function is a small perturbation around an identity mapping, learning this small perturbation (the residual) is also empirically found to be simpler and more stable for optimization. This mechanism allows for the stacking of many more layers without encountering the degradation problem, as unused layers can effectively learn to be identity mappings, passing through the features from previous layers.

Technically, a residual block can be formulated as:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{\mathbf{W}_i\}) + \mathbf{x}$$

where $\mathbf{x}$ and $\mathbf{y}$ are the input and output feature maps of the block. The function $\mathcal{F}$ typically involves a sequence of convolutional layers, batch normalization, and ReLU non-linearities. For instance, a common block consists of two convolutional layers:

$$\mathcal{F}(\mathbf{x}) = \mathbf{W}_2\,\sigma(\mathbf{W}_1 \mathbf{x})$$

where $\sigma$ denotes the ReLU activation function. Batch Normalization is typically applied after each convolution and before the ReLU.
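As a rough illustration (not the paper's code), this two-layer residual computation can be sketched in pure Python, with dense matrix-vector products standing in for convolutions and Batch Normalization omitted for brevity:

```python
def relu(v):
    # element-wise ReLU: sigma in the formula above
    return [max(0.0, a) for a in v]

def matvec(W, v):
    # dense matrix-vector product, standing in for a convolution
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    # F(x) = W2 * relu(W1 * x); output y = F(x) + x via the identity shortcut
    f = matvec(W2, relu(matvec(W1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

# With all-zero weights, F(x) = 0 and the block reduces to the identity mapping,
# which is exactly the "easy to learn" case the paper argues for.
x = [1.0, -2.0, 3.0]
W_zero = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, W_zero, W_zero))  # -> [1.0, -2.0, 3.0]
```

Note how the shortcut is a plain addition, so it adds no parameters and negligible compute.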

When the input and output dimensions of $\mathcal{F}$ are different (e.g., due to stride-2 convolutions downsampling spatial dimensions or increasing feature map depth), the identity shortcut $\mathbf{x}$ cannot directly be added to $\mathcal{F}(\mathbf{x})$. The paper proposes two strategies for dimension matching:

  1. Zero-padding: For increasing dimensions, zero entries are padded onto the shortcut connection to match the increased feature map depth. This option introduces no additional parameters.
  2. Projection shortcut: A $1 \times 1$ convolution is performed on the shortcut connection to match both the dimensions and feature depth:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{\mathbf{W}_i\}) + \mathbf{W}_s \mathbf{x}$$

where $\mathbf{W}_s$ is the weight matrix of the $1 \times 1$ convolution (not square in general, since it changes the channel count). This option introduces parameters but empirically performs better than zero-padding. The authors primarily use projection shortcuts when dimensions increase and identity shortcuts otherwise.
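A minimal sketch of the projection shortcut, under the same matrix-vector stand-in for convolutions used above (the shapes here are illustrative, not from the paper): when $\mathcal{F}$ grows the feature depth from 2 to 4, a $4 \times 2$ matrix $\mathbf{W}_s$ projects the shortcut to the matching size.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def projection_block(x, W1, W2, Ws):
    # F maps a 2-dim input to a 4-dim output, so the identity shortcut cannot
    # be added directly; Ws projects x instead -- the 1x1-convolution analogue.
    f = matvec(W2, relu(matvec(W1, x)))
    s = matvec(Ws, x)  # Ws x: shortcut reshaped to F's output size
    return [fi + si for fi, si in zip(f, s)]

# 2-dim input -> 4-dim output: Ws is 4x2, hence not square.
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]                           # 2x2
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # 4x2
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # 4x2 projection
print(projection_block(x, W1, W2, Ws))  # -> [2.0, 4.0, 3.0, 0.0]
```

Zero-padding would instead extend the shortcut as `x + [0.0, 0.0]`, trading a few parameters for none at all.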

ResNet architectures are constructed by stacking these residual blocks. They typically start with an initial convolutional layer and pooling, followed by several stages of residual blocks, often reducing spatial dimensions and increasing feature map depth between stages. The network concludes with global average pooling and a fully connected layer with softmax for classification. The paper demonstrated ResNets with depths up to 152 layers, significantly deeper than VGG nets (19 layers) or plain convolutional networks of similar depth.
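The stacking described above can be sketched end to end with the same toy building blocks: a hypothetical forward pass chains residual blocks and finishes with a global-average-pooling analogue (a simple mean over the feature vector). The depth of 50 blocks below is arbitrary, chosen only to show that great depth no longer degrades the signal.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

def resnet_forward(x, blocks):
    # stack residual blocks, then "global average pool" to a single value
    for W1, W2 in blocks:
        x = residual_block(x, W1, W2)
    return sum(x) / len(x)

# 50 all-zero-weight blocks each act as the identity, so the features pass
# through unchanged -- the mechanism that sidesteps the degradation problem.
x = [1.0, 2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
deep = [(zero, zero)] * 50
print(resnet_forward(x, deep))  # -> 2.0
```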

Empirical results on the ImageNet classification dataset (ILSVRC 2012) showed that ResNets effectively overcome the degradation problem. While a 34-layer plain net had higher training error than its 18-layer counterpart, a 34-layer ResNet had lower training and validation error than both the 34-layer plain net and the 18-layer ResNet, demonstrating that deeper ResNets can be optimized more easily and gain accuracy from depth. A single ResNet-152 model achieved a top-5 error rate of 4.49% on ImageNet, significantly better than previous state-of-the-art models and below the estimated ~5% human-level error; an ensemble reached 3.57% top-5 error and won ILSVRC 2015. The benefits of ResNets were also validated on other tasks like CIFAR-10 and object detection/segmentation on COCO, establishing them as a foundational architecture in computer vision.