#

Deep Residual Learning for Image Recognition (ResNet)

This is a PyTorch implementation of the paper Deep Residual Learning for Image Recognition.

ResNets train layers as residual functions to overcome the degradation problem. The degradation problem is the drop in accuracy of deep neural networks when the number of layers becomes very high: accuracy increases as the number of layers increases, then saturates, and then starts to degrade.

The paper argues that deeper models should perform at least as well as shallower models because the extra layers can just learn to perform an identity mapping.

Residual Learning

If $H (x)$ is the mapping that needs to be learned by a few layers, they train the residual function

$F (x) = H (x) - x$

instead. The original function then becomes $F (x) + x$.

In this case, learning an identity mapping for $H (x)$ is equivalent to learning $F (x)$ to be $0$, which is easier to learn.
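
The point about $F (x) = 0$ can be checked with a small sketch. `TinyResidual` and `zero_last_layer` are hypothetical names introduced only for this illustration, and linear layers stand in for convolutions; this is not the implementation described on this page. When the last layer of $F$ is zeroed, the block returns its input unchanged.

```python
import torch
from torch import nn


class TinyResidual(nn.Module):
    """Hypothetical wrapper for illustration only: computes F(x) + x."""

    def __init__(self, features: int):
        super().__init__()
        # F(x): two linear layers with a ReLU in between
        self.f = nn.Sequential(
            nn.Linear(features, features),
            nn.ReLU(),
            nn.Linear(features, features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual function plus the identity shortcut
        return self.f(x) + x


def zero_last_layer(block: TinyResidual) -> None:
    """Zero the last layer of F so that F(x) = 0 for every x."""
    last = block.f[-1]
    nn.init.zeros_(last.weight)
    nn.init.zeros_(last.bias)


torch.manual_seed(0)
block = TinyResidual(4)
zero_last_layer(block)

x = torch.randn(2, 4)
y = block(x)  # With F(x) = 0 the block is an identity mapping
```

So the identity mapping corresponds to driving a set of weights to zero, rather than fitting an identity through a stack of non-linear layers.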

In the parameterized form this can be written as,

$F (x, \{W_{i}\}) + x$

and when the feature map sizes of $F (x, \{W_{i}\})$ and $x$ are different, the paper suggests doing a linear projection with learned weights $W_{s}$:

$F (x, \{W_{i}\}) + W_{s} x$

The paper experimented with zero padding instead of linear projections and found linear projections to work slightly better. When the feature map sizes match, it uses identity mappings: projections on every shortcut gave only a marginal gain at the cost of extra parameters and computation.

$F$ should have more than one layer; otherwise the sum $F (x, \{W_{i}\}) + W_{s} x$ has no non-linearity and behaves like a single linear layer.
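
To see why, take $F$ to be a single linear layer with weights $W_{1}$ and ignore biases (a simplifying assumption for this sketch):

$F (x, \{W_{i}\}) + W_{s} x = W_{1} x + W_{s} x = (W_{1} + W_{s}) x$

This is one linear map, so the shortcut adds nothing that a plain linear layer could not express. With two layers, as in the paper, $F = W_{2} \sigma (W_{1} x)$ where $\sigma$ is ReLU, and the non-linearity keeps the sum from collapsing in this way.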

Here is the training code for a ResNet on CIFAR-10.

from typing import List, Optional

import torch
from torch import nn

#

Linear projections for shortcut connection

This does the $W_{s} x$ projection described above.

class ShortcutProjection(nn.Module):

#

in_channels is the number of channels in $x$
out_channels is the number of channels in $F (x, \{W_{i}\})$
stride is the stride length in the convolution operation for $F$ . We do the same stride on the shortcut connection, to match the feature-map size.

    def __init__(self, in_channels: int, out_channels: int, stride: int):

#

        super().__init__()

#

Convolution layer for linear projection $W_{s} x$

        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)

#

The paper suggests adding batch normalization after each convolution operation

        self.bn = nn.BatchNorm2d(out_channels)

#

    def forward(self, x: torch.Tensor):

#

Convolution and batch normalization

        return self.bn(self.conv(x))
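
A quick shape check makes the effect of the projection concrete. The sketch below uses an `nn.Sequential` stand-in with the same two layers instead of the class above, and the sizes (16 to 32 channels, 32 x 32 input, stride 2) are assumptions chosen for illustration.

```python
import torch
from torch import nn

# Stand-in for the projection: 1x1 convolution followed by batch normalization
projection = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=1, stride=2),
    nn.BatchNorm2d(32),
)

# Input of shape [batch_size, in_channels, height, width]
x = torch.randn(2, 16, 32, 32)

# The channels go from 16 to 32 and the stride halves height and width
y = projection(x)
print(y.shape)
```

The projected shortcut has the same shape as the output of a residual function that uses the same stride and output channel count, so the two can be added element-wise.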

#

Residual Block

This implements the residual block described in the paper. It has two $3 \times 3$ convolution layers.

The first convolution layer maps from in_channels to out_channels, where out_channels is higher than in_channels when we reduce the feature map size with a stride length greater than $1$.

The second convolution layer maps from out_channels to out_channels and always has a stride length of 1.

Both convolution layers are followed by batch normalization.
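
The feature map sizes on the two paths can be worked out with the standard convolution output-size formula. The helper `conv_out_size` below is a hypothetical name for this sketch; it assumes a dilation of 1, a padding of 1 for the $3 \times 3$ convolutions, and no padding for the $1 \times 1$ shortcut projection.

```python
def conv_out_size(size: int, kernel_size: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Residual path with stride 1: both 3x3 convolutions keep the size
main_stride_1 = conv_out_size(conv_out_size(32, 3, 1, 1), 3, 1, 1)

# Residual path with stride 2: the first 3x3 convolution halves the size,
# the second one (always stride 1) keeps it
main_stride_2 = conv_out_size(conv_out_size(32, 3, 2, 1), 3, 1, 1)

# Shortcut path with stride 2: a 1x1 convolution without padding
shortcut_stride_2 = conv_out_size(32, 1, 2, 0)

print(main_stride_1, main_stride_2, shortcut_stride_2)
```

With these settings the residual path and the shortcut path end up with the same spatial size, which is what allows them to be summed.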

class ResidualBlock(nn.Module):

#

in_channels is the number of channels in $x$
out_channels is the number of output channels
stride is the stride length in the convolution operation.

    def __init__(self, in_channels: int, out_channels: int, stride: int):

#

        super().__init__()

#

First $3 \times 3$ convolution layer; this maps to out_channels

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)

#

Batch normalization after the first convolution

        self.bn1 = nn.BatchNorm2d(out_channels)

#

First activation function (ReLU)

        self.act1 = nn.ReLU()

#

Second $3 \times 3$ convolution layer

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)

#

Batch normalization after the second convolution

        self.bn2 = nn.BatchNorm2d(out_channels)

#

The shortcut connection should be a projection if the stride length is not $1$ or if the number of channels changes

        if stride != 1 or in_channels != out_channels:

#

Projection $W_{s} x$

            self.shortcut = ShortcutProjection(in_channels, out_channels, stride)
        else:

#

Identity $x$

            self.shortcut = nn.Identity()

#

Second activation function (ReLU) (after adding the shortcut)

        self.act2 = nn.ReLU()

#

x is the input of shape [batch_size, in_channels, height, width]

    def forward(self, x: torch.Tensor):

#

Get the shortcut connection

        shortcut = self.shortcut(x)

#

First convolution and activation

        x = self.act1(self.bn1(self.conv1(x)))

#

Second convolution

        x = self.bn2(self.conv2(x))

#

Activation function after adding the shortcut

        return self.act2(x + shortcut)

#

Bottleneck Residual Block

This implements the bottleneck block described in the paper. It has $1 \times 1$ , $3 \times 3$ , and $1 \times 1$ convolution layers.

The first convolution layer maps from in_channels to bottleneck_channels with a $1 \times 1$ convolution, where bottleneck_channels is lower than in_channels.

The second $3 \times 3$ convolution layer maps from bottleneck_channels to bottleneck_channels . This can have a stride length greater than $1$ when we want to compress the feature map size.

The third, final $1 \times 1$ convolution layer maps to out_channels. out_channels is higher than in_channels if the stride length is greater than $1$; otherwise, out_channels is equal to in_channels.

bottleneck_channels is less than in_channels, and the $3 \times 3$ convolution is performed on this shrunk space (hence the bottleneck). The two $1 \times 1$ convolutions decrease and then increase the number of channels.
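
A rough weight count shows what the bottleneck saves. This is a back-of-the-envelope sketch: `conv_weights` is a hypothetical helper, biases and batch normalization parameters are ignored, and the 256 and 64 channel counts are illustrative values.

```python
def conv_weights(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Number of weights in a 2D convolution, ignoring the bias."""
    return in_channels * out_channels * kernel_size * kernel_size


# Bottleneck block: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256)
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

# Two 3x3 convolutions applied directly on 256 channels
plain = 2 * conv_weights(256, 256, 3)

print(bottleneck, plain)
```

Under these assumptions the bottleneck block uses far fewer weights than two $3 \times 3$ convolutions on the full channel count, because the expensive $3 \times 3$ convolution only sees the reduced number of channels.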

class BottleneckResidualBlock(nn.Module):

#

in_channels is the number of channels in $x$
bottleneck_channels is the number of channels for the $3 \times 3$ convolution
out_channels is the number of output channels
stride is the stride length in the $3 \times 3$ convolution operation.

    def __init__(self, in_channels: int, bottleneck_channels: int, out_channels: int, stride: int):

#

        super().__init__()

#

First $1 \times 1$ convolution layer; this maps to bottleneck_channels

        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1, stride=1)

#

Batch normalization after the first convolution

        self.bn1 = nn.BatchNorm2d(bottleneck_channels)

#

First activation function (ReLU)

        self.act1 = nn.ReLU()

#

Second $3 \times 3$ convolution layer

        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3, stride=stride, padding=1)

#

Batch normalization after the second convolution

        self.bn2 = nn.BatchNorm2d(bottleneck_channels)

#

Second activation function (ReLU)

        self.act2 = nn.ReLU()

#

Third $1 \times 1$ convolution layer; this maps to out_channels.

        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, kernel_size=1, stride=1)

#

Batch normalization after the third convolution

        self.bn3 = nn.BatchNorm2d(out_channels)

#

The shortcut connection should be a projection if the stride length is not $1$ or if the number of channels changes

        if stride != 1 or in_channels != out_channels:

#

Projection $W_{s} x$

            self.shortcut = ShortcutProjection(in_channels, out_channels, stride)
        else:

#

Identity $x$

            self.shortcut = nn.Identity()

#

Third activation function (ReLU) (after adding the shortcut)

        self.act3 = nn.ReLU()

#

x is the input of shape [batch_size, in_channels, height, width]

    def forward(self, x: torch.Tensor):

#

Get the shortcut connection

        shortcut = self.shortcut(x)

#

First convolution and activation

        x = self.act1(self.bn1(self.conv1(x)))

#

Second convolution and activation

        x = self.act2(self.bn2(self.conv2(x)))

#

Third convolution

        x = self.bn3(self.conv3(x))

#

Activation function after adding the shortcut

        return self.act3(x + shortcut)

#

ResNet Model

This is the base of the ResNet model, without the final linear layer and softmax for classification.

The ResNet is made of stacked residual blocks or bottleneck residual blocks. The feature map size is halved after a few blocks by a block with a stride length of $2$. The number of channels is increased when the feature map size is reduced. Finally, the feature map is average pooled to get a vector representation.
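
The stacking rule can be sketched in plain Python before reading the class. `block_plan` is a hypothetical helper written for this illustration; it assumes the stride rule stated in the text (the first block of each feature map size has a stride length of $2$, except for the very first block), and the block counts and channel counts in the example are assumed values.

```python
from typing import List, Tuple


def block_plan(n_blocks: List[int], n_channels: List[int]) -> List[Tuple[int, int, int]]:
    """Return (in_channels, out_channels, stride) for every block, in order."""
    assert len(n_blocks) == len(n_channels)
    plan = []
    # The initial convolution is assumed to output n_channels[0] channels
    prev_channels = n_channels[0]
    for i, channels in enumerate(n_channels):
        # First block of a feature map size: stride 2, except for the very first
        stride = 1 if i == 0 else 2
        plan.append((prev_channels, channels, stride))
        # Remaining blocks keep the feature map size and the channel count
        for _ in range(n_blocks[i] - 1):
            plan.append((channels, channels, 1))
        prev_channels = channels
    return plan


plan = block_plan([2, 2, 2, 2], [64, 128, 256, 512])
for in_channels, out_channels, stride in plan:
    print(in_channels, out_channels, stride)
```

Every entry where the stride is $2$ or the channel count changes is a block that needs a projection shortcut; all other blocks can use the identity.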

class ResNetBase(nn.Module):

#

n_blocks is a list of the number of blocks for each feature map size.
n_channels is the number of channels for each feature map size.
bottlenecks is the number of bottleneck channels for each feature map size. If this is None, residual blocks are used.
img_channels is the number of channels in the input.
first_kernel_size is the kernel size of the initial convolution layer.

    def __init__(self, n_blocks: List[int], n_channels: List[int],
                 bottlenecks: Optional[List[int]] = None,
                 img_channels: int = 3, first_kernel_size: int = 7):

#

        super().__init__()

#

Number of blocks and number of channels for each feature map size

        assert len(n_blocks) == len(n_channels)

#

If bottleneck residual blocks are used, the number of channels in bottlenecks should be provided for each feature map size

        assert bottlenecks is None or len(bottlenecks) == len(n_channels)

#

Initial convolution layer maps from img_channels to number of channels in the first residual block (n_channels[0] )

        self.conv = nn.Conv2d(img_channels, n_channels[0],
                              kernel_size=first_kernel_size, stride=2, padding=first_kernel_size // 2)

#

Batch norm after initial convolution

        self.bn = nn.BatchNorm2d(n_channels[0])

#

List of blocks

        blocks = []

#

Number of channels from previous layer (or block)

        prev_channels = n_channels[0]

#

Loop through each feature map size

        for i, channels in enumerate(n_channels):

#

The first block for each new feature map size will have a stride length of $2$, except for the very first block

            stride = 1 if len(blocks) == 0 else 2

            if bottlenecks is None:

#

Residual block that maps from prev_channels to channels

                blocks.append(ResidualBlock(prev_channels, channels, stride=stride))
            else:

#

Bottleneck residual block that maps from prev_channels to channels

                blocks.append(BottleneckResidualBlock(prev_channels, bottlenecks[i], channels,
                                                      stride=stride))

#

Change the number of channels

            prev_channels = channels

#

Add rest of the blocks - no change in feature map size or channels

            for _ in range(n_blocks[i] - 1):
                if bottlenecks is None:

#

residual blocks

                    blocks.append(ResidualBlock(channels, channels, stride=1))
                else:

#

bottleneck residual blocks

                    blocks.append(BottleneckResidualBlock(channels, bottlenecks[i], channels, stride=1))

#

Stack the blocks

        self.blocks = nn.Sequential(*blocks)

#

x has shape [batch_size, img_channels, height, width]

    def forward(self, x: torch.Tensor):

#

Initial convolution and batch normalization

        x = self.bn(self.conv(x))

#

Residual (or bottleneck) blocks

        x = self.blocks(x)

#

Change x from shape [batch_size, channels, h, w] to [batch_size, channels, h * w]

        x = x.view(x.shape[0], x.shape[1], -1)

#

Global average pooling

        return x.mean(dim=-1)
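
The last two steps amount to global average pooling. The sketch below compares them with PyTorch's built-in `nn.AdaptiveAvgPool2d` on a random tensor, and then attaches a classification head. The head, its 10 classes, and the tensor sizes are assumptions for illustration; the model above deliberately leaves the final linear layer out.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the output of the last block: [batch_size, channels, h, w]
x = torch.randn(2, 8, 4, 4)

# Pooling as written above: flatten the spatial dimensions, then average them
pooled = x.view(x.shape[0], x.shape[1], -1).mean(dim=-1)

# The same pooling with a built-in layer
pooled_builtin = nn.AdaptiveAvgPool2d(1)(x).flatten(1)

# Hypothetical classification head (not part of ResNetBase); 10 classes assumed
head = nn.Linear(8, 10)
logits = head(pooled)

print(pooled.shape, logits.shape)
```

Both pooling variants give one value per channel, so the vector length equals the number of channels in the last block regardless of the input image size.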