From 42257b7130c27254e95ee52b79bd5c874efda963 Mon Sep 17 00:00:00 2001
From: Varuna Jayasiri
Date: Sun, 10 Apr 2022 13:41:33 +0530
Subject: [PATCH] resnet katex fix

---
 docs/resnet/index.html      | 66 ++++++++++++++++++-------------------
 labml_nn/resnet/__init__.py | 14 ++++----
 2 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/docs/resnet/index.html b/docs/resnet/index.html
index 4d975f2f..355c04ac 100644
--- a/docs/resnet/index.html
+++ b/docs/resnet/index.html
@@ -73,16 +73,16 @@

ResNets train layers as residual functions to overcome the degradation problem. The degradation problem is the phenomenon where the accuracy of deep neural networks degrades when the number of layers becomes very large: accuracy increases as layers are added, then saturates, and then starts to degrade.

The paper argues that deeper models should perform at least as well as shallower models because the extra layers can just learn to perform an identity mapping.

Residual Learning

If $\mathcal{H}(x)$ is the mapping that needs to be learned by a few layers, they train the residual function

$$\mathcal{F}(x) = \mathcal{H}(x) - x$$

instead. And the original function becomes $\mathcal{H}(x) = \mathcal{F}(x) + x$.

In this case, learning the identity mapping for $\mathcal{H}(x)$ is equivalent to learning $\mathcal{F}(x)$ to be $0$, which is easier to learn.
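The residual reformulation can be illustrated with a hypothetical few lines of Python (a toy sketch, not code from this repository): when the desired mapping $\mathcal{H}$ is the identity, the residual $\mathcal{F}$ only has to produce zeros.

```python
# Toy illustration of residual learning: H(x) = F(x) + x.
# If the target mapping is the identity, the residual F just has to
# learn to output zero, which is easier than learning an explicit
# identity transformation through several non-linear layers.

def original_mapping(x, residual_fn):
    """Recover H(x) from the residual function F via H(x) = F(x) + x."""
    return residual_fn(x) + x

def zero_residual(x):
    """A residual function that has collapsed to zero."""
    return 0.0

# With a zero residual, the block computes the identity mapping exactly.
print(original_mapping(3.5, zero_residual))  # -> 3.5
```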

In the parameterized form this can be written as,

$$y = \mathcal{F}(x, \{W_i\}) + x$$

and when the feature map sizes of $\mathcal{F}(x, \{W_i\})$ and $x$ are different the paper suggests doing a linear projection, with learned weights $W_s$.

$$y = \mathcal{F}(x, \{W_i\}) + W_s x$$

The paper experimented with zero padding instead of linear projections and found linear projections to work better. Also, when the feature map sizes match, they found identity mapping to be better than linear projections.

$\mathcal{F}$ should have more than one layer; otherwise the sum $\mathcal{F}(x, \{W_i\}) + x$ also won't have non-linearities and will behave like a single linear layer.
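A minimal sketch of this shortcut rule (assumed class and layer names, not the repository's implementation): identity when the feature sizes match, and a learned projection $W_s$ otherwise.

```python
import torch
from torch import nn

class Shortcut(nn.Module):
    """Identity when feature sizes match; learned projection W_s otherwise."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        if in_features == out_features:
            # Identity mapping, found to work better than a projection
            # when the sizes already match
            self.proj = nn.Identity()
        else:
            # Linear projection W_s x with learned weights
            self.proj = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

x = torch.randn(4, 16)
assert Shortcut(16, 16)(x).shape == (4, 16)  # identity
assert Shortcut(16, 32)(x).shape == (4, 32)  # projection
```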

Here is the training code for training a ResNet on CIFAR-10.

View Run

@@ -102,7 +102,7 @@ #

Linear projections for shortcut connection

This does the projection $W_s x$ described above.

@@ -115,11 +115,11 @@ #
+ `stride` is the stride length in the convolution operation for $\mathcal{F}(x)$. We do the same stride on the shortcut connection, to match the feature-map size.
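A quick shape check (illustrative channel sizes, not the repository's code) shows why the shortcut's $1 \times 1$ projection uses the same stride as the main-path convolution:

```python
import torch
from torch import nn

stride = 2
# Main-path 3x3 convolution with stride 2 halves the feature-map size
main = nn.Conv2d(16, 32, kernel_size=3, stride=stride, padding=1)
# 1x1 projection on the shortcut with the same stride, so the two
# outputs can be summed element-wise
shortcut = nn.Conv2d(16, 32, kernel_size=1, stride=stride)

x = torch.randn(1, 16, 8, 8)
assert main(x).shape == shortcut(x).shape == (1, 32, 4, 4)
```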
@@ -142,7 +142,7 @@

Convolution layer for linear projection $W_s x$

@@ -197,7 +197,7 @@

The first convolution layer maps to `out_channels`, where the `out_channels` is higher than `in_channels` when we reduce the feature map size with a stride length greater than $1$.

The second convolution layer maps from `out_channels` to `out_channels` and always has a stride length of $1$.
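The two-convolution residual block described above can be sketched as follows (a simplified version with assumed names, omitting the batch normalization the repository uses):

```python
import torch
from torch import nn
import torch.nn.functional as F

class ResidualBlockSketch(nn.Module):
    """Simplified residual block: two 3x3 convolutions plus a shortcut."""
    def __init__(self, in_channels: int, out_channels: int, stride: int):
        super().__init__()
        # First 3x3 convolution: may change channels and reduce map size
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1)
        # Second 3x3 convolution: keeps channels, stride is always 1
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1)
        # Projection shortcut when shapes change, identity otherwise
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1,
                                      stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(F.relu(self.conv1(x)))
        return F.relu(out + self.shortcut(x))

x = torch.randn(1, 16, 8, 8)
assert ResidualBlockSketch(16, 16, 1)(x).shape == (1, 16, 8, 8)
assert ResidualBlockSketch(16, 32, 2)(x).shape == (1, 32, 4, 4)
```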

@@ -214,7 +214,7 @@ #
@@ -474,7 +474,7 @@

First convolution layer, this maps to `bottleneck_channels`

@@ -547,7 +547,7 @@

Third convolution layer, this maps to `out_channels`.

@@ -572,7 +572,7 @@

Shortcut connection should be a projection if the stride length is not $1$ or if the number of channels changes

@@ -584,7 +584,7 @@

Projection

@@ -597,7 +597,7 @@

Identity

diff --git a/labml_nn/resnet/__init__.py b/labml_nn/resnet/__init__.py
index 48516bf8..ea6f23e1 100644
--- a/labml_nn/resnet/__init__.py
+++ b/labml_nn/resnet/__init__.py
@@ -166,29 +166,29 @@ class BottleneckResidualBlock(Module):
 
     ![Bottlenext Block](bottleneck_block.svg)
 
-    The first convolution layer maps from `in_channels` to `bottleneck_channels` with a $1x1$
+    The first convolution layer maps from `in_channels` to `bottleneck_channels` with a $1 \times 1$
     convolution, where the `bottleneck_channels` is lower than `in_channels`.
 
-    The second $3x3$ convolution layer maps from `bottleneck_channels` to `bottleneck_channels`.
+    The second $3 \times 3$ convolution layer maps from `bottleneck_channels` to `bottleneck_channels`.
     This can have a stride length greater than $1$ when we want to compress the feature map size.
 
-    The third, final $1x1$ convolution layer maps to `out_channels`.
+    The third, final $1 \times 1$ convolution layer maps to `out_channels`.
     `out_channels` is higher than `in_channels` if the stride length is greater than $1$;
     otherwise, $out_channels$ is equal to `in_channels`.
 
-    `bottleneck_channels` is less than `in_channels` and the $3x3$ convolution is performed
-    on this shrunk space (hence the bottleneck). The two $1x1$ convolution decreases and increases
+    `bottleneck_channels` is less than `in_channels` and the $3 \times 3$ convolution is performed
+    on this shrunk space (hence the bottleneck). The two $1 \times 1$ convolutions decrease and increase
     the number of channels.
     """
 
     def __init__(self, in_channels: int, bottleneck_channels: int, out_channels: int, stride: int):
         """
         * `in_channels` is the number of channels in $x$
-        * `bottleneck_channels` is the number of channels for the $3x3$ convlution
+        * `bottleneck_channels` is the number of channels for the $3 \times 3$ convolution
         * `out_channels` is the number of output channels
-        * `stride` is the stride length in the $3x3$ convolution operation.
+        * `stride` is the stride length in the $3 \times 3$ convolution operation.
         """
         super().__init__()
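The bottleneck structure in the docstring above can be sketched like this (a simplified version with assumed names, omitting the batch normalization the repository uses): a $1 \times 1$ convolution down to `bottleneck_channels`, a $3 \times 3$ convolution in the shrunk space, then a $1 \times 1$ convolution back up to `out_channels`.

```python
import torch
from torch import nn
import torch.nn.functional as F

class BottleneckSketch(nn.Module):
    """Simplified bottleneck residual block: 1x1 reduce, 3x3, 1x1 expand."""
    def __init__(self, in_channels: int, bottleneck_channels: int,
                 out_channels: int, stride: int):
        super().__init__()
        # 1x1 convolution reduces channels to the bottleneck
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        # 3x3 convolution in the shrunk space; may stride to shrink the map
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                               stride=stride, padding=1)
        # 1x1 convolution expands channels back up
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        # Projection shortcut when the stride or channel count changes
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1,
                                      stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.conv1(x))
        out = F.relu(self.conv2(out))
        out = self.conv3(out)
        return F.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 8, 8)
assert BottleneckSketch(64, 16, 64, 1)(x).shape == (1, 64, 8, 8)
assert BottleneckSketch(64, 32, 128, 2)(x).shape == (1, 128, 4, 4)
```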