📚 batch norm mistakes
@@ -82,14 +82,14 @@ network parameters during training.
 For example, let’s say there are two layers $l_1$ and $l_2$.
 During the beginning of the training $l_1$ outputs (inputs to $l_2$)
 could be in distribution $\mathcal{N}(0.5, 1)$.
-Then, after some training steps it could move to $\mathcal{N}(0.5, 1)$.
+Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$.
 This is <em>internal covariate shift</em>.</p>
-<p>Internal covriate shift will adversely affect training speed because the later layers
+<p>Internal covariate shift will adversely affect training speed because the later layers
 ($l_2$ in the above example) has to adapt to this shifted distribution.</p>
 <p>By stabilizing the distribution batch normalization minimizes the internal covariate shift.</p>
 <h2>Normalization</h2>
 <p>It is known that whitening improves training speed and convergence.
-<em>Whitening</em> is linearly transforming inputs to have zero mean, unit variance
+<em>Whitening</em> is linearly transforming inputs to have zero mean, unit variance,
 and be uncorrelated.</p>
 <h3>Normalizing outside gradient computation doesn’t work</h3>
 <p>Normalizing outside the gradient computation using pre-computed (detached)
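A minimal PyTorch sketch of the whitening transform described in this hunk (zero mean, unit variance, uncorrelated features); the function name `whiten` and the PCA-based approach are illustrative assumptions, not code from this repository:

```python
import torch

def whiten(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch_size, features)
    x = x - x.mean(dim=0)                      # zero mean per feature
    cov = x.T @ x / x.shape[0]                 # feature covariance matrix
    eigvals, eigvecs = torch.linalg.eigh(cov)  # decorrelating directions
    # Rotate onto the eigenvectors (decorrelate), then rescale
    # each direction to unit variance.
    return (x @ eigvecs) / torch.sqrt(eigvals + eps)
```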
@@ -102,7 +102,7 @@ Therefore,
 $b$ will increase or decrease based
 $\frac{\partial{\mathcal{L}}}{\partial x}$,
 and keep on growing indefinitely in each training update.
-Paper notes that similar explosions happen with variances.</p>
+The paper notes that similar explosions happen with variances.</p>
 <h3>Batch Normalization</h3>
 <p>Whitening is computationally expensive because you need to de-correlate and
 the gradients must flow through the full whitening calculation.</p>
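The failure mode discussed in this hunk can be reproduced in a few lines of PyTorch (an illustrative sketch, not repository code): because the mean is detached from the gradient computation, every update moves the bias $b$, yet the normalized output that the loss sees never changes, so $b$ drifts without bound.

```python
import torch

u = torch.randn(32)                       # fixed input to the layer
b = torch.zeros(1, requires_grad=True)    # bias of x = u + b

for step in range(1000):
    x = u + b
    x_hat = x - x.mean().detach()         # mean pre-computed, outside the graph
    loss = x_hat.sum()                    # any loss that pushes x_hat down
    loss.backward()
    with torch.no_grad():
        b -= 0.1 * b.grad                 # the update changes b...
        b.grad.zero_()
    # ...but x_hat (and the loss) never change, because the freshly
    # computed mean absorbs b again: b keeps growing in magnitude.
```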
@@ -114,14 +114,14 @@ where $x = (x^{(1)} … x^{(d)})$ is the $d$-dimensional input.</p>
 <p>The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
 and variance $Var[x^{(k)}]$ from the mini-batch
 for normalization; instead of calculating the mean and variance across whole dataset.</p>
-<p>Normalizing each feature to zero mean and unit variance could effect what the layer
+<p>Normalizing each feature to zero mean and unit variance could affect what the layer
 can represent.
 As an example paper illustrates that, if the inputs to a sigmoid are normalized
 most of it will be within $[-1, 1]$ range where the sigmoid is linear.
 To overcome this each feature is scaled and shifted by two trained parameters
 $\gamma^{(k)}$ and $\beta^{(k)}$.
 <script type="math/tex; mode=display">y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}</script>
-where $y^{(k)}$ is the output of of the batch normalization layer.</p>
+where $y^{(k)}$ is the output of the batch normalization layer.</p>
 <p>Note that when applying batch normalization after a linear transform
 like $Wu + b$ the bias parameter $b$ gets cancelled due to normalization.
 So you can and should omit bias parameter in linear transforms right before the
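Putting the two simplifications from this hunk together, a batch normalization forward pass might look like the sketch below; this is a minimal assumed version for (batch, features) inputs, not the repository's actual implementation:

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # x: (batch_size, features); gamma, beta: (features,) trained parameters
    mean = x.mean(dim=0)                   # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)     # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta            # y^(k) = gamma^(k) x_hat^(k) + beta^(k)
```

The sketch also shows why the preceding bias is redundant: any constant added to a feature is removed by the mean subtraction, which is why a linear layer feeding into batch normalization can be created with `bias=False`.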
@@ -134,7 +134,7 @@ perform the normalization.
 So during inference, you either need to go through the whole (or part of) dataset
 and find the mean and variance, or you can use an estimate calculated during training.
 The usual practice is to calculate an exponential moving average of
-mean and variance during training phase and use that for inference.</p>
+mean and variance during the training phase and use that for inference.</p>
 <p>Here’s <a href="mnist.html">the training code</a> and a notebook for training
 a CNN classifier that use batch normalization for MNIST dataset.</p>
 <p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
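The exponential moving average mentioned in this hunk can be sketched as follows; the momentum value of 0.1 and the variable names are assumed, commonly used conventions rather than values from this page:

```python
import torch

features, eps, momentum = 10, 1e-5, 0.1
running_mean = torch.zeros(features)       # estimates kept across training
running_var = torch.ones(features)

# During training, after computing each mini-batch's statistics:
x = torch.randn(32, features)
batch_mean = x.mean(dim=0)
batch_var = x.var(dim=0, unbiased=False)
running_mean = (1 - momentum) * running_mean + momentum * batch_mean
running_var = (1 - momentum) * running_var + momentum * batch_var

# At inference time, normalize with the running estimates
# instead of per-batch statistics:
x_hat = (x - running_mean) / torch.sqrt(running_var + eps)
```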
@@ -18,10 +18,10 @@ network parameters during training.
 For example, let's say there are two layers $l_1$ and $l_2$.
 During the beginning of the training $l_1$ outputs (inputs to $l_2$)
 could be in distribution $\mathcal{N}(0.5, 1)$.
-Then, after some training steps it could move to $\mathcal{N}(0.5, 1)$.
+Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$.
 This is *internal covariate shift*.
 
-Internal covriate shift will adversely affect training speed because the later layers
+Internal covariate shift will adversely affect training speed because the later layers
 ($l_2$ in the above example) has to adapt to this shifted distribution.
 
 By stabilizing the distribution batch normalization minimizes the internal covariate shift.
@@ -29,7 +29,7 @@ By stabilizing the distribution batch normalization minimizes the internal covar
 ## Normalization
 
 It is known that whitening improves training speed and convergence.
-*Whitening* is linearly transforming inputs to have zero mean, unit variance
+*Whitening* is linearly transforming inputs to have zero mean, unit variance,
 and be uncorrelated.
 
 ### Normalizing outside gradient computation doesn't work
@@ -45,7 +45,7 @@ Therefore,
 $b$ will increase or decrease based
 $\frac{\partial{\mathcal{L}}}{\partial x}$,
 and keep on growing indefinitely in each training update.
-Paper notes that similar explosions happen with variances.
+The paper notes that similar explosions happen with variances.
 
 ### Batch Normalization
 
@@ -62,14 +62,14 @@ The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
 and variance $Var[x^{(k)}]$ from the mini-batch
 for normalization; instead of calculating the mean and variance across whole dataset.
 
-Normalizing each feature to zero mean and unit variance could effect what the layer
+Normalizing each feature to zero mean and unit variance could affect what the layer
 can represent.
 As an example paper illustrates that, if the inputs to a sigmoid are normalized
 most of it will be within $[-1, 1]$ range where the sigmoid is linear.
 To overcome this each feature is scaled and shifted by two trained parameters
 $\gamma^{(k)}$ and $\beta^{(k)}$.
 $$y^{(k)} =\gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$
-where $y^{(k)}$ is the output of of the batch normalization layer.
+where $y^{(k)}$ is the output of the batch normalization layer.
 
 Note that when applying batch normalization after a linear transform
 like $Wu + b$ the bias parameter $b$ gets cancelled due to normalization.
@@ -86,7 +86,7 @@ perform the normalization.
 So during inference, you either need to go through the whole (or part of) dataset
 and find the mean and variance, or you can use an estimate calculated during training.
 The usual practice is to calculate an exponential moving average of
-mean and variance during training phase and use that for inference.
+mean and variance during the training phase and use that for inference.
 
 Here's [the training code](mnist.html) and a notebook for training
 a CNN classifier that use batch normalization for MNIST dataset.