✍️ typos
@@ -85,8 +85,8 @@ could be in distribution $\mathcal{N}(0.5, 1)$.
 Then, after some training steps, it could move to $\mathcal{N}(0.6, 1.5)$.
 This is <em>internal covariate shift</em>.</p>
 <p>Internal covariate shift will adversely affect training speed because the later layers
-($l_2$ in the above example) has to adapt to this shifted distribution.</p>
-<p>By stabilizing the distribution batch normalization minimizes the internal covariate shift.</p>
+($l_2$ in the above example) have to adapt to this shifted distribution.</p>
+<p>By stabilizing the distribution, batch normalization minimizes the internal covariate shift.</p>
 <h2>Normalization</h2>
 <p>It is known that whitening improves training speed and convergence.
 <em>Whitening</em> is linearly transforming inputs to have zero mean, unit variance,
@@ -95,9 +95,9 @@ and be uncorrelated.</p>
 <p>Normalizing outside the gradient computation using pre-computed (detached)
 means and variances doesn’t work. For instance (ignoring variance), let
 <script type="math/tex; mode=display">\hat{x} = x - \mathbb{E}[x]</script>
-where $x = u + b$ and $b$ is a trained bias.
-and $\mathbb{E}[x]$ is outside gradient computation (pre-computed constant).</p>
-<p>Note that $\hat{x}$ has no effect of $b$.
+where $x = u + b$ and $b$ is a trained bias
+and $\mathbb{E}[x]$ is an outside gradient computation (pre-computed constant).</p>
+<p>Note that $\hat{x}$ has no effect on $b$.
 Therefore,
 $b$ will increase or decrease based on
 $\frac{\partial{\mathcal{L}}}{\partial x}$,
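The following is a minimal PyTorch sketch of the failure described above; it is not code from the paper or from this module, and the input u, the bias b and the toy loss are illustrative assumptions. Because the mean is detached, the gradient still pushes b, while the freshly re-computed mean keeps cancelling the shift, so b drifts without bound and $\hat{x}$ (and the loss) never changes.

import torch

u = torch.randn(256)                     # stand-in for a layer's input (assumed)
b = torch.zeros(1, requires_grad=True)   # trained bias

for step in range(3):
    x = u + b
    x_hat = x - x.mean().detach()        # E[x] is kept outside the gradient computation
    loss = x_hat.sum()                   # any toy loss defined on x_hat
    loss.backward()
    with torch.no_grad():
        b -= 0.1 * b.grad                # b keeps moving step after step...
        b.grad.zero_()
    # ...but x_hat is identical in every iteration, because the shift by b is
    # removed again by the re-computed, detached mean.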
@@ -106,14 +106,14 @@ The paper notes that similar explosions happen with variances.</p>
 <h3>Batch Normalization</h3>
 <p>Whitening is computationally expensive because you need to de-correlate and
 the gradients must flow through the full whitening calculation.</p>
-<p>The paper introduces simplified version which they call <em>Batch Normalization</em>.
+<p>The paper introduces a simplified version which they call <em>Batch Normalization</em>.
 First simplification is that it normalizes each feature independently to have
 zero mean and unit variance:
 <script type="math/tex; mode=display">\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}</script>
 where $x = (x^{(1)} … x^{(d)})$ is the $d$-dimensional input.</p>
 <p>The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
 and variance $Var[x^{(k)}]$ from the mini-batch
-for normalization; instead of calculating the mean and variance across whole dataset.</p>
+for normalization; instead of calculating the mean and variance across the whole dataset.</p>
 <p>Normalizing each feature to zero mean and unit variance could affect what the layer
 can represent.
 As an example, the paper illustrates that, if the inputs to a sigmoid are normalized
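As a concrete, hedged reading of the two simplifications above, here is a short sketch for a plain [batch_size, features] input; gamma, beta and eps are illustrative stand-ins for the learned scale, shift and numerical-stability constant, not the attributes of the module defined in this file.

import torch

def batch_norm_sketch(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                      eps: float = 1e-5) -> torch.Tensor:
    mean = x.mean(dim=0)                        # E[x^(k)] estimated from the mini-batch
    var = x.var(dim=0, unbiased=False)          # Var[x^(k)] estimated from the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize each feature independently
    return gamma * x_hat + beta                 # scale and shift: y^(k) = gamma^(k) x_hat^(k) + beta^(k)

The final scale-and-shift line is what lets the layer recover its representation capacity, which is the concern raised in the paragraph above.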
@@ -126,8 +126,8 @@ where $y^{(k)}$ is the output of the batch normalization layer.</p>
 like $Wu + b$ the bias parameter $b$ gets cancelled due to normalization.
 So you can and should omit bias parameter in linear transforms right before the
 batch normalization.</p>
-<p>Batch normalization also makes the back propagation invariant to the scale of the weights.
-And empirically it improves generalization, so it has regularization effects too.</p>
+<p>Batch normalization also makes the back propagation invariant to the scale of the weights
+and empirically it improves generalization, so it has regularization effects too.</p>
 <h2>Inference</h2>
 <p>We need to know $\mathbb{E}[x^{(k)}]$ and $Var[x^{(k)}]$ in order to
 perform the normalization.
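A small usage sketch of the "omit the bias" remark above, using PyTorch's built-in layers purely for illustration (not the BatchNorm module defined in this file): any per-feature constant added by the linear transform is removed by the mean subtraction, so the linear layer's bias can be dropped.

import torch.nn as nn

block = nn.Sequential(
    nn.Linear(784, 256, bias=False),  # bias would be cancelled by the normalization anyway
    nn.BatchNorm1d(256),              # normalizes, then applies its own learned shift
    nn.ReLU(),
)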
@@ -136,7 +136,7 @@ and find the mean and variance, or you can use an estimate calculated during tra
 The usual practice is to calculate an exponential moving average of
 mean and variance during the training phase and use that for inference.</p>
 <p>Here’s <a href="mnist.html">the training code</a> and a notebook for training
-a CNN classifier that use batch normalization for MNIST dataset.</p>
+a CNN classifier that uses batch normalization for MNIST dataset.</p>
 <p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
 <a href="https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
 </div>
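The exponential moving averages mentioned above can be kept roughly as in the sketch below; the names momentum, exp_mean and exp_var are assumptions for illustration, not necessarily the attributes used by the module in this file.

import torch

channels = 64
momentum = 0.1                        # assumed smoothing factor
exp_mean = torch.zeros(channels)      # running estimate of E[x^(k)]
exp_var = torch.ones(channels)        # running estimate of Var[x^(k)]

def track(batch_mean: torch.Tensor, batch_var: torch.Tensor):
    # called once per training batch; at inference these estimates replace
    # the batch statistics in the normalization
    exp_mean.mul_(1 - momentum).add_(momentum * batch_mean)
    exp_var.mul_(1 - momentum).add_(momentum * batch_var)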
@@ -251,7 +251,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$</p>
 <a href='#section-6'>#</a>
 </div>
 <p><code>x</code> is a tensor of shape <code>[batch_size, channels, *]</code>.
-<code>*</code> could be any number of (even 0) dimensions.
+<code>*</code> denotes any number of (possibly 0) dimensions.
 For example, in an image (2D) convolution this will be
 <code>[batch_size, channels, height, width]</code></p>
 </div>
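As a hedged sketch of how per-channel statistics can be computed over such a [batch_size, channels, *] tensor (the tensor and its shape here are made up for illustration): flatten the trailing dimensions and reduce over everything except the channel dimension.

import torch

x = torch.randn(32, 16, 28, 28)               # e.g. [batch_size, channels, height, width]
batch_size, channels = x.shape[0], x.shape[1]
x_flat = x.view(batch_size, channels, -1)     # [batch_size, channels, height * width]
mean = x_flat.mean(dim=[0, 2])                # one E[x^(k)] per channel
var = x_flat.var(dim=[0, 2], unbiased=False)  # one Var[x^(k)] per channel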
@@ -286,7 +286,7 @@ mean $\mathbb{E}[x^{(k)}]$ and variance $Var[x^{(k)}]$</p>
 <div class='section-link'>
 <a href='#section-9'>#</a>
 </div>
-<p>Sanity check to make sure the number of features is same</p>
+<p>Sanity check to make sure the number of features is the same</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">174</span> <span class="k">assert</span> <span class="bp">self</span><span class="o">.</span><span class="n">channels</span> <span class="o">==</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span></pre></div>