Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git (synced 2025-11-02 21:40:15 +08:00)
typos in readmes
@@ -11,13 +11,13 @@ network parameters during training.
For example, let's say there are two layers $l_1$ and $l_2$.
During the beginning of the training $l_1$ outputs (inputs to $l_2$)
could be in distribution $\mathcal{N}(0.5, 1)$.
-Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$.
+Then, after some training steps, it could move to $\mathcal{N}(0.6, 1.5)$.
This is *internal covariate shift*.

Internal covariate shift will adversely affect training speed because the later layers
-($l_2$ in the above example) has to adapt to this shifted distribution.
+($l_2$ in the above example) have to adapt to this shifted distribution.

-By stabilizing the distribution batch normalization minimizes the internal covariate shift.
+By stabilizing the distribution, batch normalization minimizes the internal covariate shift.

## Normalization

@@ -30,10 +30,10 @@ and be uncorrelated.
Normalizing outside the gradient computation using pre-computed (detached)
means and variances doesn't work. For instance. (ignoring variance), let
$$\hat{x} = x - \mathbb{E}[x]$$
-where $x = u + b$ and $b$ is a trained bias.
-and $\mathbb{E}[x]$ is outside gradient computation (pre-computed constant).
+where $x = u + b$ and $b$ is a trained bias
+and $\mathbb{E}[x]$ is an outside gradient computation (pre-computed constant).

-Note that $\hat{x}$ has no effect of $b$.
+Note that $\hat{x}$ has no effect on $b$.
Therefore,
$b$ will increase or decrease based
$\frac{\partial{\mathcal{L}}}{\partial x}$,
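To see why the detached statistics cause this, here is a tiny PyTorch sketch (illustrative only, with made-up names, a toy loss, and an arbitrary learning rate of 0.1; it is not code from the paper or this repo). The normalized output $\hat{x}$ never changes with $b$, yet $b$ keeps receiving the full gradient and drifts at every step.

```python
import torch

# Toy setup: x = u + b, normalized with a *detached* mini-batch mean.
u = torch.randn(256)                      # fixed inputs from the previous layer
b = torch.zeros(1, requires_grad=True)    # trained bias

for step in range(3):
    x = u + b
    x_hat = x - x.mean().detach()         # mean treated as a pre-computed constant
    loss = x_hat.sum()                    # stand-in loss; only its gradient matters
    loss.backward()
    with torch.no_grad():
        b -= 0.1 * b.grad                 # plain SGD step
        b.grad.zero_()
    # x_hat equals u - mean(u) no matter what b is, so the loss stays flat,
    # yet b still gets the full d(loss)/dx gradient and keeps moving.
    print(step, loss.item(), b.item())
```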
@@ -45,7 +45,7 @@ The paper notes that similar explosions happen with variances.
Whitening is computationally expensive because you need to de-correlate and
the gradients must flow through the full whitening calculation.

-The paper introduces simplified version which they call *Batch Normalization*.
+The paper introduces a simplified version which they call *Batch Normalization*.
First simplification is that it normalizes each feature independently to have
zero mean and unit variance:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$$
@@ -53,7 +53,7 @@ where $x = (x^{(1)} ... x^{(d)})$ is the $d$-dimensional input.

The second simplification is to use estimates of mean $\mathbb{E}[x^{(k)}]$
and variance $Var[x^{(k)}]$ from the mini-batch
-for normalization; instead of calculating the mean and variance across whole dataset.
+for normalization; instead of calculating the mean and variance across the whole dataset.

Normalizing each feature to zero mean and unit variance could affect what the layer
can represent.
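To make the two simplifications concrete, here is a minimal training-mode sketch (an illustration with made-up names, not the module implemented in this repo). It also applies the learned scale $\gamma$ and shift $\beta$ that the paper introduces so the layer keeps its representational power.

```python
import torch

def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                     eps: float = 1e-5) -> torch.Tensor:
    """Training-mode batch normalization for a [batch, features] tensor.

    Each feature is normalized independently using mini-batch statistics,
    then scaled and shifted by the learned gamma and beta.
    """
    mean = x.mean(dim=0)                      # E[x^(k)] over the mini-batch
    var = x.var(dim=0, unbiased=False)        # Var[x^(k)] over the mini-batch
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 8)                        # mini-batch of 32 samples, 8 features
gamma, beta = torch.ones(8), torch.zeros(8)
y = batch_norm_train(x, gamma, beta)
print(y.mean(dim=0), y.var(dim=0, unbiased=False))  # ~0 and ~1 per feature
```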
@@ -69,8 +69,8 @@ like $Wu + b$ the bias parameter $b$ gets cancelled due to normalization.
So you can and should omit bias parameter in linear transforms right before the
batch normalization.

-Batch normalization also makes the back propagation invariant to the scale of the weights.
-And empirically it improves generalization, so it has regularization effects too.
+Batch normalization also makes the back propagation invariant to the scale of the weights
+and empirically it improves generalization, so it has regularization effects too.

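For example (an illustrative snippet, not taken from this repo's code), the linear layer feeding into batch normalization can simply drop its bias, since the normalization's own shift parameter $\beta$ plays that role:

```python
import torch.nn as nn

# The preceding linear layer's bias would be cancelled by the mean subtraction,
# so it is omitted; BatchNorm1d's learned beta provides the shift instead.
block = nn.Sequential(
    nn.Linear(128, 256, bias=False),
    nn.BatchNorm1d(256),
    nn.ReLU(),
)
```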
## Inference

@@ -81,8 +81,8 @@ and find the mean and variance, or you can use an estimate calculated during training.
The usual practice is to calculate an exponential moving average of
mean and variance during the training phase and use that for inference.

-Here's [the training code](https://nn.labml.ai/normalization/layer_norm/mnist.html) and a notebook for training
-a CNN classifier that use batch normalization for MNIST dataset.
+Here's [the training code](mnist.html) and a notebook for training
+a CNN classifier that uses batch normalization for MNIST dataset.

[](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/normalization/batch_norm/mnist.ipynb)
[](https://web.lab-ml.com/run?uuid=011254fe647011ebbb8e0242ac1c0002)

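A rough sketch of how such running estimates are usually maintained and then used at inference (made-up names and a plain-tensor interface, not this repo's module):

```python
import torch

def update_running_stats(x, running_mean, running_var, momentum=0.1):
    # Exponential moving average of the mini-batch statistics,
    # updated in place during training.
    running_mean.mul_(1 - momentum).add_(momentum * x.mean(dim=0))
    running_var.mul_(1 - momentum).add_(momentum * x.var(dim=0, unbiased=False))

def batch_norm_eval(x, running_mean, running_var, gamma, beta, eps=1e-5):
    # At inference the stored estimates replace the batch statistics.
    x_hat = (x - running_mean) / torch.sqrt(running_var + eps)
    return gamma * x_hat + beta
```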
@@ -13,8 +13,9 @@ Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.

The Switch Transformer uses different parameters for each token by switching among parameters
-based on the token. Thererfore, only a fraction of parameters are chosen for each token. So you
-can have more parameters but less computational cost.
+based on the token.
+Therefore, only a fraction of parameters are chosen for each token.
+So you can have more parameters but less computational cost.

The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
Position-wise feedforward network consists of two sequentially fully connected layers.

@@ -5,17 +5,18 @@ This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.

-The Switch Transformer uses different parameters for each token by switching among parameters,
-based on the token. So only a fraction of parameters is chosen for each token, so you
-can have more parameters but less computational cost.
+The Switch Transformer uses different parameters for each token by switching among parameters
+based on the token.
+Therefore, only a fraction of parameters are chosen for each token.
+So you can have more parameters but less computational cost.

The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
-Position-wise feedforward network is a two sequentially fully connected layers.
-In switch transformer we have multiple FFNs (multiple experts),
+Position-wise feedforward network consists of two sequentially fully connected layers.
+In switch transformer we have multiple FFNs (multiple experts),
and we chose which one to use based on a router.
-The outputs a set of probabilities for picking a FFN,
-and we pick the one with the highest probability and only evaluates that.
-So essentially the computational cost is same as having a single FFN.
+The output is a set of probabilities for picking a FFN,
+and we pick the one with the highest probability and only evaluate that.
+So essentially the computational cost is the same as having a single FFN.
In our implementation this doesn't parallelize well when you have many or large FFNs since it's all
happening on a single GPU.
In a distributed setup you would have each FFN (each very large) on a different device.

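The routing described above can be sketched roughly as follows (invented class and parameter names, not the repo's actual module, and without the capacity limits and load-balancing loss the paper adds):

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Toy switch layer: a router picks one expert FFN per token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model] (batch and sequence flattened together)
        probs = torch.softmax(self.router(x), dim=-1)   # routing probabilities
        chosen = probs.argmax(dim=-1)                   # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = chosen == i
            if mask.any():
                # Only the selected expert runs for these tokens; its output is
                # scaled by the routing probability so gradients reach the router.
                out[mask] = probs[mask, i].unsqueeze(-1) * expert(x[mask])
        return out

ffn = SwitchFFN(d_model=64, d_ff=256, n_experts=4)
y = ffn(torch.randn(10, 64))   # each token is processed by exactly one expert
```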
@@ -9,7 +9,7 @@ equal to the length of the sequence trained in parallel.
All these positions have a fixed positional encoding.
Transformer XL increases this attention span by letting
each of the positions pay attention to precalculated past embeddings.
-For instance if the context length is $l$ it will keep the embeddings of
+For instance if the context length is $l$, it will keep the embeddings of
all layers for previous batch of length $l$ and feed them to current step.
If we use fixed-positional encodings these pre-calculated embeddings will have
the same positions as the current context.

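As a rough illustration of that memory mechanism (simplified, with invented names, single-headed, and setting aside the positional-encoding issue raised above), the current segment attends over the stored embeddings of the previous segment concatenated with its own:

```python
import torch

def attend_with_memory(h: torch.Tensor, mem: torch.Tensor) -> torch.Tensor:
    """h: current segment embeddings [l, d]; mem: embeddings saved from the
    previous segment [l, d], detached so gradients don't flow into the past."""
    context = torch.cat([mem.detach(), h], dim=0)    # keys/values span 2*l positions
    # A real implementation projects to queries/keys/values per head and uses
    # relative positional encodings; this only shows the extended attention span.
    scores = h @ context.t() / h.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ context

l, d = 16, 64
mem = torch.randn(l, d)            # kept from the previous batch
h = torch.randn(l, d)              # current segment
out = attend_with_memory(h, mem)   # each position attends over 2*l positions
```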