mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
✍️ english
@@ -73,7 +73,7 @@
 <a href='#section-0'>#</a>
 </div>
 <h1>GPT</h1>
-<p>This is an tutorial/implementation of
+<p>This is a tutorial/implementation of
 <a href="https://openai.com/blog/better-language-models/">OpenAI GPT architecture</a>
 in <a href="https://pytorch.org">PyTorch</a>.
 We got a bunch of implementation details from
@@ -85,7 +85,7 @@ GPT-2 and especially GPT-3 models are quite large and won’t fit on a
 single GPU and will need model parallelism.
 This implementation doesn’t even use data parallelism and is intended to be
 more of a tutorial.</p>
-<p>Main differences of this to a standard autoregressive transformer
+<p>Main differences of this compared to a simple autoregressive transformer
 are the parameter initialization, weight decay, and learning rate schedule.
 For the transformer we reuse the
 <a href="../transformers/index.html">existing labml/nn transformer implementation</a>.</p>
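
The parameter initialization mentioned in this hunk is typically the GPT-2 style scheme. A rough sketch of what that looks like in PyTorch; the helper name and the 0.02 standard deviation follow the GPT-2 convention and are not copied from the repo:

import torch.nn as nn

def init_gpt_weights(module: nn.Module):
    # GPT-2 style initialization: N(0, 0.02) for linear and embedding weights,
    # zero biases, and the usual (1, 0) affine parameters for LayerNorm.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Hypothetical usage: model.apply(init_gpt_weights)
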
@@ -516,7 +516,7 @@ This applies weight decay only to weights of linear layers.</p>
 <a href='#section-34'>#</a>
 </div>
 <p>Create a <a href="../optimizers/configs.html#OptimizerConfigs">configurable optimizer</a>,
-so that we can change these simple by passing
+so that we can change these simply by passing
 a config dictionary.</p>
 </div>
 <div class='code'>
@@ -528,7 +528,7 @@ a config dictionary.</p>
 <div class='section-link'>
 <a href='#section-35'>#</a>
 </div>
-<p>Set parameter groups for optimization</p>
+<p>Set parameter groups for optimization.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">196</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">parameters</span> <span class="o">=</span> <span class="n">opt_groups</span></pre></div>
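
The `opt_groups` assigned above is a list of parameter groups, so that weight decay hits only linear-layer weights as the surrounding section explains. A minimal sketch of that split in plain PyTorch; the function name is illustrative and parameter sharing is ignored for brevity:

import torch.nn as nn

def make_param_groups(model: nn.Module, weight_decay: float):
    # Decay only the weights of linear layers; biases, LayerNorm and
    # embedding parameters go into a no-decay group.
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, nn.Linear) and name == 'weight':
                decay.append(param)
            else:
                no_decay.append(param)
    return [{'params': decay, 'weight_decay': weight_decay},
            {'params': no_decay, 'weight_decay': 0.0}]

# Hypothetical usage:
# optimizer = torch.optim.AdamW(make_param_groups(model, 0.1), lr=6e-4)
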
@@ -539,8 +539,8 @@ a config dictionary.</p>
 <div class='section-link'>
 <a href='#section-36'>#</a>
 </div>
-<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>
-This is what GPT uses</p>
+<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>.
+This is what GPT uses.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">199</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="s1">'AdamWarmupCosineDecay'</span></pre></div>
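
The linked `AdamWarmupCosineDecay` pairs Adam with a learning rate that warms up linearly and then follows a cosine decay. A rough sketch of such a schedule as a multiplier for `torch.optim.lr_scheduler.LambdaLR`; the warmup and total step counts are made-up placeholders:

import math

def warmup_cosine(step: int, warmup: int = 2_000, total: int = 100_000) -> float:
    # Learning-rate multiplier: linear warmup, then cosine decay towards zero.
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical usage:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# with scheduler.step() called after every optimizer step.
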
@@ -552,7 +552,7 @@ This is what GPT uses</p>
 <a href='#section-37'>#</a>
 </div>
 <p>Set model embedding size, required if we use <a href="../optimizers/noam.html">Noam optimizer</a>
-which has an exponential decay</p>
+which has an exponential decay.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">202</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">d_model</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">d_model</span></pre></div>
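
For context, the Noam schedule scales the learning rate by $d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})$, which is why the optimizer needs the embedding size. A tiny sketch; the warmup value is a placeholder:

def noam_lr(step: int, d_model: int, warmup: int = 4_000) -> float:
    # Rate from "Attention Is All You Need":
    # d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(1, step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
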
@@ -564,7 +564,7 @@ which has an exponential decay</p>
 <a href='#section-38'>#</a>
 </div>
 <p>Set default weight decay.
-This is not required since we set the weight decay in the parameter groups</p>
+This is not required since we set the weight decay in the parameter groups.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">205</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decay</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">weight_decay</span></pre></div>
@@ -575,7 +575,7 @@ This is not required since we set the weight decay in the parameter groups</p>
 <div class='section-link'>
 <a href='#section-39'>#</a>
 </div>
-<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$</p>
+<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">207</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">6e-4</span></pre></div>
@@ -608,7 +608,7 @@ This is not required since we set the weight decay in the parameter groups</p>
 <div class='section-link'>
 <a href='#section-42'>#</a>
 </div>
-<p>Weight decay decoupled from gradients</p>
+<p>Weight decay is decoupled from gradients</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">213</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decouple</span> <span class="o">=</span> <span class="kc">True</span></pre></div>
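
"Decoupled" here means the decay shrinks the weights directly rather than being added to the gradient before the adaptive Adam step, in the spirit of AdamW (Loshchilov & Hutter, "Decoupled Weight Decay Regularization"). A schematic sketch of just the decay term, not the repo's optimizer code:

import torch

@torch.no_grad()
def apply_decoupled_weight_decay(params, lr: float, weight_decay: float):
    # Shrink each parameter directly; the gradient (and therefore Adam's
    # moment estimates) never sees the decay term.
    for p in params:
        p.mul_(1.0 - lr * weight_decay)

# Called once per step, alongside the usual Adam update of the decay group.
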
@@ -176,14 +176,14 @@
 <p><a id="TransformerLayer"></p>
 <h2>Transformer Layer</h2>
 <p></a></p>
-<p>This can act as a encoder layer or a decoder layer.</p>
+<p>This can act as an encoder layer or a decoder layer.</p>
 <p>🗒 Some implementations, including the paper seem to have differences
 in where the layer-normalization is done.
 Here we do a layer normalization before attention and feed-forward networks,
 and add the original residual vectors.
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
-We found a detailed discussion about this in paper
+We found a detailed discussion about this in the paper
 <a href="https://arxiv.org/abs/2002.04745">On Layer Normalization in the Transformer Architecture</a>.</p>
 </div>
 <div class='code'>
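
A minimal sketch of the pre-norm ordering described above: normalize, run the sub-layer, then add back the original residual. `self_attn` and `feed_forward` are placeholder modules with assumed call signatures, not the repo's classes:

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm ordering: LayerNorm -> sub-layer -> residual add.
    def __init__(self, d_model: int, self_attn: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        z = self.norm_attn(x)
        x = x + self.self_attn(z, z, z, mask)   # residual around attention
        z = self.norm_ff(x)
        x = x + self.feed_forward(z)            # residual around the FFN
        return x
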
@@ -646,7 +646,7 @@ Initialize parameters with Glorot / fan_avg.</p>
 <div class='section-link'>
 <a href='#section-44'>#</a>
 </div>
-<p>Runs the source through encoder</p>
+<p>Run the source through encoder</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">224</span> <span class="n">enc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">)</span></pre></div>

@@ -164,8 +164,8 @@ write the <code>get_scores</code> method.</p>
 <div class='section-link'>
 <a href='#section-6'>#</a>
 </div>
-<p>The linear transformations doesn’t need a bias since we take care of it when
-calculating scores.
+<p>The linear transformations do not need a bias since we
+explicitly include it when calculating scores.
 However having a bias for <code>value</code> might make sense.</p>
 </div>
 <div class='code'>
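
A small sketch of the bias choice discussed above: the query/key/value projections are built without a bias because the extra terms added when computing the relative-attention scores already play that role. Dimensions and names here are illustrative:

import torch.nn as nn

d_model, heads = 512, 8
d_k = d_model // heads

# Bias-free projections for query, key and value.
query = nn.Linear(d_model, heads * d_k, bias=False)
key = nn.Linear(d_model, heads * d_k, bias=False)
# A bias on the value projection could still be reasonable, per the note above.
value = nn.Linear(d_model, heads * d_k, bias=False)
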
@@ -7,7 +7,7 @@ summary: >
 
 # GPT
 
-This is an tutorial/implementation of
+This is a tutorial/implementation of
 [OpenAI GPT architecture](https://openai.com/blog/better-language-models/)
 in [PyTorch](https://pytorch.org).
 We got a bunch of implementation details from
@@ -21,7 +21,7 @@ single GPU and will need model parallelism.
 This implementation doesn't even use data parallelism and is intended to be
 more of a tutorial.
 
-Main differences of this to a standard autoregressive transformer
+Main differences of this compared to a simple autoregressive transformer
 are the parameter initialization, weight decay, and learning rate schedule.
 For the transformer we reuse the
 [existing labml/nn transformer implementation](../transformers/index.html).
@@ -188,28 +188,28 @@ def transformer_optimizer(c: NLPAutoRegressionConfigs):
 ]
 
 # Create a [configurable optimizer](../optimizers/configs.html#OptimizerConfigs),
-# so that we can change these simple by passing
+# so that we can change these simply by passing
 # a config dictionary.
 optimizer = OptimizerConfigs()
 
-# Set parameter groups for optimization
+# Set parameter groups for optimization.
 optimizer.parameters = opt_groups
-# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html)
-# This is what GPT uses
+# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html).
+# This is what GPT uses.
 optimizer.optimizer = 'AdamWarmupCosineDecay'
 # Set model embedding size, required if we use [Noam optimizer](../optimizers/noam.html)
-# which has an exponential decay
+# which has an exponential decay.
 optimizer.d_model = c.d_model
 # Set default weight decay.
-# This is not required since we set the weight decay in the parameter groups
+# This is not required since we set the weight decay in the parameter groups.
 optimizer.weight_decay = c.weight_decay
-# GPT uses a maximum learning rate of $6 \times 10^{-4}$
+# GPT uses a maximum learning rate of $6 \times 10^{-4}$.
 optimizer.learning_rate = 6e-4
 # $\beta_1 = 0.9, \beta_2 = 0.95$
 optimizer.betas = (0.9, 0.95)
 # $\epsilon = 10^{-8}$
 optimizer.eps = 1e-8
-# Weight decay decoupled from gradients
+# Weight decay is decoupled from gradients
 optimizer.weight_decouple = True
 # Total number of optimization steps for learning rate cosine decay
 optimizer.total_steps = c.epochs * len(c.text.train) // (c.batch_size * c.seq_len)
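
The last line above sizes the cosine decay: one optimization step consumes `batch_size * seq_len` characters of the training text, repeated for every epoch. A sketch of the same arithmetic with made-up numbers, using `torch.optim.AdamW` as a stand-in for the labml optimizer (its weight decay is already decoupled):

import torch
import torch.nn as nn

# Placeholder sizes, not the repo's configuration.
epochs, train_chars, batch_size, seq_len = 32, 1_000_000, 32, 128
model = nn.Linear(512, 512)  # stands in for the GPT model

# Hyper-parameters copied from the snippet above.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8)

# Total optimization steps for the learning rate cosine decay.
total_steps = epochs * train_chars // (batch_size * seq_len)
print(total_steps)  # 32 * 1_000_000 // (32 * 128) = 7812
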
@@ -62,7 +62,7 @@ class TransformerLayer(Module):
 ## Transformer Layer
 </a>
 
-This can act as a encoder layer or a decoder layer.
+This can act as an encoder layer or a decoder layer.
 
 🗒 Some implementations, including the paper seem to have differences
 in where the layer-normalization is done.
@@ -70,7 +70,7 @@ class TransformerLayer(Module):
 and add the original residual vectors.
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
-We found a detailed discussion about this in paper
+We found a detailed discussion about this in the paper
 [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
 """
 
@@ -220,7 +220,7 @@ class EncoderDecoder(Module):
 nn.init.xavier_uniform_(p)
 
 def __call__(self, src: torch.Tensor, tgt: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor):
-# Runs the source through encoder
+# Run the source through encoder
 enc = self.encode(src, src_mask)
 # Run encodings and targets through decoder
 return self.decode(enc, src_mask, tgt, tgt_mask)

@@ -61,8 +61,8 @@ class RelativeMultiHeadAttention(MultiHeadAttention):
 """
 
 def __init__(self, heads: int, d_model: int, dropout_prob: float = 0.1):
-# The linear transformations doesn't need a bias since we take care of it when
-# calculating scores.
+# The linear transformations do not need a bias since we
+# explicitly include it when calculating scores.
 # However having a bias for `value` might make sense.
 super().__init__(heads, d_model, dropout_prob, bias=False)
 