✍️ english

Varuna Jayasiri
2021-02-01 07:33:01 +05:30
parent 5cd2b8701b
commit 88d0f89ef5
6 changed files with 30 additions and 30 deletions

View File

@@ -73,7 +73,7 @@
<a href='#section-0'>#</a>
</div>
<h1>GPT</h1>
<p>This is an tutorial/implementation of
<p>This is a tutorial/implementation of
<a href="https://openai.com/blog/better-language-models/">OpenAI GPT architecture</a>
in <a href="https://pytorch.org">PyTorch</a>.
We got a bunch of implementation details from
@@ -85,7 +85,7 @@ GPT-2 and especially GPT-3 models are quite large and won&rsquo;t fit on a
single GPU and will need model parallelism.
This implementation doesn&rsquo;t even use data parallelism and is intended to be
more of a tutorial.</p>
<p>Main differences of this to a standard autoregressive transformer
<p>Main differences of this compared to a simple autoregressive transformer
are the parameter initialization, weight decay, and learning rate schedule.
For the transformer we reuse the
<a href="../transformers/index.html">existing labml/nn transformer implementation</a>.</p>
@@ -516,7 +516,7 @@ This applies weight decay only to weights of linear layers.</p>
<a href='#section-34'>#</a>
</div>
<p>Create a <a href="../optimizers/configs.html#OptimizerConfigs">configurable optimizer</a>,
so that we can change these simple by passing
so that we can change these simply by passing
a config dictionary.</p>
</div>
<div class='code'>
@@ -528,7 +528,7 @@ a config dictionary.</p>
<div class='section-link'>
<a href='#section-35'>#</a>
</div>
<p>Set parameter groups for optimization</p>
<p>Set parameter groups for optimization.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">196</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">parameters</span> <span class="o">=</span> <span class="n">opt_groups</span></pre></div>
@@ -539,8 +539,8 @@ a config dictionary.</p>
<div class='section-link'>
<a href='#section-36'>#</a>
</div>
<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>
This is what GPT uses</p>
<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>.
This is what GPT uses.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">199</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="s1">&#39;AdamWarmupCosineDecay&#39;</span></pre></div>
@@ -552,7 +552,7 @@ This is what GPT uses</p>
<a href='#section-37'>#</a>
</div>
<p>Set model embedding size, required if we use <a href="../optimizers/noam.html">Noam optimizer</a>
which has an exponential decay</p>
which has an exponential decay.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">202</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">d_model</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">d_model</span></pre></div>
@@ -564,7 +564,7 @@ which has an exponential decay</p>
<a href='#section-38'>#</a>
</div>
<p>Set default weight decay.
This is not required since we set the weight decay in the parameter groups</p>
This is not required since we set the weight decay in the parameter groups.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">205</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decay</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">weight_decay</span></pre></div>
@@ -575,7 +575,7 @@ This is not required since we set the weight decay in the parameter groups</p>
<div class='section-link'>
<a href='#section-39'>#</a>
</div>
<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$</p>
<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">207</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">6e-4</span></pre></div>
@@ -608,7 +608,7 @@ This is not required since we set the weight decay in the parameter groups</p>
<div class='section-link'>
<a href='#section-42'>#</a>
</div>
<p>Weight decay decoupled from gradients</p>
<p>Weight decay is decoupled from gradients</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">213</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decouple</span> <span class="o">=</span> <span class="kc">True</span></pre></div>

View File

@@ -176,14 +176,14 @@
<p><a id="TransformerLayer"></p>
<h2>Transformer Layer</h2>
<p></a></p>
<p>This can act as a encoder layer or a decoder layer.</p>
<p>This can act as an encoder layer or a decoder layer.</p>
<p>🗒 Some implementations, including the paper seem to have differences
in where the layer-normalization is done.
Here we do a layer normalization before attention and feed-forward networks,
and add the original residual vectors.
Alternative is to do a layer normalization after adding the residuals.
But we found this to be less stable when training.
We found a detailed discussion about this in paper
We found a detailed discussion about this in the paper
<a href="https://arxiv.org/abs/2002.04745">On Layer Normalization in the Transformer Architecture</a>.</p>
</div>
<div class='code'>
@@ -646,7 +646,7 @@ Initialize parameters with Glorot / fan_avg.</p>
<div class='section-link'>
<a href='#section-44'>#</a>
</div>
<p>Runs the source through encoder</p>
<p>Run the source through encoder</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">224</span> <span class="n">enc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">)</span></pre></div>

View File

@@ -164,8 +164,8 @@ write the <code>get_scores</code> method.</p>
<div class='section-link'>
<a href='#section-6'>#</a>
</div>
<p>The linear transformations doesn&rsquo;t need a bias since we take care of it when
calculating scores.
<p>The linear transformations do not need a bias since we
explicitly include it when calculating scores.
However having a bias for <code>value</code> might make sense.</p>
</div>
<div class='code'>

View File

@@ -7,7 +7,7 @@ summary: >
# GPT
This is an tutorial/implementation of
This is a tutorial/implementation of
[OpenAI GPT architecture](https://openai.com/blog/better-language-models/)
in [PyTorch](https://pytorch.org).
We got a bunch of implementation details from
@@ -21,7 +21,7 @@ single GPU and will need model parallelism.
This implementation doesn't even use data parallelism and is intended to be
more of a tutorial.
Main differences of this to a standard autoregressive transformer
Main differences of this compared to a simple autoregressive transformer
are the parameter initialization, weight decay, and learning rate schedule.
For the transformer we reuse the
[existing labml/nn transformer implementation](../transformers/index.html).
@@ -188,28 +188,28 @@ def transformer_optimizer(c: NLPAutoRegressionConfigs):
]
# Create a [configurable optimizer](../optimizers/configs.html#OptimizerConfigs),
# so that we can change these simple by passing
# so that we can change these simply by passing
# a config dictionary.
optimizer = OptimizerConfigs()
# Set parameter groups for optimization
# Set parameter groups for optimization.
optimizer.parameters = opt_groups
# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html)
# This is what GPT uses
# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html).
# This is what GPT uses.
optimizer.optimizer = 'AdamWarmupCosineDecay'
# Set model embedding size, required if we use [Noam optimizer](../optimizers/noam.html)
# which has an exponential decay
# which has an exponential decay.
optimizer.d_model = c.d_model
# Set default weight decay.
# This is not required since we set the weight decay in the parameter groups
# This is not required since we set the weight decay in the parameter groups.
optimizer.weight_decay = c.weight_decay
# GPT uses a maximum learning rate of $6 \times 10^{-4}$
# GPT uses a maximum learning rate of $6 \times 10^{-4}$.
optimizer.learning_rate = 6e-4
# $\beta_1 = 0.9, \beta_2 = 0.95$
optimizer.betas = (0.9, 0.95)
# $\epsilon = 10^{-8}$
optimizer.eps = 1e-8
# Weight decay decoupled from gradients
# Weight decay is decoupled from gradients
optimizer.weight_decouple = True
# Total number of optimization steps for learning rate cosine decay
optimizer.total_steps = c.epochs * len(c.text.train) // (c.batch_size * c.seq_len)
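For context, a minimal sketch of the warmup-then-cosine learning-rate schedule configured above, written against a plain optimizer rather than AdamWarmupCosineDecay. The warmup length and the example config values are assumptions; the maximum learning rate and the total_steps formula follow the code above.

import math

def warmup_cosine_lr(step: int, warmup: int, total_steps: int, max_lr: float = 6e-4) -> float:
    # Linear warmup to max_lr, then cosine decay towards zero over the remaining steps
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

# Illustrative values; in the code above these come from the experiment configs
epochs, train_tokens, batch_size, seq_len = 32, 1_000_000, 64, 128
total_steps = epochs * train_tokens // (batch_size * seq_len)

# e.g. update the optimizer before each step:
# for group in optimizer.param_groups:
#     group['lr'] = warmup_cosine_lr(step, warmup=total_steps // 20, total_steps=total_steps)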

View File

@@ -62,7 +62,7 @@ class TransformerLayer(Module):
## Transformer Layer
</a>
This can act as a encoder layer or a decoder layer.
This can act as an encoder layer or a decoder layer.
🗒 Some implementations, including the paper seem to have differences
in where the layer-normalization is done.
@@ -70,7 +70,7 @@ class TransformerLayer(Module):
and add the original residual vectors.
Alternative is to do a layer normalization after adding the residuals.
But we found this to be less stable when training.
We found a detailed discussion about this in paper
We found a detailed discussion about this in the paper
[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
"""
@@ -220,7 +220,7 @@ class EncoderDecoder(Module):
nn.init.xavier_uniform_(p)
def __call__(self, src: torch.Tensor, tgt: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor):
# Runs the source through encoder
# Run the source through encoder
enc = self.encode(src, src_mask)
# Run encodings and targets through decoder
return self.decode(enc, src_mask, tgt, tgt_mask)
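For context, a minimal sketch of the pre-layer-norm arrangement described above: normalize first, run attention or the feed-forward network, then add back the original residual vector. It uses nn.MultiheadAttention for brevity, so the class and argument names are illustrative rather than the labml/nn ones, and source attention and masking details are omitted.

import torch
import torch.nn as nn

class PreNormTransformerLayer(nn.Module):
    """Pre-LN residual block: x + sublayer(norm(x)) for both attention and FFN."""
    def __init__(self, d_model: int, heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, dropout=dropout, batch_first=True)
        self.norm_ff = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        # Normalize before self-attention, then add the original residual
        z = self.norm_attn(x)
        attn_out, _ = self.attn(z, z, z, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        # Same pattern for the position-wise feed-forward network
        return x + self.dropout(self.ff(self.norm_ff(x)))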

View File

@@ -61,8 +61,8 @@ class RelativeMultiHeadAttention(MultiHeadAttention):
"""
def __init__(self, heads: int, d_model: int, dropout_prob: float = 0.1):
# The linear transformations doesn't need a bias since we take care of it when
# calculating scores.
# The linear transformations do not need a bias since we
# explicitly include it when calculating scores.
# However having a bias for `value` might make sense.
super().__init__(heads, d_model, dropout_prob, bias=False)
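For context, a minimal sketch of the bias-free projections mentioned above, in the spirit of Transformer-XL style relative attention: instead of a bias on the query and key projections, a learned per-head bias enters the score computation directly. The relative-position term is omitted for brevity, and the names and tensor shapes here are illustrative, not the ones used by labml/nn.

import torch
import torch.nn as nn

class BiasFreeAttentionProjections(nn.Module):
    """Query/key projections without bias; a learned bias is added when computing scores."""
    def __init__(self, heads: int, d_model: int):
        super().__init__()
        self.heads, self.d_k = heads, d_model // heads
        self.query = nn.Linear(d_model, heads * self.d_k, bias=False)
        self.key = nn.Linear(d_model, heads * self.d_k, bias=False)
        self.value = nn.Linear(d_model, heads * self.d_k, bias=True)  # a bias for value can still make sense
        # Learned per-head bias that plays the role of a query-projection bias inside the scores
        self.content_bias = nn.Parameter(torch.zeros(heads, self.d_k))

    def get_scores(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: [batch, seq, heads, d_k] -> scores: [batch, heads, query, key]
        return torch.einsum('bihd,bjhd->bhij', q + self.content_bias, k)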