mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
✍️ english
@@ -73,7 +73,7 @@
 <a href='#section-0'>#</a>
 </div>
 <h1>GPT</h1>
-<p>This is an tutorial/implementation of
+<p>This is a tutorial/implementation of
 <a href="https://openai.com/blog/better-language-models/">OpenAI GPT architecture</a>
 in <a href="https://pytorch.org">PyTorch</a>.
 We got a bunch of implementation details from
@@ -85,7 +85,7 @@ GPT-2 and especially GPT-3 models are quite large and won’t fit on a
 single GPU and will need model parallelism.
 This implementation doesn’t even use data parallelism and is intended to be
 more of a tutorial.</p>
-<p>Main differences of this to a standard autoregressive transformer
+<p>Main differences of this compared to a simple autoregressive transformer
 are the parameter initialization, weight decay, and learning rate schedule.
 For the transformer we reuse the
 <a href="../transformers/index.html">existing labml/nn transformer implementation</a>.</p>
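
The parameter initialization mentioned in this hunk is typically the GPT-2 style scheme. A rough sketch of what that looks like in PyTorch; the helper name and the 0.02 standard deviation follow the GPT-2 convention and are not copied from the repo:

import torch.nn as nn

def init_gpt_weights(module: nn.Module):
    # GPT-2 style initialization: N(0, 0.02) for linear and embedding weights,
    # zero biases, and the usual (1, 0) affine parameters for LayerNorm.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

# Hypothetical usage: model.apply(init_gpt_weights)
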
@@ -516,7 +516,7 @@ This applies weight decay only to weights of linear layers.</p>
 <a href='#section-34'>#</a>
 </div>
 <p>Create a <a href="../optimizers/configs.html#OptimizerConfigs">configurable optimizer</a>,
-so that we can change these simple by passing
+so that we can change these simply by passing
 a config dictionary.</p>
 </div>
 <div class='code'>
@@ -528,7 +528,7 @@ a config dictionary.</p>
 <div class='section-link'>
 <a href='#section-35'>#</a>
 </div>
-<p>Set parameter groups for optimization</p>
+<p>Set parameter groups for optimization.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">196</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">parameters</span> <span class="o">=</span> <span class="n">opt_groups</span></pre></div>
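
The `opt_groups` assigned above is a list of parameter groups, so that weight decay hits only linear-layer weights as the surrounding section explains. A minimal sketch of that split in plain PyTorch; the function name is illustrative and parameter sharing is ignored for brevity:

import torch.nn as nn

def make_param_groups(model: nn.Module, weight_decay: float):
    # Decay only the weights of linear layers; biases, LayerNorm and
    # embedding parameters go into a no-decay group.
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if isinstance(module, nn.Linear) and name == 'weight':
                decay.append(param)
            else:
                no_decay.append(param)
    return [{'params': decay, 'weight_decay': weight_decay},
            {'params': no_decay, 'weight_decay': 0.0}]

# Hypothetical usage:
# optimizer = torch.optim.AdamW(make_param_groups(model, 0.1), lr=6e-4)
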
@@ -539,8 +539,8 @@ a config dictionary.</p>
 <div class='section-link'>
 <a href='#section-36'>#</a>
 </div>
-<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>
-This is what GPT uses</p>
+<p>Use <a href="../optimizers/adam_warmup_cosine_decay.html">cosine decay optimizer</a>.
+This is what GPT uses.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">199</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="s1">'AdamWarmupCosineDecay'</span></pre></div>
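
The linked `AdamWarmupCosineDecay` pairs Adam with a learning rate that warms up linearly and then follows a cosine decay. A rough sketch of such a schedule as a multiplier for `torch.optim.lr_scheduler.LambdaLR`; the warmup and total step counts are made-up placeholders:

import math

def warmup_cosine(step: int, warmup: int = 2_000, total: int = 100_000) -> float:
    # Learning-rate multiplier: linear warmup, then cosine decay towards zero.
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Hypothetical usage:
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
# with scheduler.step() called after every optimizer step.
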
@@ -552,7 +552,7 @@ This is what GPT uses</p>
 <a href='#section-37'>#</a>
 </div>
 <p>Set model embedding size, required if we use <a href="../optimizers/noam.html">Noam optimizer</a>
-which has an exponential decay</p>
+which has an exponential decay.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">202</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">d_model</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">d_model</span></pre></div>
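
For context, the Noam schedule scales the learning rate by $d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})$, which is why the optimizer needs the embedding size. A tiny sketch; the warmup value is a placeholder:

def noam_lr(step: int, d_model: int, warmup: int = 4_000) -> float:
    # Rate from "Attention Is All You Need":
    # d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(1, step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
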
@@ -564,7 +564,7 @@ which has an exponential decay</p>
 <a href='#section-38'>#</a>
 </div>
 <p>Set default weight decay.
-This is not required since we set the weight decay in the parameter groups</p>
+This is not required since we set the weight decay in the parameter groups.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">205</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decay</span> <span class="o">=</span> <span class="n">c</span><span class="o">.</span><span class="n">weight_decay</span></pre></div>
@@ -575,7 +575,7 @@ This is not required since we set the weight decay in the parameter groups</p>
 <div class='section-link'>
 <a href='#section-39'>#</a>
 </div>
-<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$</p>
+<p>GPT uses a maximum learning rate of $6 \times 10^{-4}$.</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">207</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">6e-4</span></pre></div>
@@ -608,7 +608,7 @@ This is not required since we set the weight decay in the parameter groups</p>
 <div class='section-link'>
 <a href='#section-42'>#</a>
 </div>
-<p>Weight decay decoupled from gradients</p>
+<p>Weight decay is decoupled from gradients</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">213</span> <span class="n">optimizer</span><span class="o">.</span><span class="n">weight_decouple</span> <span class="o">=</span> <span class="kc">True</span></pre></div>
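
"Decoupled" here means the decay shrinks the weights directly rather than being added to the gradient before the adaptive Adam step, in the spirit of AdamW (Loshchilov & Hutter, "Decoupled Weight Decay Regularization"). A schematic sketch of just the decay term, not the repo's optimizer code:

import torch

@torch.no_grad()
def apply_decoupled_weight_decay(params, lr: float, weight_decay: float):
    # Shrink each parameter directly; the gradient (and therefore Adam's
    # moment estimates) never sees the decay term.
    for p in params:
        p.mul_(1.0 - lr * weight_decay)

# Called once per step, alongside the usual Adam update of the decay group.
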
@@ -176,14 +176,14 @@
 <p><a id="TransformerLayer"></p>
 <h2>Transformer Layer</h2>
 <p></a></p>
-<p>This can act as a encoder layer or a decoder layer.</p>
+<p>This can act as an encoder layer or a decoder layer.</p>
 <p>🗒 Some implementations, including the paper seem to have differences
 in where the layer-normalization is done.
 Here we do a layer normalization before attention and feed-forward networks,
 and add the original residual vectors.
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
-We found a detailed discussion about this in paper
+We found a detailed discussion about this in the paper
 <a href="https://arxiv.org/abs/2002.04745">On Layer Normalization in the Transformer Architecture</a>.</p>
 </div>
 <div class='code'>
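
A minimal sketch of the pre-norm ordering described above: normalize, run the sub-layer, then add back the original residual. `self_attn` and `feed_forward` are placeholder modules with assumed call signatures, not the repo's classes:

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm ordering: LayerNorm -> sub-layer -> residual add.
    def __init__(self, d_model: int, self_attn: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, mask: torch.Tensor):
        z = self.norm_attn(x)
        x = x + self.self_attn(z, z, z, mask)   # residual around attention
        z = self.norm_ff(x)
        x = x + self.feed_forward(z)            # residual around the FFN
        return x
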
@@ -646,7 +646,7 @@ Initialize parameters with Glorot / fan_avg.</p>
 <div class='section-link'>
 <a href='#section-44'>#</a>
 </div>
-<p>Runs the source through encoder</p>
+<p>Run the source through encoder</p>
 </div>
 <div class='code'>
 <div class="highlight"><pre><span class="lineno">224</span> <span class="n">enc</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">src</span><span class="p">,</span> <span class="n">src_mask</span><span class="p">)</span></pre></div>

@@ -164,8 +164,8 @@ write the <code>get_scores</code> method.</p>
 <div class='section-link'>
 <a href='#section-6'>#</a>
 </div>
-<p>The linear transformations doesn’t need a bias since we take care of it when
-calculating scores.
+<p>The linear transformations do not need a bias since we
+explicitly include it when calculating scores.
 However having a bias for <code>value</code> might make sense.</p>
 </div>
 <div class='code'>
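
A small sketch of the bias choice discussed above: the query/key/value projections are built without a bias because the extra terms added when computing the relative-attention scores already play that role. Dimensions and names here are illustrative:

import torch.nn as nn

d_model, heads = 512, 8
d_k = d_model // heads

# Bias-free projections for query, key and value.
query = nn.Linear(d_model, heads * d_k, bias=False)
key = nn.Linear(d_model, heads * d_k, bias=False)
# A bias on the value projection could still be reasonable, per the note above.
value = nn.Linear(d_model, heads * d_k, bias=False)
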
@@ -7,7 +7,7 @@ summary: >
 
 # GPT
 
-This is an tutorial/implementation of
+This is a tutorial/implementation of
 [OpenAI GPT architecture](https://openai.com/blog/better-language-models/)
 in [PyTorch](https://pytorch.org).
 We got a bunch of implementation details from
@@ -21,7 +21,7 @@ single GPU and will need model parallelism.
 This implementation doesn't even use data parallelism and is intended to be
 more of a tutorial.
 
-Main differences of this to a standard autoregressive transformer
+Main differences of this compared to a simple autoregressive transformer
 are the parameter initialization, weight decay, and learning rate schedule.
 For the transformer we reuse the
 [existing labml/nn transformer implementation](../transformers/index.html).
@@ -188,28 +188,28 @@ def transformer_optimizer(c: NLPAutoRegressionConfigs):
 ]
 
 # Create a [configurable optimizer](../optimizers/configs.html#OptimizerConfigs),
-# so that we can change these simple by passing
+# so that we can change these simply by passing
 # a config dictionary.
 optimizer = OptimizerConfigs()
 
-# Set parameter groups for optimization
+# Set parameter groups for optimization.
 optimizer.parameters = opt_groups
-# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html)
-# This is what GPT uses
+# Use [cosine decay optimizer](../optimizers/adam_warmup_cosine_decay.html).
+# This is what GPT uses.
 optimizer.optimizer = 'AdamWarmupCosineDecay'
 # Set model embedding size, required if we use [Noam optimizer](../optimizers/noam.html)
-# which has an exponential decay
+# which has an exponential decay.
 optimizer.d_model = c.d_model
 # Set default weight decay.
-# This is not required since we set the weight decay in the parameter groups
+# This is not required since we set the weight decay in the parameter groups.
 optimizer.weight_decay = c.weight_decay
-# GPT uses a maximum learning rate of $6 \times 10^{-4}$
+# GPT uses a maximum learning rate of $6 \times 10^{-4}$.
 optimizer.learning_rate = 6e-4
 # $\beta_1 = 0.9, \beta_2 = 0.95$
 optimizer.betas = (0.9, 0.95)
 # $\epsilon = 10^{-8}$
 optimizer.eps = 1e-8
-# Weight decay decoupled from gradients
+# Weight decay is decoupled from gradients
 optimizer.weight_decouple = True
 # Total number of optimization steps for learning rate cosine decay
 optimizer.total_steps = c.epochs * len(c.text.train) // (c.batch_size * c.seq_len)
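
The last line above sizes the cosine decay: one optimization step consumes `batch_size * seq_len` characters of the training text, repeated for every epoch. A sketch of the same arithmetic with made-up numbers, using `torch.optim.AdamW` as a stand-in for the labml optimizer (its weight decay is already decoupled):

import torch
import torch.nn as nn

# Placeholder sizes, not the repo's configuration.
epochs, train_chars, batch_size, seq_len = 32, 1_000_000, 32, 128
model = nn.Linear(512, 512)  # stands in for the GPT model

# Hyper-parameters copied from the snippet above.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8)

# Total optimization steps for the learning rate cosine decay.
total_steps = epochs * train_chars // (batch_size * seq_len)
print(total_steps)  # 32 * 1_000_000 // (32 * 128) = 7812
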
@@ -62,7 +62,7 @@ class TransformerLayer(Module):
 ## Transformer Layer
 </a>
 
-This can act as a encoder layer or a decoder layer.
+This can act as an encoder layer or a decoder layer.
 
 🗒 Some implementations, including the paper seem to have differences
 in where the layer-normalization is done.
@@ -70,7 +70,7 @@ class TransformerLayer(Module):
 and add the original residual vectors.
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
-We found a detailed discussion about this in paper
+We found a detailed discussion about this in the paper
 [On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
 """
 
@@ -220,7 +220,7 @@ class EncoderDecoder(Module):
 nn.init.xavier_uniform_(p)
 
 def __call__(self, src: torch.Tensor, tgt: torch.Tensor, src_mask: torch.Tensor, tgt_mask: torch.Tensor):
-# Runs the source through encoder
+# Run the source through encoder
 enc = self.encode(src, src_mask)
 # Run encodings and targets through decoder
 return self.decode(enc, src_mask, tgt, tgt_mask)

@@ -61,8 +61,8 @@ class RelativeMultiHeadAttention(MultiHeadAttention):
 """
 
 def __init__(self, heads: int, d_model: int, dropout_prob: float = 0.1):
-# The linear transformations doesn't need a bias since we take care of it when
-# calculating scores.
+# The linear transformations do not need a bias since we
+# explicitly include it when calculating scores.
 # However having a bias for `value` might make sense.
 super().__init__(heads, d_model, dropout_prob, bias=False)
 