mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
synced 2025-11-03 13:57:48 +08:00
paper links
@@ -10,7 +10,7 @@ summary: >

 This module contains [PyTorch](https://pytorch.org/)
 implementations and explanations of original transformer
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762),
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762),
 and derivatives and enhancements of it.

 * [Multi-head attention](mha.html)
@@ -34,34 +34,34 @@ This is an implementation of GPT-2 architecture.
 ## [GLU Variants](glu_variants/simple.html)

 This is an implementation of the paper
-[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+[GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).

 ## [kNN-LM](knn/index.html)

 This is an implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).

 ## [Feedback Transformer](feedback/index.html)

 This is an implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 ## [Switch Transformer](switch/index.html)

 This is a miniature implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training but we implement the concept of switching as described in the paper.

 ## [Fast Weights Transformer](fast_weights/index.html)

 This is an implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).

 ## [FNet: Mixing Tokens with Fourier Transforms](fnet/index.html)

 This is an implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 ## [Attention Free Transformer](aft/index.html)

@@ -71,7 +71,7 @@ This is an implementation of the paper
 ## [Masked Language Model](mlm/index.html)

 This is an implementation of Masked Language Model used for pre-training in paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## [MLP-Mixer: An all-MLP Architecture for Vision](mlp_mixer/index.html)

@@ -86,7 +86,7 @@ This is an implementation of the paper
 ## [Vision Transformer (ViT)](vit/index.html)

 This is an implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).
 """

 from .configs import TransformerConfigs
@@ -7,7 +7,7 @@ summary: >

 # Transformer Auto-Regression Experiment

-This trains a simple transformer introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+This trains a simple transformer introduced in [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
 on an NLP auto-regression task (with Tiny Shakespeare dataset).
 """

@@ -9,7 +9,7 @@ summary: >
 # Compressive Transformer

 This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
 in [PyTorch](https://pytorch.org).

 This is an extension of [Transformer XL](../xl/index.html) where past memories
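For orientation, the extension this docstring refers to compresses the oldest memories by a fixed rate instead of discarding them; a strided convolution is one of the compression functions the paper evaluates. The sketch below is illustrative only (the names `compress`, the sizes and the rate are made up) and is not the repository's implementation.

```python
import torch
import torch.nn as nn

# Compress the oldest memory slots by a rate of c before dropping them (a sketch).
d_model, c = 16, 4
compress = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)  # one possible compression function

old_memories = torch.randn(1, 32, d_model)                        # (batch, time, d_model)
compressed = compress(old_memories.transpose(1, 2)).transpose(1, 2)
print(compressed.shape)                                           # torch.Size([1, 8, 16]) -- 4x fewer slots
```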
@@ -1,7 +1,7 @@
 # [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)

 This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
 in [PyTorch](https://pytorch.org).

 This is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories
@@ -66,7 +66,7 @@ def _ffn_activation_gelu():

 $$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$

-It was introduced in paper [Gaussian Error Linear Units](https://arxiv.org/abs/1606.08415).
+It was introduced in paper [Gaussian Error Linear Units](https://papers.labml.ai/paper/1606.08415).
 """
 return nn.GELU()

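As a quick check of the formula quoted in this docstring, $x \Phi(x)$ can be computed directly from the Gaussian CDF and compared against `nn.GELU()`; this is a sanity-check sketch, not part of the repository's code.

```python
import torch

def gelu_by_hand(x: torch.Tensor) -> torch.Tensor:
    # x * Phi(x), where Phi is the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))

x = torch.linspace(-3.0, 3.0, steps=101)
print(torch.allclose(gelu_by_hand(x), torch.nn.functional.gelu(x), atol=1e-6))  # True
```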
@@ -86,7 +86,7 @@ def _feed_forward(c: FeedForwardConfigs):

 # ## GLU Variants
 # These are variants with gated hidden layers for the FFN
-# as introduced in paper [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+# as introduced in paper [GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).
 # We have omitted the bias terms as specified in the paper.

 # ### FFN with Gated Linear Units
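To make "gated hidden layers" concrete: a GLU-variant FFN computes two parallel projections of the input and multiplies one of them (passed through an activation) into the other, with no bias terms. The class below is a hedged sketch of the GEGLU variant; the names (`GatedFFN`, `w1`, `v`, `w2`) are illustrative and not the repository's configurable `FeedForward` module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """GEGLU-style FFN: (GELU(x W1) * x V) W2, with bias terms omitted as in the paper."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # activated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)) * self.v(x))

y = GatedFFN(64, 256)(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```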
@@ -9,7 +9,7 @@ summary: >
 # Fast weights transformer

 The paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174)
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174)
 finds similarities between linear self-attention and fast weight systems
 and makes modifications to self-attention update rule based on that.
 It also introduces a simpler, yet effective kernel function.
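The modified update rule the docstring mentions can be sketched as a delta-rule write into an outer-product memory: retrieve what is currently stored under a key, then write only the difference. The feature map `phi` below is a simple stand-in (the paper proposes its own DPFP kernel), and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # stand-in positive, normalized feature map; the paper introduces its own (DPFP) kernel
    f = F.elu(x) + 1.0
    return f / f.sum()

d = 16
W = torch.zeros(d, d)                       # fast-weight memory, one outer-product write per token
for _ in range(8):                           # one token per step
    k, v, q = torch.randn(d), torch.randn(d), torch.randn(d)
    beta = torch.rand(())                    # write strength (learned in the paper)
    v_old = W @ phi(k)                       # value currently stored under this key
    W = W + beta * torch.outer(v - v_old, phi(k))   # delta-rule update instead of a plain sum
    y = W @ phi(q)                           # read-out for the current query
```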
@@ -1,7 +1,7 @@
 # [Fast weights transformer](https://nn.labml.ai/transformers/fast_weights/index.html)

 This is an annotated implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).

 Here is the [annotated implementation](https://nn.labml.ai/transformers/fast_weights/index.html).
 Here are [the training code](https://nn.labml.ai/transformers/fast_weights/experiment.html)
@@ -28,7 +28,7 @@ $$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
 ### Gated Linear Units

 This is a generic implementation that supports different variants including
-[Gated Linear Units](https://arxiv.org/abs/2002.05202) (GLU).
+[Gated Linear Units](https://papers.labml.ai/paper/2002.05202) (GLU).
 We have also implemented experiments on these:

 * [experiment that uses `labml.configs`](glu_variants/experiment.html)
@@ -8,7 +8,7 @@ summary: >
 # Feedback Transformer

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 Normal transformers process tokens in parallel. Each transformer layer pays attention
 to the outputs of the previous layer.
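The contrast this docstring sets up can be sketched in a few lines: instead of layer l attending to layer l-1, every layer at step t reads the same shared memory, and that memory stores a mix of all layers' outputs from earlier steps, which forces tokens to be processed one step at a time. Everything below (the toy linear layers, the mean over memory standing in for attention) is illustrative, not the repository's model.

```python
import torch
import torch.nn as nn

d = 8
layers = [nn.Linear(2 * d, d) for _ in range(3)]              # toy stand-ins for transformer layers
mix = torch.softmax(torch.zeros(len(layers) + 1), dim=0)      # layer-mixing weights (learned in the paper)
memory = []                                                   # one shared vector per past step

for x_t in torch.randn(5, d):                                 # tokens are processed sequentially
    ctx = torch.stack(memory).mean(0) if memory else torch.zeros(d)  # crude stand-in for attention over memory
    h, outs = x_t, [x_t]
    for layer in layers:
        h = torch.tanh(layer(torch.cat([h, ctx])))            # every layer conditions on the same memory
        outs.append(h)
    memory.append(sum(w * o for w, o in zip(mix, outs)))      # the new memory entry mixes all layers' outputs
```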
@@ -1,7 +1,7 @@
 # [Feedback Transformer](https://nn.labml.ai/transformers/feedback/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 Normal transformers process tokens in parallel. Each transformer layer pays attention
 to the outputs of the previous layer.
@@ -8,7 +8,7 @@ summary: >
 # FNet: Mixing Tokens with Fourier Transforms

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 This paper replaces the [self-attention layer](../mha.html) with two
 [Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to
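The replacement described here is parameter-free: apply a DFT along the hidden dimension, another along the sequence dimension, and keep the real part; the feed-forward blocks, norms and residuals stay as in a normal transformer. A minimal sketch of the mixing step:

```python
import torch

def fnet_mix(x: torch.Tensor) -> torch.Tensor:
    """Token mixing as in the FNet paper: FFT over the hidden dim, then the sequence dim, keep the real part."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 10, 16)   # (batch, seq, d_model)
y = fnet_mix(x)              # same shape; no learned parameters in the mixing step
```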
@@ -1,7 +1,7 @@
 # [FNet: Mixing Tokens with Fourier Transforms](https://nn.labml.ai/transformers/fnet/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 This paper replaces the [self-attention layer](https://nn.labml.ai/transformers//mha.html) with two
 [Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to
@@ -12,7 +12,7 @@ summary: >
 # k-Nearest Neighbor Language Models

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).
 It uses k-nearest neighbors to improve perplexity of autoregressive transformer models.

 An autoregressive language model estimates $p(w_t | \color{yellowgreen}{c_t})$,
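At test time the method retrieves the k nearest stored contexts, turns their distances into a distribution over the tokens that followed them, and interpolates that with the model's own distribution, $p = \lambda \, p_{kNN} + (1 - \lambda) \, p_{LM}$. A hedged sketch (the interpolation and the softmax over negative distances follow the paper; the function name and shapes are illustrative):

```python
import torch

def knn_lm_distribution(p_lm: torch.Tensor, dists: torch.Tensor, targets: torch.Tensor,
                        vocab_size: int, lam: float = 0.5) -> torch.Tensor:
    """p(w) = lam * p_kNN(w) + (1 - lam) * p_LM(w); closer neighbours get more probability mass."""
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, targets, torch.softmax(-dists, dim=0))
    return lam * p_knn + (1.0 - lam) * p_lm

p_lm = torch.softmax(torch.randn(100), dim=0)   # model's next-token distribution
dists = torch.rand(8)                           # distances of the 8 retrieved contexts
targets = torch.randint(0, 100, (8,))           # the token that followed each retrieved context
p = knn_lm_distribution(p_lm, dists, targets, vocab_size=100)
```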
@@ -9,7 +9,7 @@ summary: >
 # Multi-Headed Attention (MHA)

 This is a tutorial/implementation of multi-headed attention
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
 in [PyTorch](https://pytorch.org/).
 The implementation is inspired from [Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html).

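The core computation behind multi-headed attention is scaled dot-product attention, with the heads living in an extra batch dimension. A compact reference sketch (not the annotated implementation itself):

```python
import math
import torch

def attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V -- the heart of every head."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 4, 10, 16)   # (batch, heads, seq, d_k): heads are just another batch dim
out = attention(q, k, v)                # (2, 4, 10, 16); heads are concatenated and re-projected afterwards
```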
@@ -9,7 +9,7 @@ summary: >

 This is a [PyTorch](https://pytorch.org) implementation of the Masked Language Model (MLM)
 used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## BERT Pretraining

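For context, the corruption step used in BERT's MLM pre-training picks roughly 15% of positions, replaces 80% of those with a `[MASK]` token and 10% with a random token, leaves the remaining 10% unchanged, and computes the loss only at the picked positions. A rough sketch (the token ids and the `-100` ignore-index are illustrative):

```python
import torch

vocab_size, mask_id = 1000, 3
tokens = torch.randint(5, vocab_size, (1, 32))
labels = tokens.clone()

picked = torch.rand(tokens.shape) < 0.15           # ~15% of positions are predicted
labels[~picked] = -100                             # everything else is ignored by cross-entropy

r = torch.rand(tokens.shape)
tokens[picked & (r < 0.8)] = mask_id               # 80% of picked positions become [MASK]
rand_pos = picked & (r >= 0.8) & (r < 0.9)         # 10% become a random token, the rest stay unchanged
tokens[rand_pos] = torch.randint(5, vocab_size, (int(rand_pos.sum()),))
```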
@@ -2,7 +2,7 @@

 This is a [PyTorch](https://pytorch.org) implementation of Masked Language Model (MLM)
 used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## BERT Pretraining

@@ -71,7 +71,7 @@ class TransformerLayer(Module):
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
 We found a detailed discussion about this in the paper
-[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
+[On Layer Normalization in the Transformer Architecture](https://papers.labml.ai/paper/2002.04745).
 """

 def __init__(self, *,
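To spell out the two orderings this docstring compares (`sublayer` below stands in for attention or the FFN; this is a sketch, not the `TransformerLayer` class above):

```python
import torch
import torch.nn as nn

norm, sublayer = nn.LayerNorm(16), nn.Linear(16, 16)   # sublayer stands in for attention or the FFN
x = torch.randn(2, 10, 16)

pre_norm = x + sublayer(norm(x))    # normalize inside the residual branch (what this implementation does)
post_norm = norm(x + sublayer(x))   # normalize after adding the residual (the less stable alternative)
```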
@@ -8,7 +8,7 @@ summary: >
 # Switch Transformer

 This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.

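The switching concept referred to here is top-1 routing: a small router picks exactly one expert FFN per token, and the expert's output is scaled by the router probability. A single-GPU sketch with illustrative names and sizes (no load-balancing loss or capacity limits):

```python
import torch
import torch.nn as nn

d_model, n_experts = 16, 4
experts = nn.ModuleList([nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
                         for _ in range(n_experts)])
router = nn.Linear(d_model, n_experts)

x = torch.randn(32, d_model)                       # a batch of tokens
probs = torch.softmax(router(x), dim=-1)
gate, expert_idx = probs.max(dim=-1)               # each token is routed to exactly one expert

y = torch.zeros_like(x)
for e in range(n_experts):
    sel = expert_idx == e
    if sel.any():
        y[sel] = gate[sel, None] * experts[e](x[sel])   # scale by the routing probability, as in the paper
```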
@@ -1,7 +1,7 @@
 # [Switch Transformer](https://nn.labml.ai/transformers/switch/index.html)

 This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.

@@ -9,7 +9,7 @@ summary: >
 # Vision Transformer (ViT)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).

 Vision transformer applies a pure transformer to images
 without any convolution layers.
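Concretely, the image is cut into fixed-size patches, each patch is flattened and linearly embedded, and the resulting token sequence (plus position embeddings and a classification token) is fed to a standard transformer encoder. A patch-embedding sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, height, width)
p = 16                                             # patch size

patches = img.unfold(2, p, p).unfold(3, p, p)      # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)   # (1, 196, 768)
embedded = nn.Linear(3 * p * p, 768)(tokens)       # linear patch embedding; position embeddings and a
                                                   # [CLS] token are added before the transformer encoder
```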
@@ -1,7 +1,7 @@
 # [Vision Transformer (ViT)](https://nn.labml.ai/transformer/vit/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).

 Vision transformer applies a pure transformer to images
 without any convolution layers.
@@ -9,7 +9,7 @@ summary: >
 # Transformer XL

 This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).

 Transformer has a limited attention span,
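Transformer XL extends that span with segment-level recurrence: hidden states from the previous segment are cached (with gradients stopped) and the current segment attends over the concatenation of memory and current states, using relative positional encodings. A stripped-down single-head sketch that ignores masking and the relative encodings:

```python
import torch

d, seg_len = 16, 8
memory = torch.zeros(0, d)                              # cached states from the previous segment

for segment in torch.randn(3, seg_len, d):              # three consecutive segments
    kv = torch.cat([memory.detach(), segment], dim=0)   # attend over [memory, current]
    scores = segment @ kv.T / d ** 0.5                  # the paper uses relative positions here
    out = torch.softmax(scores, dim=-1) @ kv
    memory = segment                                    # this segment becomes memory for the next one
```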
@@ -1,7 +1,7 @@
 # [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)

 This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).

 Transformer has a limited attention span,
@@ -9,7 +9,7 @@ summary: >
 # Relative Multi-Headed Attention

 This is an implementation of relative multi-headed attention from paper
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).
 """
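Written out without the efficiency tricks, the relative attention score from the Transformer-XL paper decomposes into four terms: content-content, content-position, a global content bias, and a global position bias. A sketch with the projections folded into `q`, `k`, `r` (names and sizes illustrative):

```python
import torch

n, d = 6, 16
q, k = torch.randn(n, d), torch.randn(n, d)   # projected queries and keys
r = torch.randn(2 * n - 1, d)                 # embeddings for relative offsets -(n-1) .. n-1
u, v = torch.randn(d), torch.randn(d)         # learned global biases from the paper

offset = torch.arange(n)[:, None] - torch.arange(n)[None, :] + (n - 1)   # index of offset i - j

# A_ij = q_i.k_j  +  q_i.r_{i-j}  +  u.k_j  +  v.r_{i-j}
scores = (q @ k.T
          + (q @ r.T)[torch.arange(n)[:, None], offset]
          + k @ u
          + (r @ v)[offset])
attn = torch.softmax(scores / d ** 0.5, dim=-1)
```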