paper links

Varuna Jayasiri
2021-08-17 14:12:33 +05:30
parent ff0d5c065d
commit 996b58be04
70 changed files with 92 additions and 92 deletions

View File

@@ -10,7 +10,7 @@ summary: >
This module contains [PyTorch](https://pytorch.org/)
implementations and explanations of original transformer
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762),
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762),
and derivatives and enhancements of it.
* [Multi-head attention](mha.html)
@@ -34,34 +34,34 @@ This is an implementation of GPT-2 architecture.
## [GLU Variants](glu_variants/simple.html)
This is an implementation of the paper
-[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+[GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).
## [kNN-LM](knn/index.html)
This is an implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).
## [Feedback Transformer](feedback/index.html)
This is an implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).
## [Switch Transformer](switch/index.html)
This is a miniature implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training but we implement the concept of switching as described in the paper.
## [Fast Weights Transformer](fast_weights/index.html)
This is an implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).
## [FNet: Mixing Tokens with Fourier Transforms](fnet/index.html)
This is an implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).
## [Attention Free Transformer](aft/index.html)
@@ -71,7 +71,7 @@ This is an implementation of the paper
## [Masked Language Model](mlm/index.html)
This is an implementation of Masked Language Model used for pre-training in paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).
## [MLP-Mixer: An all-MLP Architecture for Vision](mlp_mixer/index.html)
@@ -86,7 +86,7 @@ This is an implementation of the paper
## [Vision Transformer (ViT)](vit/index.html)
This is an implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).
"""
from .configs import TransformerConfigs

View File

@@ -7,7 +7,7 @@ summary: >
# Transformer Auto-Regression Experiment
-This trains a simple transformer introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+This trains a simple transformer introduced in [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
on an NLP auto-regression task (with Tiny Shakespeare dataset).
"""

View File

@@ -9,7 +9,7 @@ summary: >
# Compressive Transformer
This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
in [PyTorch](https://pytorch.org).
This is an extension of [Transformer XL](../xl/index.html) where past memories

View File

@@ -1,7 +1,7 @@
# [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
in [PyTorch](https://pytorch.org).
This is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories

View File

@@ -66,7 +66,7 @@ def _ffn_activation_gelu():
$$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
-It was introduced in paper [Gaussian Error Linear Units](https://arxiv.org/abs/1606.08415).
+It was introduced in paper [Gaussian Error Linear Units](https://papers.labml.ai/paper/1606.08415).
"""
return nn.GELU()
@@ -86,7 +86,7 @@ def _feed_forward(c: FeedForwardConfigs):
# ## GLU Variants
# These are variants with gated hidden layers for the FFN
-# as introduced in paper [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+# as introduced in paper [GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).
# We have omitted the bias terms as specified in the paper.
# ### FFN with Gated Linear Units
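
An aside for readers of this diff: below is a minimal sketch of the gated-FFN idea the comment above refers to, in the GEGLU style with bias-free linear layers as noted. The class name `GatedFFN` and the dimensions are made up for illustration; this is not the module the commit touches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """Illustrative GEGLU-style feed-forward block; bias-free, as noted above."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_ff, bias=False)  # candidate hidden values
        self.gate = nn.Linear(d_model, d_ff, bias=False)   # gating path
        self.proj = nn.Linear(d_ff, d_model, bias=False)   # back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN_GEGLU(x) = (GELU(x W_g) * (x W_v)) W_o
        return self.proj(F.gelu(self.gate(x)) * self.value(x))


x = torch.randn(2, 10, 512)
print(GatedFFN(512, 2048)(x).shape)  # torch.Size([2, 10, 512])
```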

View File

@@ -9,7 +9,7 @@ summary: >
# Fast weights transformer
The paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174)
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174)
finds similarities between linear self-attention and fast weight systems
and makes modifications to self-attention update rule based on that.
It also introduces a simpler, yet effective kernel function.
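
An aside: a toy sketch of the fast-weight reading of linear self-attention that the summary above alludes to, where each step writes an outer product of value and key into a matrix memory and reads it with the query. This is the plain, unnormalized view, not the paper's modified update rule or kernel function, and the function name and shapes are made up for illustration.

```python
import torch


def linear_attention_as_fast_weights(q, k, v):
    # q, k, v: [seq_len, d]. Write each step's (value, key) outer product into a
    # matrix memory W, then read that step's output with the query.
    seq_len, d = q.shape
    W = torch.zeros(d, d)                # the "fast weight" memory
    outputs = []
    for t in range(seq_len):
        W = W + torch.outer(v[t], k[t])  # write: W += v_t k_t^T
        outputs.append(W @ q[t])         # read:  y_t = W q_t
    return torch.stack(outputs)


q = k = v = torch.randn(5, 8)
print(linear_attention_as_fast_weights(q, k, v).shape)  # torch.Size([5, 8])
```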

View File

@@ -1,7 +1,7 @@
# [Fast weights transformer](https://nn.labml.ai/transformers/fast_weights/index.html)
This is an annotated implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).
Here is the [annotated implementation](https://nn.labml.ai/transformers/fast_weights/index.html).
Here are [the training code](https://nn.labml.ai/transformers/fast_weights/experiment.html)

View File

@@ -28,7 +28,7 @@ $$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
### Gated Linear Units
This is a generic implementation that supports different variants including
-[Gated Linear Units](https://arxiv.org/abs/2002.05202) (GLU).
+[Gated Linear Units](https://papers.labml.ai/paper/2002.05202) (GLU).
We have also implemented experiments on these:
* [experiment that uses `labml.configs`](glu_variants/experiment.html)

View File

@@ -8,7 +8,7 @@ summary: >
# Feedback Transformer
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).
Normal transformers process tokens in parallel. Each transformer layer pays attention
to the outputs of the previous layer.

View File

@@ -1,7 +1,7 @@
# [Feedback Transformer](https://nn.labml.ai/transformers/feedback/index.html)
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).
Normal transformers process tokens in parallel. Each transformer layer pays attention
to the outputs of the previous layer.

View File

@@ -8,7 +8,7 @@ summary: >
# FNet: Mixing Tokens with Fourier Transforms
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).
This paper replaces the [self-attention layer](../mha.html) with two
[Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to
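
An aside: a minimal sketch of the Fourier mixing step described above, one FFT over the hidden dimension and one over the sequence dimension, keeping the real part. The function name and tensor layout are illustrative assumptions, not the repository's module.

```python
import torch


def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    # x: [seq_len, batch, d_model]. FFT over the hidden dimension, then over the
    # sequence dimension, keeping only the real part of the result.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=0).real


x = torch.randn(16, 2, 64)
print(fourier_mix(x).shape)  # torch.Size([16, 2, 64])
```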

View File

@@ -1,7 +1,7 @@
# [FNet: Mixing Tokens with Fourier Transforms](https://nn.labml.ai/transformers/fnet/index.html)
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).
This paper replaces the [self-attention layer](https://nn.labml.ai/transformers//mha.html) with two
[Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to

View File

@@ -12,7 +12,7 @@ summary: >
# k-Nearest Neighbor Language Models
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).
It uses k-nearest neighbors to improve perplexity of autoregressive transformer models.
An autoregressive language model estimates $p(w_t | \color{yellowgreen}{c_t})$,
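
An aside: a minimal sketch of the interpolation at the heart of kNN-LM, blending the language model's next-token distribution with one derived from retrieved neighbours. Building the datastore and the neighbour distribution is omitted; the function name and the toy distributions are made up for illustration.

```python
import torch


def knn_lm_interpolate(p_lm: torch.Tensor, p_knn: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    # p(w | c) = lam * p_knn(w | c) + (1 - lam) * p_lm(w | c)
    # Both arguments are probability vectors over the vocabulary; lam is a
    # tunable interpolation weight.
    return lam * p_knn + (1.0 - lam) * p_lm


p_lm = torch.softmax(torch.randn(100), dim=-1)   # model probabilities
p_knn = torch.softmax(torch.randn(100), dim=-1)  # toy stand-in for the kNN distribution
print(knn_lm_interpolate(p_lm, p_knn).sum())     # ~1.0
```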

View File

@@ -9,7 +9,7 @@ summary: >
# Multi-Headed Attention (MHA)
This is a tutorial/implementation of multi-headed attention
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
in [PyTorch](https://pytorch.org/).
The implementation is inspired from [Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html).
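
An aside: for reference, the scaled dot-product attention each head computes, $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$, as a bare-bones sketch. Projections, masking conventions and dropout are left out, and the function name and shapes are illustrative, not the repository's module.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)       # [batch, heads, seq, seq]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(1, 8, 10, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 8, 10, 64])
```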

View File

@@ -9,7 +9,7 @@ summary: >
This is a [PyTorch](https://pytorch.org) implementation of the Masked Language Model (MLM)
used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).
## BERT Pretraining

View File

@@ -2,7 +2,7 @@
This is a [PyTorch](https://pytorch.org) implementation of Masked Language Model (MLM)
used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).
## BERT Pretraining

View File

@@ -71,7 +71,7 @@ class TransformerLayer(Module):
Alternative is to do a layer normalization after adding the residuals.
But we found this to be less stable when training.
We found a detailed discussion about this in the paper
-[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
+[On Layer Normalization in the Transformer Architecture](https://papers.labml.ai/paper/2002.04745).
"""
def __init__(self, *,
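
An aside: a minimal sketch contrasting the two residual arrangements discussed in the docstring above, pre-norm (normalize, apply the sub-layer, add the residual) versus post-norm (add the residual, then normalize the sum). The sub-layer here is a toy feed-forward block and the names are made up for illustration; this is not the `TransformerLayer` being diffed.

```python
import torch
import torch.nn as nn


def pre_norm_step(x, sublayer, norm):
    # Pre-norm (preferred in the docstring above): normalize, apply the
    # sub-layer, then add the residual.
    return x + sublayer(norm(x))


def post_norm_step(x, sublayer, norm):
    # Post-norm (the "alternative" mentioned above): add the residual first,
    # then normalize the sum.
    return norm(x + sublayer(x))


d_model = 512
ff = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(10, d_model)
print(pre_norm_step(x, ff, nn.LayerNorm(d_model)).shape)   # torch.Size([10, 512])
print(post_norm_step(x, ff, nn.LayerNorm(d_model)).shape)  # torch.Size([10, 512])
```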

View File

@@ -8,7 +8,7 @@ summary: >
# Switch Transformer
This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.
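
An aside: a toy sketch of the top-1 ("switch") routing concept mentioned above, where a softmax router picks a single expert per token and that expert's output is scaled by the router probability. Load balancing, capacity limits and distributed training are omitted, and the class name and sizes are made up for illustration; this is not the code the commit touches.

```python
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Illustrative top-1 expert routing for a feed-forward layer."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model]. Route each token to exactly one expert (top-1)
        # and scale that expert's output by the router probability.
        probs = torch.softmax(self.router(x), dim=-1)  # [tokens, n_experts]
        gate, expert_idx = probs.max(dim=-1)           # top-1 choice per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i
            if sel.any():
                out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
        return out


x = torch.randn(32, 128)
print(SwitchFFN(128, n_experts=4)(x).shape)  # torch.Size([32, 128])
```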

View File

@@ -1,7 +1,7 @@
# [Switch Transformer](https://nn.labml.ai/transformers/switch/index.html)
This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.

View File

@@ -9,7 +9,7 @@ summary: >
# Vision Transformer (ViT)
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).
Vision transformer applies a pure transformer to images
without any convolution layers.
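
An aside: a minimal sketch of the convolution-free patch embedding implied above, where the image is cut into non-overlapping patches that are flattened and linearly projected. The class token and position embeddings are omitted, and the function name and sizes are illustrative assumptions, not the repository's module.

```python
import torch
import torch.nn as nn


def image_to_patch_embeddings(images: torch.Tensor, patch_size: int, proj: nn.Linear) -> torch.Tensor:
    # images: [batch, channels, height, width] -> [batch, n_patches, d_model]
    b, c, h, w = images.shape
    p = patch_size
    # [b, c, h/p, p, w/p, p] -> [b, h/p, w/p, c, p, p] -> [b, n_patches, c*p*p]
    patches = (images
               .reshape(b, c, h // p, p, w // p, p)
               .permute(0, 2, 4, 1, 3, 5)
               .reshape(b, (h // p) * (w // p), c * p * p))
    return proj(patches)


proj = nn.Linear(3 * 16 * 16, 768)                    # flattened 16x16 RGB patch -> d_model
x = torch.randn(2, 3, 224, 224)
print(image_to_patch_embeddings(x, 16, proj).shape)   # torch.Size([2, 196, 768])
```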

View File

@@ -1,7 +1,7 @@
# [Vision Transformer (ViT)](https://nn.labml.ai/transformer/vit/index.html)
This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).
Vision transformer applies a pure transformer to images
without any convolution layers.

View File

@@ -9,7 +9,7 @@ summary: >
# Transformer XL
This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
in [PyTorch](https://pytorch.org).
Transformer has a limited attention span,

View File

@@ -1,7 +1,7 @@
# [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)
This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
in [PyTorch](https://pytorch.org).
Transformer has a limited attention span,

View File

@@ -9,7 +9,7 @@ summary: >
# Relative Multi-Headed Attention
This is an implementation of relative multi-headed attention from paper
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
in [PyTorch](https://pytorch.org).
"""