mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
synced 2025-11-03 13:57:48 +08:00
paper links
@@ -10,7 +10,7 @@ summary: >

 This module contains [PyTorch](https://pytorch.org/)
 implementations and explanations of original transformer
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762),
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762),
 and derivatives and enhancements of it.

 * [Multi-head attention](mha.html)
@@ -34,34 +34,34 @@ This is an implementation of GPT-2 architecture.
 ## [GLU Variants](glu_variants/simple.html)

 This is an implementation of the paper
-[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+[GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).

 ## [kNN-LM](knn/index.html)

 This is an implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).

 ## [Feedback Transformer](feedback/index.html)

 This is an implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 ## [Switch Transformer](switch/index.html)

 This is a miniature implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training but we implement the concept of switching as described in the paper.

 ## [Fast Weights Transformer](fast_weights/index.html)

 This is an implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).

 ## [FNet: Mixing Tokens with Fourier Transforms](fnet/index.html)

 This is an implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 ## [Attention Free Transformer](aft/index.html)

@@ -71,7 +71,7 @@ This is an implementation of the paper
 ## [Masked Language Model](mlm/index.html)

 This is an implementation of Masked Language Model used for pre-training in paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## [MLP-Mixer: An all-MLP Architecture for Vision](mlp_mixer/index.html)

@@ -86,7 +86,7 @@ This is an implementation of the paper
 ## [Vision Transformer (ViT)](vit/index.html)

 This is an implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).
 """

 from .configs import TransformerConfigs
@@ -7,7 +7,7 @@ summary: >

 # Transformer Auto-Regression Experiment

-This trains a simple transformer introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+This trains a simple transformer introduced in [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
 on an NLP auto-regression task (with Tiny Shakespeare dataset).
 """

@@ -9,7 +9,7 @@ summary: >
 # Compressive Transformer

 This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
 in [PyTorch](https://pytorch.org).

 This is an extension of [Transformer XL](../xl/index.html) where past memories
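For orientation, the extension this docstring refers to compresses the oldest memories by a fixed rate instead of discarding them; a strided convolution is one of the compression functions the paper evaluates. The sketch below is illustrative only (the names `compress`, the sizes and the rate are made up) and is not the repository's implementation.

```python
import torch
import torch.nn as nn

# Compress the oldest memory slots by a rate of c before dropping them (a sketch).
d_model, c = 16, 4
compress = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)  # one possible compression function

old_memories = torch.randn(1, 32, d_model)                        # (batch, time, d_model)
compressed = compress(old_memories.transpose(1, 2)).transpose(1, 2)
print(compressed.shape)                                           # torch.Size([1, 8, 16]) -- 4x fewer slots
```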
@@ -1,7 +1,7 @@
 # [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)

 This is an implementation of
-[Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+[Compressive Transformers for Long-Range Sequence Modelling](https://papers.labml.ai/paper/1911.05507)
 in [PyTorch](https://pytorch.org).

 This is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories
@@ -66,7 +66,7 @@ def _ffn_activation_gelu():

 $$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$

-It was introduced in paper [Gaussian Error Linear Units](https://arxiv.org/abs/1606.08415).
+It was introduced in paper [Gaussian Error Linear Units](https://papers.labml.ai/paper/1606.08415).
 """
 return nn.GELU()

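As a quick check of the formula quoted in this docstring, $x \Phi(x)$ can be computed directly from the Gaussian CDF and compared against `nn.GELU()`; this is a sanity-check sketch, not part of the repository's code.

```python
import torch

def gelu_by_hand(x: torch.Tensor) -> torch.Tensor:
    # x * Phi(x), where Phi is the standard normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))

x = torch.linspace(-3.0, 3.0, steps=101)
print(torch.allclose(gelu_by_hand(x), torch.nn.functional.gelu(x), atol=1e-6))  # True
```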
@@ -86,7 +86,7 @@ def _feed_forward(c: FeedForwardConfigs):

 # ## GLU Variants
 # These are variants with gated hidden layers for the FFN
-# as introduced in paper [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).
+# as introduced in paper [GLU Variants Improve Transformer](https://papers.labml.ai/paper/2002.05202).
 # We have omitted the bias terms as specified in the paper.

 # ### FFN with Gated Linear Units
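To make "gated hidden layers" concrete: a GLU-variant FFN computes two parallel projections of the input and multiplies one of them (passed through an activation) into the other, with no bias terms. The class below is a hedged sketch of the GEGLU variant; the names (`GatedFFN`, `w1`, `v`, `w2`) are illustrative and not the repository's configurable `FeedForward` module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """GEGLU-style FFN: (GELU(x W1) * x V) W2, with bias terms omitted as in the paper."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # activated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)) * self.v(x))

y = GatedFFN(64, 256)(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```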
@@ -9,7 +9,7 @@ summary: >
 # Fast weights transformer

 The paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174)
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174)
 finds similarities between linear self-attention and fast weight systems
 and makes modifications to self-attention update rule based on that.
 It also introduces a simpler, yet effective kernel function.
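The modified update rule the docstring mentions can be sketched as a delta-rule write into an outer-product memory: retrieve what is currently stored under a key, then write only the difference. The feature map `phi` below is a simple stand-in (the paper proposes its own DPFP kernel), and all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # stand-in positive, normalized feature map; the paper introduces its own (DPFP) kernel
    f = F.elu(x) + 1.0
    return f / f.sum()

d = 16
W = torch.zeros(d, d)                       # fast-weight memory, one outer-product write per token
for _ in range(8):                           # one token per step
    k, v, q = torch.randn(d), torch.randn(d), torch.randn(d)
    beta = torch.rand(())                    # write strength (learned in the paper)
    v_old = W @ phi(k)                       # value currently stored under this key
    W = W + beta * torch.outer(v - v_old, phi(k))   # delta-rule update instead of a plain sum
    y = W @ phi(q)                           # read-out for the current query
```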
@@ -1,7 +1,7 @@
 # [Fast weights transformer](https://nn.labml.ai/transformers/fast_weights/index.html)

 This is an annotated implementation of the paper
-[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://arxiv.org/abs/2102.11174).
+[Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch](https://papers.labml.ai/paper/2102.11174).

 Here is the [annotated implementation](https://nn.labml.ai/transformers/fast_weights/index.html).
 Here are [the training code](https://nn.labml.ai/transformers/fast_weights/experiment.html)
@@ -28,7 +28,7 @@ $$x \Phi(x)$$ where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
 ### Gated Linear Units

 This is a generic implementation that supports different variants including
-[Gated Linear Units](https://arxiv.org/abs/2002.05202) (GLU).
+[Gated Linear Units](https://papers.labml.ai/paper/2002.05202) (GLU).
 We have also implemented experiments on these:

 * [experiment that uses `labml.configs`](glu_variants/experiment.html)
@@ -8,7 +8,7 @@ summary: >
 # Feedback Transformer

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 Normal transformers process tokens in parallel. Each transformer layer pays attention
 to the outputs of the previous layer.
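The contrast this docstring sets up can be sketched in a few lines: instead of layer l attending to layer l-1, every layer at step t reads the same shared memory, and that memory stores a mix of all layers' outputs from earlier steps, which forces tokens to be processed one step at a time. Everything below (the toy linear layers, the mean over memory standing in for attention) is illustrative, not the repository's model.

```python
import torch
import torch.nn as nn

d = 8
layers = [nn.Linear(2 * d, d) for _ in range(3)]              # toy stand-ins for transformer layers
mix = torch.softmax(torch.zeros(len(layers) + 1), dim=0)      # layer-mixing weights (learned in the paper)
memory = []                                                   # one shared vector per past step

for x_t in torch.randn(5, d):                                 # tokens are processed sequentially
    ctx = torch.stack(memory).mean(0) if memory else torch.zeros(d)  # crude stand-in for attention over memory
    h, outs = x_t, [x_t]
    for layer in layers:
        h = torch.tanh(layer(torch.cat([h, ctx])))            # every layer conditions on the same memory
        outs.append(h)
    memory.append(sum(w * o for w, o in zip(mix, outs)))      # the new memory entry mixes all layers' outputs
```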
@@ -1,7 +1,7 @@
 # [Feedback Transformer](https://nn.labml.ai/transformers/feedback/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://papers.labml.ai/paper/2002.09402).

 Normal transformers process tokens in parallel. Each transformer layer pays attention
 to the outputs of the previous layer.
@@ -8,7 +8,7 @@ summary: >
 # FNet: Mixing Tokens with Fourier Transforms

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 This paper replaces the [self-attention layer](../mha.html) with two
 [Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to
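The replacement described here is parameter-free: apply a DFT along the hidden dimension, another along the sequence dimension, and keep the real part; the feed-forward blocks, norms and residuals stay as in a normal transformer. A minimal sketch of the mixing step:

```python
import torch

def fnet_mix(x: torch.Tensor) -> torch.Tensor:
    """Token mixing as in the FNet paper: FFT over the hidden dim, then the sequence dim, keep the real part."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real

x = torch.randn(2, 10, 16)   # (batch, seq, d_model)
y = fnet_mix(x)              # same shape; no learned parameters in the mixing step
```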
@@ -1,7 +1,7 @@
 # [FNet: Mixing Tokens with Fourier Transforms](https://nn.labml.ai/transformers/fnet/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824).
+[FNet: Mixing Tokens with Fourier Transforms](https://papers.labml.ai/paper/2105.03824).

 This paper replaces the [self-attention layer](https://nn.labml.ai/transformers//mha.html) with two
 [Fourier transforms](https://en.wikipedia.org/wiki/Discrete_Fourier_transform) to
@@ -12,7 +12,7 @@ summary: >
 # k-Nearest Neighbor Language Models

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[Generalization through Memorization: Nearest Neighbor Language Models](https://arxiv.org/abs/1911.00172).
+[Generalization through Memorization: Nearest Neighbor Language Models](https://papers.labml.ai/paper/1911.00172).
 It uses k-nearest neighbors to improve perplexity of autoregressive transformer models.

 An autoregressive language model estimates $p(w_t | \color{yellowgreen}{c_t})$,
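At test time the method retrieves the k nearest stored contexts, turns their distances into a distribution over the tokens that followed them, and interpolates that with the model's own distribution, $p = \lambda \, p_{kNN} + (1 - \lambda) \, p_{LM}$. A hedged sketch (the interpolation and the softmax over negative distances follow the paper; the function name and shapes are illustrative):

```python
import torch

def knn_lm_distribution(p_lm: torch.Tensor, dists: torch.Tensor, targets: torch.Tensor,
                        vocab_size: int, lam: float = 0.5) -> torch.Tensor:
    """p(w) = lam * p_kNN(w) + (1 - lam) * p_LM(w); closer neighbours get more probability mass."""
    p_knn = torch.zeros(vocab_size)
    p_knn.scatter_add_(0, targets, torch.softmax(-dists, dim=0))
    return lam * p_knn + (1.0 - lam) * p_lm

p_lm = torch.softmax(torch.randn(100), dim=0)   # model's next-token distribution
dists = torch.rand(8)                           # distances of the 8 retrieved contexts
targets = torch.randint(0, 100, (8,))           # the token that followed each retrieved context
p = knn_lm_distribution(p_lm, dists, targets, vocab_size=100)
```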
@@ -9,7 +9,7 @@ summary: >
 # Multi-Headed Attention (MHA)

 This is a tutorial/implementation of multi-headed attention
-from paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+from paper [Attention Is All You Need](https://papers.labml.ai/paper/1706.03762)
 in [PyTorch](https://pytorch.org/).
 The implementation is inspired from [Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html).

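The core computation behind multi-headed attention is scaled dot-product attention, with the heads living in an extra batch dimension. A compact reference sketch (not the annotated implementation itself):

```python
import math
import torch

def attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V -- the heart of every head."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 4, 10, 16)   # (batch, heads, seq, d_k): heads are just another batch dim
out = attention(q, k, v)                # (2, 4, 10, 16); heads are concatenated and re-projected afterwards
```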
@@ -9,7 +9,7 @@ summary: >

 This is a [PyTorch](https://pytorch.org) implementation of the Masked Language Model (MLM)
 used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## BERT Pretraining

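For context, the corruption step used in BERT's MLM pre-training picks roughly 15% of positions, replaces 80% of those with a `[MASK]` token and 10% with a random token, leaves the remaining 10% unchanged, and computes the loss only at the picked positions. A rough sketch (the token ids and the `-100` ignore-index are illustrative):

```python
import torch

vocab_size, mask_id = 1000, 3
tokens = torch.randint(5, vocab_size, (1, 32))
labels = tokens.clone()

picked = torch.rand(tokens.shape) < 0.15           # ~15% of positions are predicted
labels[~picked] = -100                             # everything else is ignored by cross-entropy

r = torch.rand(tokens.shape)
tokens[picked & (r < 0.8)] = mask_id               # 80% of picked positions become [MASK]
rand_pos = picked & (r >= 0.8) & (r < 0.9)         # 10% become a random token, the rest stay unchanged
tokens[rand_pos] = torch.randint(5, vocab_size, (int(rand_pos.sum()),))
```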
@@ -2,7 +2,7 @@

 This is a [PyTorch](https://pytorch.org) implementation of Masked Language Model (MLM)
 used to pre-train the BERT model introduced in the paper
-[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
+[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://papers.labml.ai/paper/1810.04805).

 ## BERT Pretraining

@@ -71,7 +71,7 @@ class TransformerLayer(Module):
 Alternative is to do a layer normalization after adding the residuals.
 But we found this to be less stable when training.
 We found a detailed discussion about this in the paper
-[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745).
+[On Layer Normalization in the Transformer Architecture](https://papers.labml.ai/paper/2002.04745).
 """

 def __init__(self, *,
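To spell out the two orderings this docstring compares (`sublayer` below stands in for attention or the FFN; this is a sketch, not the `TransformerLayer` class above):

```python
import torch
import torch.nn as nn

norm, sublayer = nn.LayerNorm(16), nn.Linear(16, 16)   # sublayer stands in for attention or the FFN
x = torch.randn(2, 10, 16)

pre_norm = x + sublayer(norm(x))    # normalize inside the residual branch (what this implementation does)
post_norm = norm(x + sublayer(x))   # normalize after adding the residual (the less stable alternative)
```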
@@ -8,7 +8,7 @@ summary: >
 # Switch Transformer

 This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.

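The switching concept referred to here is top-1 routing: a small router picks exactly one expert FFN per token, and the expert's output is scaled by the router probability. A single-GPU sketch with illustrative names and sizes (no load-balancing loss or capacity limits):

```python
import torch
import torch.nn as nn

d_model, n_experts = 16, 4
experts = nn.ModuleList([nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
                         for _ in range(n_experts)])
router = nn.Linear(d_model, n_experts)

x = torch.randn(32, d_model)                       # a batch of tokens
probs = torch.softmax(router(x), dim=-1)
gate, expert_idx = probs.max(dim=-1)               # each token is routed to exactly one expert

y = torch.zeros_like(x)
for e in range(n_experts):
    sel = expert_idx == e
    if sel.any():
        y[sel] = gate[sel, None] * experts[e](x[sel])   # scale by the routing probability, as in the paper
```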
@@ -1,7 +1,7 @@
 # [Switch Transformer](https://nn.labml.ai/transformers/switch/index.html)

 This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
-[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
+[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://papers.labml.ai/paper/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.

@@ -9,7 +9,7 @@ summary: >
 # Vision Transformer (ViT)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).

 Vision transformer applies a pure transformer to images
 without any convolution layers.
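Concretely, the image is cut into fixed-size patches, each patch is flattened and linearly embedded, and the resulting token sequence (plus position embeddings and a classification token) is fed to a standard transformer encoder. A patch-embedding sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                  # (batch, channels, height, width)
p = 16                                             # patch size

patches = img.unfold(2, p, p).unfold(3, p, p)      # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)   # (1, 196, 768)
embedded = nn.Linear(3 * p * p, 768)(tokens)       # linear patch embedding; position embeddings and a
                                                   # [CLS] token are added before the transformer encoder
```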
@@ -1,7 +1,7 @@
 # [Vision Transformer (ViT)](https://nn.labml.ai/transformer/vit/index.html)

 This is a [PyTorch](https://pytorch.org) implementation of the paper
-[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://arxiv.org/abs/2010.11929).
+[An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale](https://papers.labml.ai/paper/2010.11929).

 Vision transformer applies a pure transformer to images
 without any convolution layers.
@@ -9,7 +9,7 @@ summary: >
 # Transformer XL

 This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).

 Transformer has a limited attention span,
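Transformer XL extends that span with segment-level recurrence: hidden states from the previous segment are cached (with gradients stopped) and the current segment attends over the concatenation of memory and current states, using relative positional encodings. A stripped-down single-head sketch that ignores masking and the relative encodings:

```python
import torch

d, seg_len = 16, 8
memory = torch.zeros(0, d)                              # cached states from the previous segment

for segment in torch.randn(3, seg_len, d):              # three consecutive segments
    kv = torch.cat([memory.detach(), segment], dim=0)   # attend over [memory, current]
    scores = segment @ kv.T / d ** 0.5                  # the paper uses relative positions here
    out = torch.softmax(scores, dim=-1) @ kv
    memory = segment                                    # this segment becomes memory for the next one
```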
@@ -1,7 +1,7 @@
 # [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)

 This is an implementation of
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).

 Transformer has a limited attention span,
@@ -9,7 +9,7 @@ summary: >
 # Relative Multi-Headed Attention

 This is an implementation of relative multi-headed attention from paper
-[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860)
+[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://papers.labml.ai/paper/1901.02860)
 in [PyTorch](https://pytorch.org).
 """
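Written out without the efficiency tricks, the relative attention score from the Transformer-XL paper decomposes into four terms: content-content, content-position, a global content bias, and a global position bias. A sketch with the projections folded into `q`, `k`, `r` (names and sizes illustrative):

```python
import torch

n, d = 6, 16
q, k = torch.randn(n, d), torch.randn(n, d)   # projected queries and keys
r = torch.randn(2 * n - 1, d)                 # embeddings for relative offsets -(n-1) .. n-1
u, v = torch.randn(d), torch.randn(d)         # learned global biases from the paper

offset = torch.arange(n)[:, None] - torch.arange(n)[None, :] + (n - 1)   # index of offset i - j

# A_ij = q_i.k_j  +  q_i.r_{i-j}  +  u.k_j  +  v.r_{i-j}
scores = (q @ k.T
          + (q @ r.T)[torch.arange(n)[:, None], offset]
          + k @ u
          + (r @ v)[offset])
attn = torch.softmax(scores / d ** 0.5, dim=-1)
```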