diff --git a/docs/sitemap.xml b/docs/sitemap.xml index 2a46b7ad..c9abf985 100644 --- a/docs/sitemap.xml +++ b/docs/sitemap.xml @@ -337,7 +337,14 @@ https://nn.labml.ai/transformers/switch/index.html - 2021-01-30T16:30:00+00:00 + 2021-02-01T16:30:00+00:00 + 1.00 + + + + + https://nn.labml.ai/transformers/switch/readme.html + 2021-02-01T16:30:00+00:00 1.00 diff --git a/docs/transformers/feedback/README.html b/docs/transformers/feedback/README.html new file mode 100644 index 00000000..e77c8843 --- /dev/null +++ b/docs/transformers/feedback/README.html @@ -0,0 +1,141 @@ + + + + + + + + + + + + + + + + + + + + + + + Feedback Transformer + + + + + + + + +
+
+
+
+

+ home + transformers + feedback +

+

+ + + Github + + Join Slack + + Twitter +

+
+
+
+
+ +

Feedback Transformer

+

This is a PyTorch implementation of the paper Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.

+

Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. The Feedback Transformer pays attention to the outputs of all layers in previous steps. So this adds recurrence, and we need to process token-by-token. This slows down the training significantly (about 5X - 10X depending on the sequence length). However, when predicting, the Feedback Transformer is faster because you can predict the next token if you cache the memory vectors.
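To make the recurrence concrete, here is a minimal sketch of the token-by-token loop (illustrative only, not this repo's API; `embed`, `layers` and `aggregate` are hypothetical placeholders):

```python
import torch

def feedback_forward(embed, layers, aggregate, tokens):
    """Sketch of token-by-token processing with a shared memory of previous steps."""
    # tokens: [seq_len, batch_size] token ids
    memory = []   # one aggregated memory vector per previous step
    outputs = []
    for t in range(tokens.shape[0]):
        x = embed(tokens[t])                           # [batch_size, d_model], current token only
        mem = torch.stack(memory) if memory else None  # [t, batch_size, d_model]; None at the first step
        layer_outputs = [x]
        for layer in layers:
            x = layer(x, mem)          # every layer attends to the memories of *all* previous steps
            layer_outputs.append(x)
        memory.append(aggregate(layer_outputs))        # e.g. a weighted sum of all layer outputs
        outputs.append(x)
    return torch.stack(outputs)                        # [seq_len, batch_size, d_model]
```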

+

In order to speed up the training, the paper discusses starting with a short sequence length and gradually increasing it. They also discuss using a pretrained parallel transformer as the starting point.
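For instance, a simple linear warm-up of the sequence length could look like this (a sketch; the schedule shape and the names `start_len`, `final_len` and `warmup_steps` are assumptions, not taken from the paper or this repo):

```python
def seq_len_schedule(step: int, start_len: int = 32, final_len: int = 512,
                     warmup_steps: int = 10_000) -> int:
    """Linearly grow the training sequence length from start_len to final_len."""
    # e.g. cut each training batch to seq_len_schedule(step) tokens
    if step >= warmup_steps:
        return final_len
    return start_len + (final_len - start_len) * step // warmup_steps
```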

+

The original feedback transformer doesn’t keep the outputs of all layers. Instead, it keeps a weighted sum of the outputs of all layers. This reduces the memory used for caching during prediction. The first half of this file implements this.
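A minimal sketch of such a weighted sum, assuming softmax-normalized scalar weights (the class and parameter names here are illustrative, not the ones used in this implementation):

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Collapse the outputs of all layers into a single memory vector per step."""

    def __init__(self, n_layers: int):
        super().__init__()
        # one scalar weight per layer output (including the embedding layer)
        self.weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, layer_outputs):
        # layer_outputs: list of [batch_size, d_model] tensors
        stacked = torch.stack(layer_outputs)          # [n_layers + 1, batch_size, d_model]
        w = torch.softmax(self.weights, dim=0)        # normalized mixing weights
        return torch.einsum('l,lbd->bd', w, stacked)  # single [batch_size, d_model] memory vector
```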

+

The updated feedback transformer shares the weights used to calculate keys and values among the layers. We then calculate the keys and values for each step only once and keep them cached. The second half of this file implements this. We implemented a custom PyTorch function to improve performance.
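A rough sketch of the shared key/value caching idea (class and method names are made up for illustration; the actual implementation and its custom PyTorch function are more involved):

```python
import torch
import torch.nn as nn

class SharedKVCache(nn.Module):
    """One key/value projection shared by every layer, computed once per step and cached."""

    def __init__(self, d_model: int, d_attn: int):
        super().__init__()
        self.key = nn.Linear(d_model, d_attn)    # shared across all layers
        self.value = nn.Linear(d_model, d_attn)  # shared across all layers
        self.keys, self.values = [], []

    def append(self, memory_t: torch.Tensor):
        # memory_t: [batch_size, d_model] aggregated memory for the current step
        self.keys.append(self.key(memory_t))
        self.values.append(self.value(memory_t))

    def get(self):
        # Every layer reuses these stacked projections instead of re-projecting the memory.
        return torch.stack(self.keys), torch.stack(self.values)
```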

+

Here’s the training code and a notebook for training a feedback transformer on the Tiny Shakespeare dataset.

+

Colab Notebook

+

Open In Colab
View Run

+
+
+ +
+
+
\ No newline at end of file
diff --git a/docs/transformers/feedback/index.html b/docs/transformers/feedback/index.html
index 3c0322be..0cb8209d 100644
--- a/docs/transformers/feedback/index.html
+++ b/docs/transformers/feedback/index.html
@@ -78,9 +78,9 @@

 Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. Feedback transformer pays attention to the output of all layers in previous steps.
-So this adds recurrence and we need to process token-by-token.
+So this adds recurrence, and we need to process token-by-token.
 This slows down the training significantly (about 5X - 10X depending on the sequence length).
-However when predicting Feedback Transformer is faster because you can predict the next token
+However, when predicting Feedback Transformer is faster because you can predict the next token
 if you cache the memory vectors.

 In order to speed up the training the paper discusses starting with a short sequence length and gradually increasing it.
diff --git a/labml_nn/transformers/feedback/README.md b/labml_nn/transformers/feedback/README.md
new file mode 100644
index 00000000..1f02d9ad
--- /dev/null
+++ b/labml_nn/transformers/feedback/README.md
@@ -0,0 +1,36 @@
+# [Feedback Transformer](https://nn.labml.ai/transformers/feedback/index.html)
+
+This is a [PyTorch](https://pytorch.org) implementation of the paper
+[Accessing Higher-level Representations in Sequential Transformers with Feedback Memory](https://arxiv.org/abs/2002.09402).
+
+Normal transformers process tokens in parallel. Each transformer layer pays attention
+to the outputs of the previous layer.
+Feedback transformer pays attention to the output of all layers in previous steps.
+So this adds recurrence, and we need to process token-by-token.
+This slows down the training significantly (about 5X - 10X depending on the sequence length).
+However, when predicting Feedback Transformer is faster because you can predict the next token
+if you cache the memory vectors.
+
+In order to speed up the training the paper discusses starting with a short sequence length and
+gradually increasing it.
+They also discuss using a pretrained parallel transformer as the starting point.
+
+The original feedback transformer doesn't keep the outputs of all layers.
+Instead it keeps weighted sum of the output of all layers.
+This reduces the memory used for caching during prediction.
+The first half of this file implements this.
+
+The updated feedback transformer shares weights used
+to calculate keys and values among the layers.
+We then calculate the keys and values for each step only once and keep
+them cached.
+The [second half](#shared_kv) of this file implements this.
+We implemented a custom PyTorch function to improve performance.
+
+Here's [the training code](experiment.html) and a notebook for training a feedback transformer on Tiny Shakespeare dataset.
+
+[Colab Notebook](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/feedback/experiment.ipynb)
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/feedback/experiment.ipynb)
+[![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d8eb9416530a11eb8fb50242ac1c0002)
+"""
\ No newline at end of file
diff --git a/labml_nn/transformers/feedback/__init__.py b/labml_nn/transformers/feedback/__init__.py
index cdddf7c4..29bef0ba 100644
--- a/labml_nn/transformers/feedback/__init__.py
+++ b/labml_nn/transformers/feedback/__init__.py
@@ -13,9 +13,9 @@ This is a [PyTorch](https://pytorch.org) implementation of the paper
 Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. Feedback transformer pays attention to the output of all layers in previous steps.
-So this adds recurrence and we need to process token-by-token.
+So this adds recurrence, and we need to process token-by-token.
 This slows down the training significantly (about 5X - 10X depending on the sequence length).
-However when predicting Feedback Transformer is faster because you can predict the next token
+However, when predicting Feedback Transformer is faster because you can predict the next token
 if you cache the memory vectors.
In order to speed up the training the paper discusses starting with a short sequence length and