	{
 "<h1><a href=\"https://nn.labml.ai/transformers/feedback/index.html\">Feedback Transformer</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of the paper <a href=\"https://arxiv.org/abs/2002.09402\">Accessing Higher-level Representations in Sequential Transformers with Feedback Memory</a>.</p>\n<p>Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. Feedback transformer pays attention to the output of all layers in previous steps. So this adds recurrence, and we need to process token-by-token. This slows down the training significantly (about 5X - 10X depending on the sequence length). However, when predicting Feedback Transformer is faster because you can predict the next token if you cache the memory vectors.</p>\n<p>In order to speed up the training the paper discusses starting with a short sequence length and gradually increasing it. They also discuss using a pretrained parallel transformer as the starting point.</p>\n<p>The original feedback transformer doesn't keep the outputs of all layers. Instead it keeps weighted sum of the output of all layers. This reduces the memory used for caching during prediction. The first half of this file implements this.</p>\n<p>The updated feedback transformer shares weights used to calculate keys and values among the layers. We then calculate the keys and values for each step only once and keep them cached. The <a href=\"#shared_kv\">second half</a> of this file implements this. We implemented a custom PyTorch function to improve performance.</p>\n<p>Here's <a href=\"experiment.html\">the training code</a> and a notebook for training a feedback transformer on Tiny Shakespeare dataset.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb\">Colab Notebook</a></p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/transformers/feedback/index.html\">\u53cd\u9988\u53d8\u538b\u5668</a></h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">PyTorch \u5bf9</a>\u300a\u4f7f\u7528<a href=\"https://arxiv.org/abs/2002.09402\">\u53cd\u9988\u5b58\u50a8\u5668\u8bbf\u95ee\u5e8f\u5217\u53d8\u538b\u5668\u4e2d\u7684\u66f4\u9ad8\u5c42\u6b21\u8868\u793a\u300b\u4e00\u6587\u7684 PyT</a> orch \u5b9e\u73b0\u3002</p>\n<p>\u666e\u901a\u7684\u53d8\u538b\u5668\u4f1a\u5e76\u884c\u5904\u7406\u4ee3\u5e01\u3002\u6bcf\u4e2a\u53d8\u538b\u5668\u5c42\u90fd\u6ce8\u610f\u524d\u4e00\u5c42\u7684\u8f93\u51fa\u3002\u53cd\u9988\u53d8\u538b\u5668\u6ce8\u610f\u524d\u9762\u6b65\u9aa4\u4e2d\u6240\u6709\u5c42\u7684\u8f93\u51fa\u3002\u56e0\u6b64\uff0c\u8fd9\u4f1a\u589e\u52a0\u91cd\u590d\u6027\uff0c\u6211\u4eec\u9700\u8981\u9010\u4e2a\u4ee3\u5e01\u8fdb\u884c\u5904\u7406\u3002\u8fd9\u4f1a\u663e\u8457\u51cf\u6162\u8bad\u7ec3\u901f\u5ea6\uff08\u5927\u7ea6 5 \u5230 10 
\u500d\uff0c\u5177\u4f53\u53d6\u51b3\u4e8e\u5e8f\u5217\u957f\u5ea6\uff09\u3002\u4f46\u662f\uff0c\u5728\u9884\u6d4b\u53cd\u9988\u53d8\u6362\u5668\u65f6\uff0c\u901f\u5ea6\u66f4\u5feb\uff0c\u56e0\u4e3a\u5982\u679c\u4f60\u7f13\u5b58\u4e86\u5185\u5b58\u5411\u91cf\uff0c\u4f60\u53ef\u4ee5\u9884\u6d4b\u4e0b\u4e00\u4e2a\u6807\u8bb0\u3002</p>\n<p>\u4e3a\u4e86\u52a0\u5feb\u8bad\u7ec3\u901f\u5ea6\uff0c\u672c\u6587\u8ba8\u8bba\u4e86\u4ece\u77ed\u5e8f\u5217\u957f\u5ea6\u5f00\u59cb\u5e76\u9010\u6e10\u589e\u52a0\u5e8f\u5217\u957f\u5ea6\u7684\u95ee\u9898\u3002\u4ed6\u4eec\u8fd8\u8ba8\u8bba\u4e86\u4f7f\u7528\u9884\u8bad\u7ec3\u7684\u5e76\u884c\u53d8\u538b\u5668\u4f5c\u4e3a\u8d77\u70b9\u3002</p>\n<p>\u539f\u59cb\u53cd\u9988\u53d8\u538b\u5668\u4e0d\u4fdd\u7559\u6240\u6709\u5c42\u7684\u8f93\u51fa\u3002\u76f8\u53cd\uff0c\u5b83\u4fdd\u7559\u6240\u6709\u56fe\u5c42\u8f93\u51fa\u7684\u52a0\u6743\u603b\u548c\u3002\u8fd9\u51cf\u5c11\u4e86\u9884\u6d4b\u671f\u95f4\u7528\u4e8e\u7f13\u5b58\u7684\u5185\u5b58\u3002\u8fd9\u4e2a\u6587\u4ef6\u7684\u524d\u534a\u90e8\u5206\u5b9e\u73b0\u4e86\u8fd9\u4e00\u70b9\u3002</p>\n<p>\u66f4\u65b0\u540e\u7684\u53cd\u9988\u53d8\u538b\u5668\u5728\u5404\u5c42\u4e4b\u95f4\u5171\u4eab\u7528\u4e8e\u8ba1\u7b97\u5bc6\u94a5\u548c\u503c\u7684\u6743\u91cd\u3002\u7136\u540e\uff0c\u6211\u4eec\u53ea\u8ba1\u7b97\u6bcf\u4e2a\u6b65\u9aa4\u7684\u952e\u548c\u503c\u4e00\u6b21\uff0c\u5e76\u5c06\u5176\u7f13\u5b58\u3002\u8fd9\u4e2a\u6587\u4ef6\u7684<a href=\"#shared_kv\">\u540e\u534a</a>\u90e8\u5206\u5b9e\u73b0\u4e86\u8fd9\u4e00\u70b9\u3002\u6211\u4eec\u5b9e\u73b0\u4e86\u4e00\u4e2a\u81ea\u5b9a\u4e49 PyTorch \u51fd\u6570\u6765\u63d0\u9ad8\u6027\u80fd\u3002</p>\n<p>\u8fd9\u662f<a href=\"experiment.html\">\u8bad\u7ec3\u4ee3\u7801\u548c\u4e00\u672c</a>\u7528\u4e8e\u5728 Tiny Shakespeare \u6570\u636e\u96c6\u4e0a\u8bad\u7ec3\u53cd\u9988\u8f6c\u6362\u5668\u7684\u7b14\u8bb0\u672c\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb\">Colab \u7b14\u8bb0\u672c</a></p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/feedback/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
 "Feedback Transformer": "\u53cd\u9988\u53d8\u538b\u5668"
}
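
A minimal PyTorch sketch of the feedback-memory idea described in the text above, not the repository's implementation: each step's embedding and layer outputs are combined into a single memory vector by a learned softmax-weighted sum, every layer at later steps attends to the stack of those memory vectors, and the sequence is therefore processed one token at a time. The class and parameter names (FeedbackLayerSketch, FeedbackTransformerSketch, layer_weights) are illustrative.

# Sketch only: illustrates the weighted-sum feedback memory, not the repo's code.
import torch
import torch.nn as nn


class FeedbackLayerSketch(nn.Module):
    """One layer: attend to the feedback memory of earlier steps, then feed-forward."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, x, mem):
        # x: [1, batch, d_model] (the current token); mem: [steps, batch, d_model] or None
        if mem is not None:
            attn_out, _ = self.attn(self.norm_attn(x), mem, mem, need_weights=False)
            x = x + attn_out
        return x + self.ff(self.norm_ff(x))


class FeedbackTransformerSketch(nn.Module):
    """Processes the sequence token by token; each step leaves behind one memory vector."""

    def __init__(self, d_model=64, n_layers=4, n_heads=4, d_ff=256):
        super().__init__()
        self.layers = nn.ModuleList(
            [FeedbackLayerSketch(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        # One mixing weight for the input embedding plus one per layer output.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x_seq):
        # x_seq: [seq_len, batch, d_model] token embeddings
        mem = []       # one memory vector per processed step, each [batch, d_model]
        outputs = []
        for x in x_seq.unbind(dim=0):                      # recurrence over time steps
            x = x.unsqueeze(0)                             # [1, batch, d_model]
            mem_stack = torch.stack(mem) if mem else None  # [steps, batch, d_model]
            per_layer = [x]
            for layer in self.layers:
                x = layer(x, mem_stack)
                per_layer.append(x)
            # Softmax-weighted sum over {embedding, layer 1, ..., layer L} -> memory vector.
            stacked = torch.cat(per_layer, dim=0)          # [n_layers + 1, batch, d_model]
            weights = torch.softmax(self.layer_weights, dim=0)
            mem.append(torch.einsum('l,lbd->bd', weights, stacked))
            outputs.append(x.squeeze(0))
        return torch.stack(outputs)                        # [seq_len, batch, d_model]


# Example: a 10-token sequence with batch size 2 and model dimension 64.
model = FeedbackTransformerSketch()
out = model(torch.randn(10, 2, 64))
print(out.shape)   # torch.Size([10, 2, 64])

Because each step contributes only a single cached memory vector rather than one vector per layer, prediction can reuse the cache cheaply, which is the memory saving the text describes; the shared key/value variant in the second half of the file goes further by caching keys and values directly.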
			
		