{
"<h1><a href=\"https://nn.labml.ai/transformers/compressive/index.html\">Compressive Transformer</a></h1>\n<p>This is an implementation of <a href=\"https://arxiv.org/abs/1911.05507\">Compressive Transformers for Long-Range Sequence Modelling</a> in <a href=\"https://pytorch.org\">PyTorch</a>.</p>\n<p>This is an extension of <a href=\"https://nn.labml.ai/transformers/xl/index.html\">Transformer XL</a> where past memories are compressed to give a longer attention range. That is, the furthest <span translate=no>_^_0_^_</span> memories are compressed into <span translate=no>_^_1_^_</span> memories, where <span translate=no>_^_2_^_</span> is the compression rate.</p>\n<h2>Compression operation</h2>\n<p>The compression operation is defined as <span translate=no>_^_3_^_</span>. The paper introduces multiple choices for <span translate=no>_^_4_^_</span> and we have only implemented 1D convolution which seems to give the best results. Each layer has a separate compression operation <span translate=no>_^_5_^_</span> where <span translate=no>_^_6_^_</span> is the layer number.</p>\n<h2>Training compression operation</h2>\n<p>Since training compression with BPTT requires maintaining a very large computational graph (many time steps), the paper proposes an <em>auto-encoding loss</em> and an <em>attention reconstruction loss</em>. The auto-encoding loss decodes the original memories from the compressed memories and calculates the loss. Attention reconstruction loss computes the multi-headed attention results on the compressed memory and on uncompressed memory and gets a mean squared error between them. We have implemented the latter here since it gives better results.</p>\n<p>This implementation uses pre-layer normalization while the paper uses post-layer normalization. Pre-layer norm does the layer norm before <a href=\"../feedforward.html\">FFN</a> and self-attention, and the pass-through in the residual connection is not normalized. This is supposed to be more stable in standard transformer setups.</p>\n<p>Here are <a href=\"https://nn.labml.ai/transformers/compressive/experiment.html\">the training code</a> and a notebook for training a compressive transformer model on the Tiny Shakespeare dataset.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/compressive/experiment.ipynb\"><span translate=no>_^_7_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/transformers/compressive/index.html\">Compressive Transformer</a></h1>\n<p>This is an implementation of <a href=\"https://arxiv.org/abs/1911.05507\">Compressive Transformers for Long-Range Sequence Modelling</a> in <a href=\"https://pytorch.org\">PyTorch</a>.</p>\n<p>This is an extension of <a href=\"https://nn.labml.ai/transformers/xl/index.html\">Transformer XL</a>, which compresses past memories to give a longer attention range. That is, the furthest <span translate=no>_^_0_^_</span> memories are compressed into <span translate=no>_^_1_^_</span> memories, where <span translate=no>_^_2_^_</span> is the compression rate.</p>\n<h2>Compression operation</h2>\n<p>The compression operation is defined as <span translate=no>_^_3_^_</span>. The paper introduces multiple choices for <span translate=no>_^_4_^_</span>, and we have only implemented 1D convolution, which seems to give the best results. Each layer has a separate compression operation <span translate=no>_^_5_^_</span>, where <span translate=no>_^_6_^_</span> is the layer number.</p>\n<h2>Training the compression operation</h2>\n<p>Since training compression with BPTT requires maintaining a very large computational graph (many time steps), the paper proposes an <em>auto-encoding loss</em> and an <em>attention reconstruction loss</em>. The auto-encoding loss decodes the original memories from the compressed memories and calculates the loss. The attention reconstruction loss computes the multi-headed attention results on the compressed memory and on the uncompressed memory and takes the mean squared error between them. We have implemented the latter here since it gives better results.</p>\n<p>This implementation uses pre-layer normalization, while the paper uses post-layer normalization. Pre-layer norm applies layer normalization before the <a href=\"../feedforward.html\">FFN</a> and self-attention, and the pass-through in the residual connection is not normalized. This is supposed to be more stable in standard transformer setups.</p>\n<p>Here are <a href=\"https://nn.labml.ai/transformers/compressive/experiment.html\">the training code</a> and a notebook for training a compressive transformer model on the Tiny Shakespeare dataset.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/transformers/compressive/experiment.ipynb\"><span translate=no>_^_7_^_</span></a></p>\n",
"Compressive Transformer": "Compressive Transformer"
}
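As a companion to the compression-operation description above, here is a minimal PyTorch sketch of a per-layer 1D-convolution compression module. The class name, tensor layout ([seq_len, batch, d_model]), and constructor arguments are illustrative assumptions, not a verbatim copy of the repository's code.

```python
import torch
from torch import nn


class Conv1dCompression(nn.Module):
    """Sketch of the per-layer compression function f_c: a 1D convolution
    that maps the oldest memories to memories compressed by a factor c."""

    def __init__(self, compression_rate: int, d_model: int):
        super().__init__()
        # kernel_size == stride == c, so each group of c consecutive
        # memories is mixed into exactly one compressed memory
        # (non-overlapping windows).
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate,
                              stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: [seq_len, batch, d_model]
        mem = mem.permute(1, 2, 0)      # -> [batch, d_model, seq_len]
        c_mem = self.conv(mem)          # -> [batch, d_model, seq_len // c]
        return c_mem.permute(2, 0, 1)   # -> [seq_len // c, batch, d_model]
```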
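And a simplified sketch of the attention reconstruction loss described above: the mean squared error between attention computed over the original memories and attention computed over their compressed counterparts. The `attn` callable and its `(query, key, value)` signature are assumptions for illustration, and the sketch deliberately skips the detail that, in practice, this loss is kept from updating the transformer's own attention parameters (only the compression function is trained by it).

```python
import torch
import torch.nn.functional as F


def attention_reconstruction_loss(attn, h: torch.Tensor,
                                  mem: torch.Tensor,
                                  c_mem: torch.Tensor) -> torch.Tensor:
    """MSE between attention over uncompressed memories and attention
    over their compressed versions.

    attn:  any callable computing attention output from (query, key, value)
    h:     current hidden states used as queries, [seq, batch, d_model]
    mem:   the oldest (uncompressed) memories,    [n_m, batch, d_model]
    c_mem: their compressed versions,             [n_m // c, batch, d_model]
    """
    # Attention over the uncompressed memories is the reconstruction
    # target, so it is detached from the graph.
    target = attn(h, mem, mem).detach()
    # Attention over the compressed memories; gradients flow back into
    # the compression function through c_mem.
    pred = attn(h, c_mem, c_mem)
    return F.mse_loss(pred, target)
```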