diff --git a/translate_cache/transformers/aft/__init__.zh.json b/translate_cache/transformers/aft/__init__.zh.json
index 61c69414..779f42de 100644
--- a/translate_cache/transformers/aft/__init__.zh.json
+++ b/translate_cache/transformers/aft/__init__.zh.json
@@ -24,6 +24,6 @@
 "
We subtract _^_0_^_ and _^_1_^_ before calculating the exponents to stabilize the softmax calculation.
\nIf _^_2_^_ is large, _^_3_^_ becomes huge and the computation of _^_4_^_ becomes unstable. Subtracting a constant from both the numerator and the denominator before calculating the exponents cancels out and helps stabilize the computation. So we subtract _^_5_^_ to stabilize the computation.
\n": "\u6211\u4eec\u5728\u8ba1\u7b97\u6307\u6570_^_1_^_\u4e4b\u524d\u51cf\u53bb_^_0_^_\u548c\uff0c\u4ee5\u7a33\u5b9asoftmax\u7684\u8ba1\u7b97\u3002
\n_^_2_^_if \u5927_^_3_^_\u53d8\u5927\uff0c\u8ba1\u7b97_^_4_^_\u53d8\u5f97\u4e0d\u7a33\u5b9a\u3002\u5728\u8ba1\u7b97\u5206\u5b50\u548c\u5206\u6bcd\u7684\u6307\u6570\u4e4b\u524d\u51cf\u53bb\u4e00\u4e2a\u5e38\u6570\u5c06\u62b5\u6d88\u3002\u5e76\u4e14\u53ef\u4ee5\u5e2e\u52a9\u7a33\u5b9a\u8ba1\u7b97\u3002\u6240\u4ee5\u6211\u4eec\u51cf\u53bb_^_5_^_\u4ee5\u7a33\u5b9a\u8ba1\u7b97\u3002
\n", "_^_0_^_We compute _^_1_^_, _^_2_^_ and _^_3_^_ separately and do a matrix multiplication. We use einsum for clarity.
\n": "_^_0_^_\u6211\u4eec_^_3_^_\u5206\u522b\u8ba1\u7b97_^_1_^_\uff0c_^_2_^_\u7136\u540e\u8fdb\u884c\u77e9\u9635\u4e58\u6cd5\u3002\u4e3a\u4e86\u6e05\u695a\u8d77\u89c1\uff0c\u6211\u4eec\u4f7f\u7528 einsum\u3002
\n", "This is a PyTorch implementation of the paper An Attention Free Transformer.
\nThis paper replaces the self-attention layer with a new efficient operation that has memory complexity of O(Td), where T is the sequence length and _^_0_^_ is the dimensionality of embeddings.
\nThe paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which pays attention to nearby tokens in an autoregressive model.
\n": "\u8fd9\u662f PyTorch \u5bf9\u300a\u65e0\u6ce8\u610f\u529b\u7684\u53d8\u5f62\u91d1\u521a\u300b\u4e00\u6587\u7684\u5b9e\u73b0\u3002
\n\u672c\u6587\u7528\u4e00\u79cd\u65b0\u7684\u9ad8\u6548\u8fd0\u7b97\u53d6\u4ee3\u4e86\u81ea\u6211\u6ce8\u610f\u529b\u5c42\uff0c\u8be5\u8fd0\u7b97\u7684\u5b58\u50a8\u590d\u6742\u5ea6\u4e3aO\uff08Td\uff09\uff0c\u5176\u4e2d T \u662f\u5e8f\u5217\u957f\u5ea6\uff0c_^_0_^_\u662f\u5d4c\u5165\u7684\u7ef4\u5ea6\u3002
\n\u672c\u6587\u4ecb\u7ecd\u4e86 AFT \u4ee5\u53ca AFT-Local \u548c AFT-conv\u3002\u8fd9\u91cc\u6211\u4eec\u5b9e\u73b0\u4e86 aft-Local\uff0c\u5b83\u5173\u6ce8\u81ea\u56de\u5f52\u6a21\u578b\u4e2d\u7684 cloby \u4ee3\u5e01\u3002
\n", - "An Attention Free Transformer": "\u514d\u6ce8\u610f\u7684\u53d8\u538b\u5668" + "This is a PyTorch implementation of the paper An Attention Free Transformer.
\nThis paper replaces the self-attention layer with a new efficient operation that has memory complexity of O(Td), where T is the sequence length and _^_0_^_ is the dimensionality of embeddings.
\nThe paper introduces AFT along with AFT-local and AFT-conv. Here we have implemented AFT-local, which pays attention to nearby tokens in an autoregressive model.
\n": "\u8fd9\u662f\u8bba\u6587 \u300a\u4e00\u79cd\u65e0\u6ce8\u610f\u529b\u7684 Transformer \u300b\u7684PyTorch \u5b9e\u73b0\u3002
\n\u8fd9\u7bc7\u8bba\u6587\u7528\u4e00\u79cd\u65b0\u7684\u9ad8\u6548\u64cd\u4f5c\u66ff\u4ee3\u4e86\u81ea\u6ce8\u610f\u529b\u5c42\uff0c\u8be5\u8fd0\u7b97\u7684\u5b58\u50a8\u590d\u6742\u5ea6\u4e3aO\uff08Td\uff09\uff0c\u5176\u4e2d T \u662f\u5e8f\u5217\u957f\u5ea6\uff0c_^_0_^_\u662f\u5d4c\u5165\u7684\u7ef4\u5ea6\u3002
\n\u8be5\u8bba\u6587\u4ecb\u7ecd\u4e86 AFT \u4ee5\u53ca AFT-local \u548c AFT-conv \u3002\u8fd9\u91cc\u6211\u4eec\u5b9e\u73b0\u4e86 AFT-local \uff0c\u5b83\u4f1a\u5728\u81ea\u56de\u5f52\u6a21\u578b\u4e2d\u5173\u6ce8\u90bb\u8fd1\u7684 token \u3002
\n", + "An Attention Free Transformer": "\u4e00\u79cd\u65e0\u6ce8\u610f\u529b\u7684 Transformer" } \ No newline at end of file