Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git (synced 2025-08-26 08:41:23 +08:00)

Commit: papers list
@@ -69,7 +69,7 @@
 </div>
 <h1>Switch Transformer</h1>
 <p>This is a miniature <a href="https://pytorch.org">PyTorch</a> implementation of the paper
-<a href="https://arxiv.org/abs/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
+<a href="https://papers.labml.ai/paper/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
 Our implementation only has a few million parameters and doesn’t do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.</p>
 <p>The Switch Transformer uses different parameters for each token by switching among parameters

@@ -69,7 +69,7 @@
 </div>
 <h1><a href="https://nn.labml.ai/transformers/switch/index.html">Switch Transformer</a></h1>
 <p>This is a miniature <a href="https://pytorch.org">PyTorch</a> implementation of the paper
-<a href="https://arxiv.org/abs/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
+<a href="https://papers.labml.ai/paper/2101.03961">Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity</a>.
 Our implementation only has a few million parameters and doesn’t do model parallel distributed training.
 It does single GPU training, but we implement the concept of switching as described in the paper.</p>
 <p>The Switch Transformer uses different parameters for each token by switching among parameters
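
The documentation text in the diff above refers to "the concept of switching": each token is routed to a single expert feed-forward network chosen by a learned router, so different tokens use different parameters. The sketch below is an illustrative, minimal PyTorch version of that top-1 routing idea only; the class name SwitchFFNSketch and all parameter names are invented for this example, and it omits details of the actual implementation linked above (e.g. load balancing and capacity limits).

# A minimal sketch (not the repository's code) of top-1 "switch" routing:
# each token is sent to a single expert FFN chosen by a learned router.
import torch
import torch.nn as nn

class SwitchFFNSketch(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # One position-wise feed-forward network per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Router that produces a probability per expert for each token
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, d_model]; flatten to a list of tokens
        seq_len, batch_size, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        probs = torch.softmax(self.router(tokens), dim=-1)
        # Top-1 routing: each token goes to its highest-probability expert
        gate, expert_idx = probs.max(dim=-1)
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(tokens[mask])
        # Scale by the gate value so the router receives gradients
        out = out * gate.unsqueeze(-1)
        return out.reshape(seq_len, batch_size, d_model)


if __name__ == "__main__":
    layer = SwitchFFNSketch(d_model=64, d_ff=256, n_experts=4)
    y = layer(torch.randn(10, 2, 64))
    print(y.shape)  # torch.Size([10, 2, 64])

This keeps the layer's input and output shapes identical to a standard transformer FFN block, which is why switching can be dropped into single-GPU training as the documentation describes.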