arxiv.org links

This commit is contained in:
Varuna Jayasiri
2023-10-24 14:42:32 +01:00
parent 1159ecfc63
commit 9a42ac2697
238 changed files with 354 additions and 353 deletions


@@ -72,7 +72,7 @@
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/transformer/vit/index.html">Vision Transformer (ViT)</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://arxiv.org/abs/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
<p>The vision transformer applies a pure transformer to images, without any convolution layers. It splits the image into patches and applies a transformer to the patch embeddings. <a href="https://nn.labml.ai/transformer/vit/index.html#PathEmbeddings">Patch embeddings</a> are generated by applying a simple linear transformation to the flattened pixel values of each patch. A standard transformer encoder is then fed the patch embeddings, along with a classification token <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
. The encoding of the <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
token is used to classify the image with an MLP.</p>
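
The paragraph above summarizes the ViT pipeline: patchify, linearly project, prepend a `[CLS]` token, encode, and classify from the `[CLS]` encoding. Below is a minimal PyTorch sketch of that pipeline; it is not the repository's implementation, and all names and hyperparameters (`TinyViT`, `d_model=64`, `patch_size=4`, two encoder layers, a 32x32 input) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PatchEmbeddings(nn.Module):
    """Split the image into patches and linearly project the flattened pixels."""

    def __init__(self, d_model: int = 64, patch_size: int = 4, in_channels: int = 3):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear transformation.
        self.proj = nn.Conv2d(in_channels, d_model, patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                    # [batch, d_model, h / p, w / p]
        return x.flatten(2).permute(2, 0, 1)  # [patches, batch, d_model]


class TinyViT(nn.Module):
    """Illustrative ViT-style classifier: patch embeddings + [CLS] token + encoder + MLP head."""

    def __init__(self, d_model: int = 64, patch_size: int = 4, n_classes: int = 10):
        super().__init__()
        self.patch_emb = PatchEmbeddings(d_model, patch_size)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        # Learned positional embeddings for 1 + 64 tokens (8x8 patches of a 32x32 image).
        self.pos_emb = nn.Parameter(torch.randn(65, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.patch_emb(x)                            # [patches, batch, d_model]
        cls = self.cls_token.expand(-1, x.shape[1], -1)  # [1, batch, d_model]
        x = torch.cat([cls, x], dim=0) + self.pos_emb
        x = self.encoder(x)
        # Classify the image from the encoding of the [CLS] token.
        return self.head(x[0])


# Quick shape check with a CIFAR-sized batch.
logits = TinyViT()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```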