arxiv.org links

This commit is contained in:
Varuna Jayasiri
2023-10-24 14:42:32 +01:00
parent 1159ecfc63
commit 9a42ac2697
238 changed files with 354 additions and 353 deletions


@@ -72,7 +72,7 @@
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/transformer/vit/index.html">Vision Transformer (ViT)</a></h1>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://arxiv.org/abs/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
<p>The vision transformer applies a pure transformer to images, without any convolution layers. It splits the image into patches and applies a transformer to the patch embeddings. <a href="https://nn.labml.ai/transformer/vit/index.html#PathEmbeddings">Patch embeddings</a> are generated by applying a simple linear transformation to the flattened pixel values of each patch. A standard transformer encoder is then fed the patch embeddings, along with a classification token <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
. The encoding of the <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
token is used to classify the image with an MLP.</p>
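
The paragraph above summarizes the ViT pipeline: patchify, linearly project, prepend a `[CLS]` token, encode, and classify from the `[CLS]` encoding. Below is a minimal PyTorch sketch of that pipeline; it is not the repository's implementation, and all names and hyperparameters (`TinyViT`, `d_model=64`, `patch_size=4`, two encoder layers, a 32x32 input) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PatchEmbeddings(nn.Module):
    """Split the image into patches and linearly project the flattened pixels."""

    def __init__(self, d_model: int = 64, patch_size: int = 4, in_channels: int = 3):
        super().__init__()
        # A conv with kernel_size == stride == patch_size is equivalent to
        # flattening each patch and applying a shared linear transformation.
        self.proj = nn.Conv2d(in_channels, d_model, patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                    # [batch, d_model, h / p, w / p]
        return x.flatten(2).permute(2, 0, 1)  # [patches, batch, d_model]


class TinyViT(nn.Module):
    """Illustrative ViT-style classifier: patch embeddings + [CLS] token + encoder + MLP head."""

    def __init__(self, d_model: int = 64, patch_size: int = 4, n_classes: int = 10):
        super().__init__()
        self.patch_emb = PatchEmbeddings(d_model, patch_size)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))
        # Learned positional embeddings for 1 + 64 tokens (8x8 patches of a 32x32 image).
        self.pos_emb = nn.Parameter(torch.randn(65, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=128)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.patch_emb(x)                            # [patches, batch, d_model]
        cls = self.cls_token.expand(-1, x.shape[1], -1)  # [1, batch, d_model]
        x = torch.cat([cls, x], dim=0) + self.pos_emb
        x = self.encoder(x)
        # Classify the image from the encoding of the [CLS] token.
        return self.head(x[0])


# Quick shape check with a CIFAR-sized batch.
logits = TinyViT()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```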