Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git (synced 2025-11-01 20:28:41 +08:00)
arxiv.org links
@@ -72,7 +72,7 @@
 <a href='#section-0'>#</a>
 </div>
 <h1><a href="https://nn.labml.ai/transformer/vit/index.html">Vision Transformer (ViT)</a></h1>
-<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://papers.labml.ai/paper/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
+<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of the paper <a href="https://arxiv.org/abs/2010.11929">An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale</a>.</p>
 <p>Vision transformer applies a pure transformer to images without any convolution layers. They split the image into patches and apply a transformer on patch embeddings. <a href="https://nn.labml.ai/transformer/vit/index.html#PathEmbeddings">Patch embeddings</a> are generated by applying a simple linear transformation to the flattened pixel values of the patch. Then a standard transformer encoder is fed with the patch embeddings, along with a classification token <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
 . The encoding on the <code class="highlight"><span></span><span class="p">[</span><span class="n">CLS</span><span class="p">]</span></code>
 token is used to classify the image with an MLP.</p>
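For context on what the changed documentation describes, here is a minimal, self-contained PyTorch sketch of the idea: split the image into patches, embed each flattened patch with a simple linear map, prepend a learnable [CLS] token, run a standard transformer encoder, and classify from the [CLS] encoding with a linear head. The class name TinyViT and all hyperparameters are illustrative assumptions, not the labml.ai implementation linked above.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative sketch of a Vision Transformer; not the labml.ai code."""

    def __init__(self, image_size=32, patch_size=8, in_channels=3,
                 d_model=64, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        assert image_size % patch_size == 0
        n_patches = (image_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size

        self.patch_size = patch_size
        # Patch embedding: a simple linear transformation of flattened pixel values.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Learnable [CLS] token and positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)  # classification head on [CLS]

    def forward(self, x):                        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        p = self.patch_size
        # Split into non-overlapping p x p patches and flatten each patch.
        x = x.unfold(2, p, p).unfold(3, p, p)    # (b, c, h/p, w/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        x = self.patch_embed(x)                  # (b, n_patches, d_model)

        cls = self.cls_token.expand(b, -1, -1)   # prepend the [CLS] token
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                # classify from the [CLS] encoding

logits = TinyViT()(torch.randn(2, 3, 32, 32))    # -> shape (2, 10)
```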