arxiv.org links

Varuna Jayasiri
2023-10-24 14:42:32 +01:00
parent 1159ecfc63
commit 9a42ac2697
238 changed files with 354 additions and 353 deletions


@@ -71,7 +71,7 @@
<a href='#section-0'>#</a>
</div>
<h1><a href="https://nn.labml.ai/distillation/index.html">Distilling the Knowledge in a Neural Network</a></h1>
-<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://papers.labml.ai/paper/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
+<p>This is a <a href="https://pytorch.org">PyTorch</a> implementation/tutorial of the paper <a href="https://arxiv.org/abs/1503.02531">Distilling the Knowledge in a Neural Network</a>.</p>
<p>It&#x27;s a way of training a small network using the knowledge in a trained larger network; i.e. distilling the knowledge from the large network.</p>
<p>A large model with regularization, or an ensemble of models (using dropout), generalizes better than a small model when trained directly on the data and labels. However, a small model can be trained to generalize better with the help of a large model. Smaller models are better in production: they are faster and need less compute and memory.</p>
<p>The output probabilities of a trained model carry more information than the labels because the model assigns non-zero probabilities to incorrect classes as well. These probabilities tell us how likely a sample is to belong to each class. For instance, when classifying digits, given an image of the digit <em>7</em>, a well-generalized model will assign a high probability to 7 and a small but non-zero probability to 2, while assigning almost zero probability to the other digits. Distillation uses this information to train a small model better.</p>
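As a concrete illustration of how those soft probabilities are used, below is a minimal sketch of a distillation loss in PyTorch. It is not taken from the annotated implementation: the function name `distillation_loss` and the `temperature` and `alpha` defaults are illustrative assumptions. The teacher's logits are softened with a temperature, the student is trained to match them with a KL-divergence term, and an ordinary cross-entropy term on the true labels is mixed in.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=5.0, alpha=0.5):
    # Hypothetical helper, for illustration only; hyper-parameter values
    # are assumptions, not those of the annotated implementation.

    # Soften both distributions with the same temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the teacher's and student's soft distributions,
    # scaled by T^2 to keep its gradient magnitude comparable to the hard loss.
    soft_loss = F.kl_div(log_probs, soft_targets,
                         reduction='batchmean') * temperature ** 2

    # Ordinary cross-entropy with the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Here `teacher_logits` would come from the frozen large network and `student_logits` from the small network being trained; only the student's parameters receive gradients.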