✍️ typos

Varuna Jayasiri
2021-02-12 14:55:40 +05:30
parent 948f473ee6
commit 56a460a243
7 changed files with 67 additions and 68 deletions

View File

@ -77,18 +77,18 @@
using <a href="https://pytorch.org">PyTorch</a>.
<a href="https://blog.otoro.net/2016/09/28/hyper-networks/">This blog post</a>
by David Ha gives a good explanation of HyperNetworks.</p>
<p>We have an experiment that trains a HyperLSTM to predict text on Shakespear dataset.
<p>We have an experiment that trains a HyperLSTM to predict text on Shakespeare dataset.
Here&rsquo;s the link to code: <a href="experiment.html"><code>experiment.py</code></a></p>
<p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/hypernetworks/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<a href="https://web.lab-ml.com/run?uuid=9e7f39e047e811ebbaff2b26e3148b3d"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
<p>HyperNetworks use a smaller network to generate weights of a larger network.
There are two variants: static hyper-networks and dynamic hyper-networks.
Static HyperNetworks have smaller network that generates weights (kernels)
Static HyperNetworks have smaller networks that generate weights (kernels)
of a convolutional network. Dynamic HyperNetworks generate parameters of a
recurrent neural network
for each step. This is an implementation of the latter.</p>
<h2>Dynamic HyperNetworks</h2>
<p>In an RNN the parameters stay constant for each step.
<p>In a RNN the parameters stay constant for each step.
Dynamic HyperNetworks generate different parameters for each step.
HyperLSTM has the structure of a LSTM but the parameters of
each step are changed by a smaller LSTM network.</p>
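To make this concrete, below is a minimal, hedged sketch of the dynamic-hypernetwork idea (all names and sizes are illustrative, not from the paper's code): a small recurrent cell runs alongside the main one, emits an embedding $z_t$ at every step, and that embedding is mapped to the main cell's recurrent weights for that step.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the paper)
hidden_size, hyper_size, n_z = 16, 8, 4

small_rnn = nn.RNNCell(hidden_size, hyper_size)      # the smaller (hyper) network
to_z = nn.Linear(hyper_size, n_z)                    # its output -> embedding z_t
z_to_w = nn.Linear(n_z, hidden_size * hidden_size)   # z_t -> step-specific recurrent weights

h_hat = torch.zeros(1, hyper_size)                   # state of the smaller network
h = torch.zeros(1, hidden_size)                      # state of the main network
for x_t in torch.randn(5, 1, hidden_size):           # five steps of (already projected) input
    h_hat = small_rnn(h, h_hat)                      # the smaller network watches the main state
    z_t = to_z(h_hat)
    W_h = z_to_w(z_t).view(hidden_size, hidden_size)  # weights generated for this step
    h = torch.tanh(x_t + h @ W_h.t())                # main RNN step with the generated weights
```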
@ -101,7 +101,7 @@ $W_{hz}$ is a 3-d tensor parameter and $\langle . \rangle$ is a tensor-vector mu
$z_h$ is usually a linear transformation of the output of the smaller recurrent network.</p>
<h3>Weight scaling instead of computing</h3>
<p>Large recurrent networks have large dynamically computed parameters.
These are calculated using a linear transformation of feature vector $z$.
These are calculated using linear transformation of feature vector $z$.
And this transformation requires an even larger weight tensor.
That is, when $\color{cyan}{W_h}$ has shape $N_h \times N_h$,
$W_{hz}$ will be $N_h \times N_h \times N_z$.</p>
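A quick back-of-the-envelope check with illustrative numbers (not from the paper) shows why the full 3-d tensor is impractical, and why scaling the rows of a single static $\color{cyan}{W_h}$ with a vector $d(z) = W_{hz} z$ of size $N_h$ is so much cheaper:

```python
n_h, n_z = 512, 32   # illustrative sizes

# Generating the full recurrent weight matrix from z needs a 3-d tensor:
full = n_h * n_h * n_z           # W_hz: N_h x N_h x N_z -> 8,388,608 parameters

# Row scaling keeps one static N_h x N_h matrix plus an N_h x N_z map z -> d:
scaled = n_h * n_h + n_h * n_z   # -> 278,528 parameters

print(full, scaled)
```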
@ -140,7 +140,7 @@ where $\odot$ stands for element-wise multiplication.</p>
<a href='#section-1'>#</a>
</div>
<h2>HyperLSTM Cell</h2>
<p>For HyperLSTM the smaller network and the larger networks both have the LSTM structure.
<p>For HyperLSTM the smaller network and the larger network both have the LSTM structure.
This is defined in Appendix A.2.2 in the paper.</p>
</div>
<div class='code'>
@ -156,10 +156,10 @@ This is defined in Appendix A.2.2 in the paper.</p>
<code>hidden_size</code> is the size of the LSTM, and
<code>hyper_size</code> is the size of the smaller LSTM that alters the weights of the larger outer LSTM.
<code>n_z</code> is the size of the feature vectors used to alter the LSTM weights.</p>
<p>We use the output of the smaller LSTM to computer $z_h^{i,f,g,o}$, $z_x^{i,f,g,o}$ and
<p>We use the output of the smaller LSTM to compute $z_h^{i,f,g,o}$, $z_x^{i,f,g,o}$ and
$z_b^{i,f,g,o}$ using linear transformations.
We calculate $d_h^{i,f,g,o}(z_h^{i,f,g,o})$, $d_x^{i,f,g,o}(z_x^{i,f,g,o})$, and
$d_b^{i,f,g,o}(z_b^{i,f,g,o})$ from these again using linear transformations.
$d_b^{i,f,g,o}(z_b^{i,f,g,o})$ from these, using linear transformations again.
These are then used to scale the rows of weight and bias tensors of the main LSTM.</p>
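As a hedged sketch of what this means for a single gate (names and sizes here are illustrative, not the module's actual attributes): the smaller LSTM's output $\hat{h}_t$ is mapped to $z_h^i$, which is mapped to $d_h^i$, which then scales the rows of the static weight $W_h^i$.

```python
import torch
import torch.nn as nn

hidden_size, hyper_size, n_z = 16, 8, 4          # illustrative sizes

z_lin = nn.Linear(hyper_size, n_z)               # \hat{h}_t -> z_h^i
d_lin = nn.Linear(n_z, hidden_size, bias=False)  # z_h^i -> d_h^i
w_h_i = torch.randn(hidden_size, hidden_size)    # static weight of the i gate

h_hat = torch.randn(1, hyper_size)               # output of the smaller LSTM
h = torch.randn(1, hidden_size)                  # previous hidden state of the main LSTM

d = d_lin(z_lin(h_hat))                          # [1, hidden_size]
i_h = (h @ w_h_i.t()) * d                        # equivalent to scaling row k of W_h^i by d_k
```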
<p>📝 Since the computations of $z$ and $d$ are two sequential linear transformations,
these can be combined into a single linear transformation.
@ -186,7 +186,7 @@ in the paper.</p>
<div class='section-link'>
<a href='#section-4'>#</a>
</div>
<p>The input to the hyper lstm is
<p>The input to the hyperLSTM is
<script type="math/tex; mode=display">
\hat{x}_t = \begin{pmatrix}
h_{t-1} \\
@ -195,7 +195,7 @@ x_t
</script>
where $x_t$ is the input and $h_{t-1}$ is the output of the outer LSTM at the previous step.
So the input size is <code>hidden_size + input_size</code>.</p>
<p>The output of hyper lstm is $\hat{h}_t$ and $\hat{c}_t$.</p>
<p>The output of hyperLSTM is $\hat{h}_t$ and $\hat{c}_t$.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">119</span> <span class="bp">self</span><span class="o">.</span><span class="n">hyper</span> <span class="o">=</span> <span class="n">LSTMCell</span><span class="p">(</span><span class="n">hidden_size</span> <span class="o">+</span> <span class="n">input_size</span><span class="p">,</span> <span class="n">hyper_size</span><span class="p">,</span> <span class="n">layer_norm</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span></pre></div>
@ -210,7 +210,7 @@ So the input size is <code>hidden_size + input_size</code>.</p>
<script type="math/tex; mode=display">z_h^{i,f,g,o} = lin_{h}^{i,f,g,o}(\hat{h}_t)</script>
🤔 In the paper it was specified as
<script type="math/tex; mode=display">z_h^{i,f,g,o} = lin_{h}^{i,f,g,o}(\hat{h}_{\color{red}{t-1}})</script>
I feel that&rsquo;s a typo.</p>
I feel that it&rsquo;s a typo.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">125</span> <span class="bp">self</span><span class="o">.</span><span class="n">z_h</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hyper_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">n_z</span><span class="p">)</span></pre></div>

View File

@ -96,8 +96,7 @@ In the update, some features of $c$ are cleared with a forget gate $f$,
and some features $i$ are added through a gate $g$.</p>
<p>The new short term memory is the $\tanh$ of the long-term memory
multiplied by the output gate $o$.</p>
<p>Note that the cell doesn&rsquo;t look at long term memory $c$ when doing the update
for the update. It only modifies it.
<p>Note that the cell doesn&rsquo;t look at long term memory $c$ when doing the update. It only modifies it.
Also $c$ never goes through a linear transformation.
This is what solves vanishing and exploding gradients.</p>
<p>Here&rsquo;s the update rule.</p>
@ -131,8 +130,8 @@ o_t &= lin_x^o(x_t) + lin_h^o(h_{t-1})
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">59</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">layer_norm</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">):</span>
<span class="lineno">60</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span></pre></div>
<div class="highlight"><pre><span class="lineno">58</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">layer_norm</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="kc">False</span><span class="p">):</span>
<span class="lineno">59</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span></pre></div>
</div>
</div>
<div class='section' id='section-3'>
@ -155,7 +154,7 @@ One of them doesn&rsquo;t need a bias since we add the transformations.</p>
<p>This combines $lin_x^i$, $lin_x^f$, $lin_x^g$, and $lin_x^o$ transformations.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">66</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_lin</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">hidden_size</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">65</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_lin</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">hidden_size</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-5'>
@ -166,7 +165,7 @@ One of them doesn&rsquo;t need a bias since we add the transformations.</p>
<p>This combines $lin_h^i$, $lin_h^f$, $lin_h^g$, and $lin_h^o$ transformations.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">68</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_lin</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">hidden_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">67</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_lin</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">hidden_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-6'>
@ -180,12 +179,12 @@ $i$, $f$, $g$ and $o$ embeddings are normalized and $c_t$ is normalized in
$h_t = o_t \odot \tanh(\mathop{LN}(c_t))$</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">75</span> <span class="k">if</span> <span class="n">layer_norm</span><span class="p">:</span>
<span class="lineno">76</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)])</span>
<span class="lineno">77</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">)</span>
<span class="lineno">78</span> <span class="k">else</span><span class="p">:</span>
<span class="lineno">79</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)])</span>
<span class="lineno">80</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span></pre></div>
<div class="highlight"><pre><span class="lineno">74</span> <span class="k">if</span> <span class="n">layer_norm</span><span class="p">:</span>
<span class="lineno">75</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)])</span>
<span class="lineno">76</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">)</span>
<span class="lineno">77</span> <span class="k">else</span><span class="p">:</span>
<span class="lineno">78</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)])</span>
<span class="lineno">79</span> <span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Identity</span><span class="p">()</span></pre></div>
</div>
</div>
<div class='section' id='section-7'>
@ -196,7 +195,7 @@ $h_t = o_t \odot \tanh(\mathop{LN}(c_t))$</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">82</span> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">h</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">c</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">81</span> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">h</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">c</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-8'>
@ -208,7 +207,7 @@ $h_t = o_t \odot \tanh(\mathop{LN}(c_t))$</p>
using the same linear layers.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">85</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_lin</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_lin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">84</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_lin</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">input_lin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-9'>
@ -219,7 +218,7 @@ using the same linear layers.</p>
<p>Each layer produces an output of 4 times the <code>hidden_size</code> and we split them</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">87</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="n">ifgo</span><span class="o">.</span><span class="n">chunk</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">86</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="n">ifgo</span><span class="o">.</span><span class="n">chunk</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-10'>
@ -230,7 +229,7 @@ using the same linear layers.</p>
<p>Apply layer normalization (not in original paper, but gives better results)</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">90</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span><span class="p">[</span><span class="n">i</span><span class="p">](</span><span class="n">ifgo</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span></pre></div>
<div class="highlight"><pre><span class="lineno">89</span> <span class="n">ifgo</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">layer_norm</span><span class="p">[</span><span class="n">i</span><span class="p">](</span><span class="n">ifgo</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span></pre></div>
</div>
</div>
<div class='section' id='section-11'>
@ -243,7 +242,7 @@ using the same linear layers.</p>
</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">93</span> <span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">o</span> <span class="o">=</span> <span class="n">ifgo</span></pre></div>
<div class="highlight"><pre><span class="lineno">92</span> <span class="n">i</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">o</span> <span class="o">=</span> <span class="n">ifgo</span></pre></div>
</div>
</div>
<div class='section' id='section-12'>
@ -256,7 +255,7 @@ using the same linear layers.</p>
</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">96</span> <span class="n">c_next</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">c</span> <span class="o">+</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">g</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">95</span> <span class="n">c_next</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">f</span><span class="p">)</span> <span class="o">*</span> <span class="n">c</span> <span class="o">+</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">g</span><span class="p">)</span></pre></div>
</div>
</div>
<div class='section' id='section-13'>
@ -269,9 +268,9 @@ using the same linear layers.</p>
Optionally, apply layer norm to $c_t$</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">100</span> <span class="n">h_next</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">o</span><span class="p">)</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span><span class="p">(</span><span class="n">c_next</span><span class="p">))</span>
<span class="lineno">101</span>
<span class="lineno">102</span> <span class="k">return</span> <span class="n">h_next</span><span class="p">,</span> <span class="n">c_next</span></pre></div>
<div class="highlight"><pre><span class="lineno">99</span> <span class="n">h_next</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">o</span><span class="p">)</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">tanh</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layer_norm_c</span><span class="p">(</span><span class="n">c_next</span><span class="p">))</span>
<span class="lineno">100</span>
<span class="lineno">101</span> <span class="k">return</span> <span class="n">h_next</span><span class="p">,</span> <span class="n">c_next</span></pre></div>
</div>
</div>
<div class='section' id='section-14'>
@ -282,7 +281,7 @@ Optionally, apply layer norm to $c_t$</p>
<h2>Multilayer LSTM</h2>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">105</span><span class="k">class</span> <span class="nc">LSTM</span><span class="p">(</span><span class="n">Module</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">104</span><span class="k">class</span> <span class="nc">LSTM</span><span class="p">(</span><span class="n">Module</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-15'>
@ -293,7 +292,7 @@ Optionally, apply layer norm to $c_t$</p>
<p>Create a network of <code>n_layers</code> of LSTM.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">110</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">109</span> <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_layers</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-16'>
@ -304,9 +303,9 @@ Optionally, apply layer norm to $c_t$</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">115</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="lineno">116</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span> <span class="o">=</span> <span class="n">n_layers</span>
<span class="lineno">117</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span> <span class="o">=</span> <span class="n">hidden_size</span></pre></div>
<div class="highlight"><pre><span class="lineno">114</span> <span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="lineno">115</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span> <span class="o">=</span> <span class="n">n_layers</span>
<span class="lineno">116</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span> <span class="o">=</span> <span class="n">hidden_size</span></pre></div>
</div>
</div>
<div class='section' id='section-17'>
@ -318,8 +317,8 @@ Optionally, apply layer norm to $c_t$</p>
Rest of the layers get the input from the layer below</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">120</span> <span class="bp">self</span><span class="o">.</span><span class="n">cells</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)]</span> <span class="o">+</span>
<span class="lineno">121</span> <span class="p">[</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_layers</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)])</span></pre></div>
<div class="highlight"><pre><span class="lineno">119</span> <span class="bp">self</span><span class="o">.</span><span class="n">cells</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">input_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)]</span> <span class="o">+</span>
<span class="lineno">120</span> <span class="p">[</span><span class="n">LSTMCell</span><span class="p">(</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_layers</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)])</span></pre></div>
</div>
</div>
<div class='section' id='section-18'>
@ -331,7 +330,7 @@ Rest of the layers get the input from the layer below</p>
<code>state</code> is a tuple of $h$ and $c$, each with a shape of <code>[n_layers, batch_size, hidden_size]</code>.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">123</span> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">122</span> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Tuple</span><span class="p">[</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">]]</span> <span class="o">=</span> <span class="kc">None</span><span class="p">):</span></pre></div>
</div>
</div>
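A hedged usage sketch for this multilayer LSTM (sizes are made up; <code>LSTM</code> refers to the class defined in this file):

```python
import torch

lstm = LSTM(input_size=10, hidden_size=16, n_layers=2)
x = torch.randn(7, 3, 10)       # [n_steps, batch_size, input_size]
out, (h, c) = lstm(x)           # out: [7, 3, 16]; h and c: [n_layers, 3, 16]
```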
<div class='section' id='section-19'>
@ -342,7 +341,7 @@ Rest of the layers get the input from the layer below</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">128</span> <span class="n">n_steps</span><span class="p">,</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span></pre></div>
<div class="highlight"><pre><span class="lineno">127</span> <span class="n">n_steps</span><span class="p">,</span> <span class="n">batch_size</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span></pre></div>
</div>
</div>
<div class='section' id='section-20'>
@ -353,11 +352,11 @@ Rest of the layers get the input from the layer below</p>
<p>Initialize the state if <code>None</code></p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">131</span> <span class="k">if</span> <span class="n">state</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="lineno">132</span> <span class="n">h</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">)]</span>
<span class="lineno">133</span> <span class="n">c</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">)]</span>
<span class="lineno">134</span> <span class="k">else</span><span class="p">:</span>
<span class="lineno">135</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span></pre></div>
<div class="highlight"><pre><span class="lineno">130</span> <span class="k">if</span> <span class="n">state</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="lineno">131</span> <span class="n">h</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">)]</span>
<span class="lineno">132</span> <span class="n">c</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">new_zeros</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">hidden_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">)]</span>
<span class="lineno">133</span> <span class="k">else</span><span class="p">:</span>
<span class="lineno">134</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span> <span class="o">=</span> <span class="n">state</span></pre></div>
</div>
</div>
<div class='section' id='section-21'>
@ -369,7 +368,7 @@ Rest of the layers get the input from the layer below</p>
📝 You can just work with the tensor itself but this is easier to debug</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">138</span> <span class="n">h</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">unbind</span><span class="p">(</span><span class="n">h</span><span class="p">)),</span> <span class="nb">list</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">unbind</span><span class="p">(</span><span class="n">c</span><span class="p">))</span></pre></div>
<div class="highlight"><pre><span class="lineno">137</span> <span class="n">h</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">unbind</span><span class="p">(</span><span class="n">h</span><span class="p">)),</span> <span class="nb">list</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">unbind</span><span class="p">(</span><span class="n">c</span><span class="p">))</span></pre></div>
</div>
</div>
<div class='section' id='section-22'>
@ -380,8 +379,8 @@ Rest of the layers get the input from the layer below</p>
<p>Array to collect the outputs of the final layer at each time step.</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">141</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span>
<span class="lineno">142</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_steps</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">140</span> <span class="n">out</span> <span class="o">=</span> <span class="p">[]</span>
<span class="lineno">141</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_steps</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-23'>
@ -392,7 +391,7 @@ Rest of the layers get the input from the layer below</p>
<p>Input to the first layer is the input itself</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">144</span> <span class="n">inp</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">t</span><span class="p">]</span></pre></div>
<div class="highlight"><pre><span class="lineno">143</span> <span class="n">inp</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="n">t</span><span class="p">]</span></pre></div>
</div>
</div>
<div class='section' id='section-24'>
@ -403,7 +402,7 @@ Rest of the layers get the input from the layer below</p>
<p>Loop through the layers</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">146</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">):</span></pre></div>
<div class="highlight"><pre><span class="lineno">145</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">n_layers</span><span class="p">):</span></pre></div>
</div>
</div>
<div class='section' id='section-25'>
@ -414,7 +413,7 @@ Rest of the layers get the input from the layer below</p>
<p>Get the state of the layer</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">148</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cells</span><span class="p">[</span><span class="n">layer</span><span class="p">](</span><span class="n">inp</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer</span><span class="p">])</span></pre></div>
<div class="highlight"><pre><span class="lineno">147</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cells</span><span class="p">[</span><span class="n">layer</span><span class="p">](</span><span class="n">inp</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">],</span> <span class="n">c</span><span class="p">[</span><span class="n">layer</span><span class="p">])</span></pre></div>
</div>
</div>
<div class='section' id='section-26'>
@ -425,7 +424,7 @@ Rest of the layers get the input from the layer below</p>
<p>Input to the next layer is the state of this layer</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">150</span> <span class="n">inp</span> <span class="o">=</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">]</span></pre></div>
<div class="highlight"><pre><span class="lineno">149</span> <span class="n">inp</span> <span class="o">=</span> <span class="n">h</span><span class="p">[</span><span class="n">layer</span><span class="p">]</span></pre></div>
</div>
</div>
<div class='section' id='section-27'>
@ -436,7 +435,7 @@ Rest of the layers get the input from the layer below</p>
<p>Collect the output $h$ of the final layer</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">152</span> <span class="n">out</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span></pre></div>
<div class="highlight"><pre><span class="lineno">151</span> <span class="n">out</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">h</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span></pre></div>
</div>
</div>
<div class='section' id='section-28'>
@ -447,11 +446,11 @@ Rest of the layers get the input from the layer below</p>
<p>Stack the outputs and states</p>
</div>
<div class='code'>
<div class="highlight"><pre><span class="lineno">155</span> <span class="n">out</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
<span class="lineno">156</span> <span class="n">h</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="lineno">157</span> <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="lineno">158</span>
<span class="lineno">159</span> <span class="k">return</span> <span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span></pre></div>
<div class="highlight"><pre><span class="lineno">154</span> <span class="n">out</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
<span class="lineno">155</span> <span class="n">h</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
<span class="lineno">156</span> <span class="n">c</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="lineno">157</span>
<span class="lineno">158</span> <span class="k">return</span> <span class="n">out</span><span class="p">,</span> <span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span></pre></div>
</div>
</div>
</div>

View File

@ -81,12 +81,12 @@
<li>Tricky for RNNs. Do you need different normalizations for each step?</li>
<li>Doesn&rsquo;t work with small batch sizes;
large NLP models are usually trained with small batch sizes.</li>
<li>Need to compute means and variances across devices in distributed training</li>
<li>Need to compute means and variances across devices in distributed training.</li>
</ul>
<h2>Layer Normalization</h2>
<p>Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transformers the inputs to have zero mean and unit variance
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
<em>Note that batch normalization fixes the zero mean and unit variance for each element.</em>
Layer normalization does it for each batch across all elements.</p>
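To make the difference concrete, here is a small illustrative sketch (plain tensor arithmetic, no learned gain or bias): layer normalization computes the statistics over the features of each sample, so it is independent of the batch, while batch normalization computes them over the batch for each feature.

```python
import torch

x = torch.randn(4, 10)                        # [batch_size, features]

# Layer normalization: per-sample mean/variance over the feature dimension
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - mean) / torch.sqrt(var + 1e-5)

# Batch normalization would instead normalize each feature across the batch
x_bn = (x - x.mean(dim=0)) / torch.sqrt(x.var(dim=0, unbiased=False) + 1e-5)
```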

View File

@ -81,12 +81,12 @@
<li>Tricky for RNNs. Do you need different normalizations for each step?</li>
<li>Doesn&rsquo;t work with small batch sizes;
large NLP models are usually trained with small batch sizes.</li>
<li>Need to compute means and variances across devices in distributed training</li>
<li>Need to compute means and variances across devices in distributed training.</li>
</ul>
<h2>Layer Normalization</h2>
<p>Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transformers the inputs to have zero mean and unit variance
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
<em>Note that batch normalization fixes the zero mean and unit variance for each element.</em>
Layer normalization does it for each batch across all elements.</p>

View File

@ -78,7 +78,7 @@
<url>
<loc>https://nn.labml.ai/hypernetworks/hyper_lstm.html</loc>
<lastmod>2021-01-30T16:30:00+00:00</lastmod>
<lastmod>2021-02-12T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>
@ -568,7 +568,7 @@
<url>
<loc>https://nn.labml.ai/lstm/index.html</loc>
<lastmod>2021-01-30T16:30:00+00:00</lastmod>
<lastmod>2021-02-11T16:30:00+00:00</lastmod>
<priority>1.00</priority>
</url>

View File

@ -16,13 +16,13 @@ This is a [PyTorch](https://pytorch.org) implementation of
* Tricky for RNNs. Do you need different normalizations for each step?
* Doesn't work with small batch sizes;
large NLP models are usually trained with small batch sizes.
* Need to compute means and variances across devices in distributed training
* Need to compute means and variances across devices in distributed training.
## Layer Normalization
Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transformers the inputs to have zero mean and unit variance
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization fixes the zero mean and unit variance for each element.*
Layer normalization does it for each batch across all elements.

View File

@ -9,13 +9,13 @@ This is a [PyTorch](https://pytorch.org) implementation of
* Tricky for RNNs. Do you need different normalizations for each step?
* Doesn't work with small batch sizes;
large NLP models are usually trained with small batch sizes.
* Need to compute means and variances across devices in distributed training
* Need to compute means and variances across devices in distributed training.
## Layer Normalization
Layer normalization is a simpler normalization method that works
on a wider range of settings.
Layer normalization transformers the inputs to have zero mean and unit variance
Layer normalization transforms the inputs to have zero mean and unit variance
across the features.
*Note that batch normalization fixes the zero mean and unit variance for each element.*
Layer normalization does it for each batch across all elements.