Varuna Jayasiri
2021-01-29 15:15:44 +05:30
parent a0fa963c60
commit 3161c23592
2 changed files with 2 additions and 2 deletions

View File

@@ -90,7 +90,7 @@ Instead it keeps weighted sum of the output of all layers.
This reduces the memory used for caching during prediction.
The first half of this file implements this.</p>
<p>The updated feedback transformer shares weights $W^l_k$ and $W^l_v$ used
-to calculate keys and values for among the layers.
+to calculate keys and values among the layers.
We then calculate the keys and values for each step only once and keep
them cached.
The <a href="#shared_kv">second half</a> of this file implements this.

View File

@@ -28,7 +28,7 @@ This reduces the memory used for caching during prediction.
The first half of this file implements this.
The updated feedback transformer shares weights $W^l_k$ and $W^l_v$ used
-to calculate keys and values for among the layers.
+to calculate keys and values among the layers.
We then calculate the keys and values for each step only once and keep
them cached.
The [second half](#shared_kv) of this file implements this.
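
The shared-weights idea described above can be sketched roughly as follows. This is an illustrative PyTorch sketch, not the repo's actual implementation: the module names, shapes, and the `step` method are assumptions. The point it shows is that when all layers share one key projection and one value projection (standing in for $W^l_k$ and $W^l_v$), the keys and values for each new step can be computed once and cached, instead of once per layer.

```python
import torch
import torch.nn as nn


class SharedKVCache(nn.Module):
    """Illustrative sketch of shared key/value projections with a step cache.

    All layers reuse the same key_proj and value_proj (hypothetical names),
    so each step's keys and values are computed a single time.
    """

    def __init__(self, d_model: int):
        super().__init__()
        # Single shared projections, standing in for W^l_k and W^l_v
        self.key_proj = nn.Linear(d_model, d_model)
        self.value_proj = nn.Linear(d_model, d_model)
        # Cache of per-step keys and values, grown one entry per step
        self.key_cache: list = []
        self.value_cache: list = []

    def step(self, x: torch.Tensor):
        """Compute keys/values for the new step `x` of shape (batch, d_model)
        once, append to the cache, and return all cached steps stacked as
        tensors of shape (steps, batch, d_model) for every layer to reuse."""
        self.key_cache.append(self.key_proj(x))
        self.value_cache.append(self.value_proj(x))
        return torch.stack(self.key_cache), torch.stack(self.value_cache)
```

With per-layer weights, a model with $L$ layers would run $2L$ projections per step; sharing the weights reduces this to two projections per step and lets a single cache serve all layers, which is the memory saving the text refers to.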