mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
synced 2025-08-26 16:50:39 +08:00
english
@@ -12,12 +12,12 @@ This is a miniature implementation of the paper
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training but we implement the concept of switching as described in the paper.
 
-The Switch Transformer is uses different parameters for each tokens by switching among parameters,
+The Switch Transformer uses different parameters for each token by switching among parameters,
 based on the token. So only a fraction of parameters is chosen for each token, so you
-can have more parameters but a less computational cost.
+can have more parameters but less computational cost.
 
 The switching happens at the Position-wise Feedforward network (FFN) of of each transformer block.
-Position-wise feedforward network is a two sequential fully connected layers.
+Position-wise feedforward network is a two sequentially fully connected layers.
 In switch transformer we have multiple FFNs (multiple experts) and
 we chose which one to use based on a router.
 The outputs a set of probabilities for picking a FFN,
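The hunk above describes top-1 expert switching: a router assigns each token a probability distribution over several position-wise FFNs (the experts), and only the highest-probability expert runs for that token, so most parameters stay idle on any given token. Below is a minimal PyTorch sketch of that idea; the class name `SwitchFFN`, its constructor arguments, and the per-expert routing loop are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn


class SwitchFFN(nn.Module):
    """Minimal sketch of top-1 switch routing (hypothetical, not the repo's module)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # Each expert is a position-wise FFN: two fully connected layers.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router outputs one logit per expert, per token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [seq_len, batch_size, d_model]; flatten to one token per row
        seq_len, batch_size, d_model = x.shape
        tokens = x.reshape(-1, d_model)
        # Routing probabilities over experts for each token
        probs = torch.softmax(self.router(tokens), dim=-1)
        # Top-1 switching: each token goes to its highest-probability expert
        top_prob, top_expert = probs.max(dim=-1)
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            idx = (top_expert == i).nonzero(as_tuple=True)[0]
            if idx.numel():
                out[idx] = expert(tokens[idx])
        # Scale by the routing probability so the router receives gradients
        out = out * top_prob.unsqueeze(-1)
        return out.reshape(seq_len, batch_size, d_model)
```

For example, `SwitchFFN(d_model=128, d_ff=256, n_experts=4)` applied to a `[seq_len, batch_size, 128]` tensor returns a tensor of the same shape, with each token processed by exactly one of the four expert FFNs.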
|