mirror of
https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
synced 2025-11-01 20:28:41 +08:00
📚 switch readme
@@ -10,18 +10,18 @@ summary: >
 This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
 [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
-It does single GPU training but we implement the concept of switching as described in the paper.
+It does single GPU training, but we implement the concept of switching as described in the paper.

 The Switch Transformer uses different parameters for each token by switching among parameters
 based on the token. So only a fraction of parameters is chosen for each token, and you
 can have more parameters but less computational cost.

-The switching happens at the Position-wise Feedforward network (FFN) of of each transformer block.
+The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
 A position-wise feedforward network is two fully connected layers applied sequentially.
-In the switch transformer we have multiple FFNs (multiple experts) and
-we chose which one to use based on a router.
+In the switch transformer we have multiple FFNs (multiple experts),
+and we choose which one to use based on a router.
 The router outputs a set of probabilities for picking a FFN,
-and we pick the one with highest probability and only evaluates that.
+and we pick the one with the highest probability and evaluate only that.
 So essentially the computational cost is the same as having a single FFN.
 In our implementation this doesn't parallelize well when you have many or large FFNs since it's all
 happening on a single GPU.
labml_nn/transformers/switch/readme.md (new file, 29 lines)
@@ -0,0 +1,29 @@
# [Switch Transformer](https://nn.labml.ai/transformers/switch/index.html)

This is a miniature [PyTorch](https://pytorch.org) implementation of the paper
[Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961).
Our implementation only has a few million parameters and doesn't do model parallel distributed training.
It does single GPU training, but we implement the concept of switching as described in the paper.

The Switch Transformer uses different parameters for each token by switching among parameters
based on the token. So only a fraction of parameters is chosen for each token, and you
can have more parameters but less computational cost.
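The parameters-vs-compute trade-off can be illustrated with a rough count, a minimal sketch assuming illustrative dimensions `d_model`, `d_ff`, and `n_experts` (the helper functions are hypothetical, not the repo's API):

```python
# Hypothetical parameter count for a switch FFN layer: n_experts times the
# parameters of one FFN, but the per-token compute of only a single FFN.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of a two-layer position-wise FFN (weights and biases)."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def switch_ffn_params(d_model: int, d_ff: int, n_experts: int) -> int:
    """n_experts expert FFNs plus a linear router over the experts."""
    return n_experts * ffn_params(d_model, d_ff) + (d_model * n_experts + n_experts)

d_model, d_ff, n_experts = 512, 2048, 8
total = switch_ffn_params(d_model, d_ff, n_experts)
per_token = ffn_params(d_model, d_ff)  # only one expert runs per token
print(total, per_token)
```

With these (made-up) sizes the layer stores roughly eight FFNs' worth of parameters while each token only pays for one FFN's worth of computation.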

The switching happens at the Position-wise Feedforward network (FFN) of each transformer block.
A position-wise feedforward network is two fully connected layers applied sequentially.
In the switch transformer we have multiple FFNs (multiple experts),
and we choose which one to use based on a router.
The router outputs a set of probabilities for picking a FFN,
and we pick the one with the highest probability and evaluate only that.
So essentially the computational cost is the same as having a single FFN.
In our implementation this doesn't parallelize well when you have many or large FFNs since it's all
happening on a single GPU.
In a distributed setup you would have each FFN (each very large) on a different device.
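The routing step above can be sketched in plain Python for a single token. This is a hypothetical top-1 router, not the repo's implementation; scaling the chosen expert's output by its route probability is how the router stays trainable in the real autograd setting, as described in the paper:

```python
import math

def route(logits):
    """Softmax over expert logits, then pick the highest-probability expert."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def switch_ffn(x, experts, logits):
    """Evaluate only the selected expert and scale its output by the
    route probability."""
    i, p = route(logits)
    return [p * v for v in experts[i](x)]

# Usage: two toy "experts" that just scale the input vector.
experts = [lambda x: [2 * v for v in x], lambda x: [10 * v for v in x]]
out = switch_ffn([1.0, 1.0], experts, logits=[0.0, 2.0])
```

Only `experts[1]` is ever called here, because its logit is larger; the other expert's weights contribute nothing to this token's compute.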

The paper introduces another loss term to balance the load among the experts (FFNs) and
discusses dropping tokens when routing is not balanced.
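That balancing term can be sketched as follows, assuming the paper's formulation `loss = n_experts * sum_i f_i * P_i` (up to its scaling coefficient), where `f_i` is the fraction of tokens routed to expert `i` and `P_i` is the mean router probability for expert `i`; the function name and signature here are illustrative:

```python
# Illustrative load-balancing loss. `route_probs` is a list of per-token
# router probability vectors; `assignments` is the chosen expert per token.

def load_balancing_loss(route_probs, assignments, n_experts):
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in route_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Perfectly balanced uniform routing attains the minimum value of 1.0.
probs = [[0.5, 0.5], [0.5, 0.5]]
loss = load_balancing_loss(probs, assignments=[0, 1], n_experts=2)
```

Imbalanced routing, e.g. every token sent to the same expert, pushes the value above 1, which is what the extra loss term penalizes.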

Here's [the training code](experiment.html) and a notebook for training a switch transformer on the Tiny Shakespeare dataset.

[Colab notebook](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/switch/experiment.ipynb)
[View run](https://web.lab-ml.com/run?uuid=c4656c605b9311eba13d0242ac1c0002)