From 137ab59eafe2f5956adaa4940faadf478806c93e Mon Sep 17 00:00:00 2001
From: Varuna Jayasiri
Date: Sun, 24 Jan 2021 08:08:09 +0530
Subject: [PATCH] english

---
 labml_nn/transformers/switch/__init__.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/labml_nn/transformers/switch/__init__.py b/labml_nn/transformers/switch/__init__.py
index 75ef9e4a..35f2930b 100644
--- a/labml_nn/transformers/switch/__init__.py
+++ b/labml_nn/transformers/switch/__init__.py
@@ -12,12 +12,12 @@ This is a miniature implementation of the paper
 Our implementation only has a few million parameters and doesn't do model parallel distributed training.
 It does single GPU training but we implement the concept of switching as described in the paper.
 
-The Switch Transformer is uses different parameters for each tokens by switching among parameters,
+The Switch Transformer uses different parameters for each token by switching among parameters,
 based on the token. So only a fraction of parameters is chosen for each token, so you
-can have more parameters but a less computational cost.
+can have more parameters but less computational cost.
 
 The switching happens at the Position-wise Feedforward network (FFN) of of each transformer block.
-Position-wise feedforward network is a two sequential fully connected layers.
+Position-wise feedforward network is a two sequentially fully connected layers.
 In switch transformer we have multiple FFNs (multiple experts) and we chose which one to use based on a router.
 The outputs a set of probabilities for picking a FFN,
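
For context on the switching described in the patched docstring, the following is a minimal sketch in PyTorch of a position-wise FFN with multiple experts and a top-1 router: the router produces a probability for each expert and every token is processed only by the expert it is routed to. This is not the labml_nn implementation; the class and parameter names (SwitchFFN, n_experts, d_ff) are illustrative assumptions.

    # Minimal sketch of top-1 expert switching (assumed names, not labml_nn code)
    import torch
    import torch.nn as nn


    class SwitchFFN(nn.Module):
        def __init__(self, d_model: int, d_ff: int, n_experts: int):
            super().__init__()
            # One position-wise FFN (two fully connected layers) per expert
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
            # Router: one probability per expert, computed per token
            self.router = nn.Linear(d_model, n_experts)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: [seq_len, batch_size, d_model]; flatten to one token per row
            seq_len, batch_size, d_model = x.shape
            x = x.reshape(-1, d_model)
            # Routing probabilities and the top-1 expert choice for every token
            probs = torch.softmax(self.router(x), dim=-1)
            top_prob, top_expert = probs.max(dim=-1)
            out = x.new_zeros(x.shape)
            # Send each token only to its chosen expert (the "switch")
            for i, expert in enumerate(self.experts):
                idx = (top_expert == i).nonzero(as_tuple=True)[0]
                if idx.numel() > 0:
                    out[idx] = expert(x[idx])
            # Scale by the routing probability so gradients reach the router
            out = out * top_prob.unsqueeze(-1)
            return out.reshape(seq_len, batch_size, d_model)


    # Tiny usage example
    ffn = SwitchFFN(d_model=16, d_ff=32, n_experts=4)
    tokens = torch.randn(10, 2, 16)
    print(ffn(tokens).shape)  # torch.Size([10, 2, 16])

Scaling the selected expert's output by its routing probability is what keeps the router differentiable, so it can be trained end to end together with the experts.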