diff --git a/docs/normalization/batch_norm/index.html b/docs/normalization/batch_norm/index.html
index ad1b211d..05ecbeda 100644
--- a/docs/normalization/batch_norm/index.html
+++ b/docs/normalization/batch_norm/index.html
@@ -82,7 +82,7 @@
 network parameters during training.
 For example, let’s say there are two layers $l_1$ and $l_2$.
 During the beginning of the training $l_1$ outputs (inputs to $l_2$) could be in distribution $\mathcal{N}(0.5, 1)$.
-Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$.
+Then, after some training steps, it could move to $\mathcal{N}(0.6, 1.5)$.
 This is internal covariate shift.
 
 Internal covariate shift will adversely affect training speed because the later layers ($l_2$ in the above example) have to adapt to this shifted distribution.
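
The idea the corrected sentence describes can be made concrete with a minimal sketch, separate from the patch itself. It assumes PyTorch; the names bn, l1_out, and l2_in are made up for the illustration, and the second parameter of $\mathcal{N}(\cdot, \cdot)$ is treated as a standard deviation for simplicity. It draws activations first from $\mathcal{N}(0.5, 1)$ and then from the shifted $\mathcal{N}(0.6, 1.5)$, and shows that a batch normalization layer maps both to roughly zero mean and unit variance, so the next layer keeps seeing a stable input distribution.

import torch
import torch.nn as nn

# Hypothetical illustration (not from the labml_nn source): a batch norm layer
# normalizes whatever distribution l_1 currently produces, so l_2's inputs stay
# close to zero mean and unit variance even after the distribution drifts.
torch.manual_seed(0)
bn = nn.BatchNorm1d(8)  # 8 features; gamma and beta start at 1 and 0
bn.train()              # use batch statistics, as during training

for mean, std in [(0.5, 1.0), (0.6, 1.5)]:
    l1_out = torch.randn(1024, 8) * std + mean  # pretend these are l_1 outputs
    l2_in = bn(l1_out)                          # what l_2 would actually see
    print(f"l1 out: mean={l1_out.mean().item():.2f} std={l1_out.std().item():.2f}  ->  "
          f"l2 in: mean={l2_in.mean().item():.2f} std={l2_in.std().item():.2f}")

In both cases the normalized activations come out near $\mathcal{N}(0, 1)$, which is the stabilizing effect the surrounding paragraphs are motivating.
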
diff --git a/labml_nn/normalization/batch_norm/__init__.py b/labml_nn/normalization/batch_norm/__init__.py
index eef75e2c..bf60e27e 100644
--- a/labml_nn/normalization/batch_norm/__init__.py
+++ b/labml_nn/normalization/batch_norm/__init__.py
@@ -18,7 +18,7 @@
 network parameters during training.
 For example, let's say there are two layers $l_1$ and $l_2$.
 During the beginning of the training $l_1$ outputs (inputs to $l_2$) could be in distribution $\mathcal{N}(0.5, 1)$.
-Then, after some training steps, it could move to $\mathcal{N}(0.5, 1)$.
+Then, after some training steps, it could move to $\mathcal{N}(0.6, 1.5)$.
 This is *internal covariate shift*.
 
 Internal covariate shift will adversely affect training speed because the later layers