diff --git a/docs/capsule_networks/index.html b/docs/capsule_networks/index.html
index e563ece7..334c1d50 100644
--- a/docs/capsule_networks/index.html
+++ b/docs/capsule_networks/index.html
@@ -68,7 +68,7 @@
This is a PyTorch implementation/tutorial of
-Dynamic Routing Between Capsules.
+Dynamic Routing Between Capsules.
A capsule network is a neural network architecture that embeds features as capsules and routes them with a voting mechanism to the next layer of capsules.
Unlike in other implementations of models, we’ve included a sample, because
diff --git a/docs/capsule_networks/mnist.html b/docs/capsule_networks/mnist.html
index e2feb1fb..8b64caf0 100644
--- a/docs/capsule_networks/mnist.html
+++ b/docs/capsule_networks/mnist.html
@@ -69,7 +69,7 @@
This is annotated PyTorch code to classify MNIST digits.
It implements the experiment described in the paper
-Dynamic Routing Between Capsules.
+Dynamic Routing Between Capsules.
from typing import Any
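To make the capsule routing described above concrete, here is a minimal, hypothetical sketch of the squash non-linearity and routing-by-agreement; tensor shapes and names are assumptions, not the annotated implementation itself.

```python
import torch
import torch.nn.functional as F

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    # Squash non-linearity: keeps the vector's direction, maps its length into [0, 1)
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat: torch.Tensor, iterations: int = 3) -> torch.Tensor:
    # u_hat: votes from lower capsules for higher capsules, [batch, n_lower, n_higher, d_higher]
    b = torch.zeros(u_hat.shape[:-1], device=u_hat.device)  # routing logits
    for _ in range(iterations):
        c = F.softmax(b, dim=-1)                      # how much each lower capsule sends to each higher one
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum of votes -> [batch, n_higher, d_higher]
        v = squash(s)                                 # higher-capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement between votes and outputs updates the logits
    return v
```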
diff --git a/docs/capsule_networks/readme.html b/docs/capsule_networks/readme.html
index 4d6e991f..215aa730 100644
--- a/docs/capsule_networks/readme.html
+++ b/docs/capsule_networks/readme.html
@@ -68,7 +68,7 @@
This is a PyTorch implementation/tutorial of
-Dynamic Routing Between Capsules.
+Dynamic Routing Between Capsules.
A capsule network is a neural network architecture that embeds features as capsules and routes them with a voting mechanism to the next layer of capsules.
Unlike in other implementations of models, we’ve included a sample, because
diff --git a/docs/gan/cycle_gan/index.html b/docs/gan/cycle_gan/index.html
index 9ae58422..5f706732 100644
--- a/docs/gan/cycle_gan/index.html
+++ b/docs/gan/cycle_gan/index.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation/tutorial of the paper
-Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
+Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
I’ve taken pieces of code from eriklindernoren/PyTorch-GAN. It is a very good resource if you want to check out other GAN variations too.
Cycle GAN does image-to-image translation.
diff --git a/docs/gan/cycle_gan/readme.html b/docs/gan/cycle_gan/readme.html
index 79637c15..56ebf408 100644
--- a/docs/gan/cycle_gan/readme.html
+++ b/docs/gan/cycle_gan/readme.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation/tutorial of the paper
-Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
+Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks.
This is a PyTorch implementation of paper
-Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
+Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
This implementation is based on the PyTorch DCGAN Tutorial.
This is a PyTorch implementation of paper
-Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
+Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
This is an implementation of
-Generative Adversarial Networks.
+Generative Adversarial Networks.
The generator, $G(\pmb{z}; \theta_g)$, generates samples that match the distribution of the data, while the discriminator, $D(\pmb{x}; \theta_d)$, gives the probability that $\pmb{x}$ came from the data rather than $G$.
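As a rough illustration of that two-player objective, here is a hedged sketch of the standard (non-saturating) GAN losses; the module and variable names are hypothetical, not the annotated implementation:

```python
import torch
import torch.nn.functional as F

def gan_losses(discriminator, generator, x_real, z):
    # The discriminator outputs a logit; label 1 means "came from data", 0 means "came from G"
    x_fake = generator(z)
    d_real = discriminator(x_real)
    d_fake = discriminator(x_fake.detach())
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    # Discriminator ascends log D(x) + log(1 - D(G(z)))
    loss_d = F.binary_cross_entropy_with_logits(d_real, ones) + \
             F.binary_cross_entropy_with_logits(d_fake, zeros)
    # Generator (non-saturating form) ascends log D(G(z))
    loss_g = F.binary_cross_entropy_with_logits(discriminator(x_fake), ones)
    return loss_d, loss_g
```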
diff --git a/docs/gan/original/readme.html b/docs/gan/original/readme.html
index 0bc677c1..152d29e8 100644
--- a/docs/gan/original/readme.html
+++ b/docs/gan/original/readme.html
@@ -69,7 +69,7 @@
This is an annotated implementation of
-Generative Adversarial Networks.
+Generative Adversarial Networks.
This is a PyTorch implementation of the paper
- Analyzing and Improving the Image Quality of StyleGAN
+ Analyzing and Improving the Image Quality of StyleGAN
which introduces StyleGAN 2. StyleGAN 2 is an improvement over StyleGAN from the paper
- A Style-Based Generator Architecture for Generative Adversarial Networks.
+ A Style-Based Generator Architecture for Generative Adversarial Networks.
And StyleGAN is based on Progressive GAN from the paper
- Progressive Growing of GANs for Improved Quality, Stability, and Variation.
+ Progressive Growing of GANs for Improved Quality, Stability, and Variation.
All three papers are from the same authors from NVIDIA AI.
Our implementation is minimalistic StyleGAN 2 model training code. Only single GPU training is supported to keep the implementation simple.
@@ -1695,7 +1695,7 @@ since we want to calculate the standard deviation for each feature.
The down-sample operation smoothens each feature channel and scales down by $2 \times$ using bilinear interpolation. This is based on the paper
- Making Convolutional Networks Shift-Invariant Again.
+ Making Convolutional Networks Shift-Invariant Again.
class DownSample(nn.Module):
The up-sample operation scales the image up by $2 \times$ and smoothens each feature channel. This is based on the paper
- Making Convolutional Networks Shift-Invariant Again.
+ Making Convolutional Networks Shift-Invariant Again.
class UpSample(nn.Module):
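A minimal sketch of such smoothed resampling, assuming a fixed 3x3 blur kernel and bilinear interpolation; the actual DownSample/UpSample modules differ in their details:

```python
import torch
import torch.nn.functional as F

def smooth(x: torch.Tensor) -> torch.Tensor:
    # Blur every channel with a fixed [1, 2, 1] x [1, 2, 1] kernel (depth-wise convolution)
    k = torch.tensor([1., 2., 1.], device=x.device, dtype=x.dtype)
    k = (k[:, None] * k[None, :]) / 16.0
    k = k.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])

def down_sample(x: torch.Tensor) -> torch.Tensor:
    # Smooth first, then scale down by 2x with bilinear interpolation
    return F.interpolate(smooth(x), scale_factor=0.5, mode='bilinear', align_corners=False)

def up_sample(x: torch.Tensor) -> torch.Tensor:
    # Scale up by 2x with bilinear interpolation, then smooth
    return smooth(F.interpolate(x, scale_factor=2.0, mode='bilinear', align_corners=False))
```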
This is the $R_1$ regularization penalty from the paper
-Which Training Methods for GANs do actually Converge?.
+Which Training Methods for GANs do actually Converge?.
diff --git a/docs/gan/stylegan/readme.html b/docs/gan/stylegan/readme.html
index 6c946567..999171ae 100644
--- a/docs/gan/stylegan/readme.html
+++ b/docs/gan/stylegan/readme.html
@@ -69,12 +69,12 @@
This is a PyTorch implementation of the paper
- Analyzing and Improving the Image Quality of StyleGAN
+ Analyzing and Improving the Image Quality of StyleGAN
which introduces StyleGAN 2. StyleGAN 2 is an improvement over StyleGAN from the paper
- A Style-Based Generator Architecture for Generative Adversarial Networks.
+ A Style-Based Generator Architecture for Generative Adversarial Networks.
And StyleGAN is based on Progressive GAN from the paper
- Progressive Growing of GANs for Improved Quality, Stability, and Variation.
+ Progressive Growing of GANs for Improved Quality, Stability, and Variation.
All three papers are from the same authors from NVIDIA AI.
This is an implementation of
-Improved Training of Wasserstein GANs.
+Improved Training of Wasserstein GANs.
WGAN suggests clipping weights to enforce the Lipschitz constraint on the discriminator network (critic). This and other weight constraints like L2 norm clipping, weight normalization,
@@ -82,7 +82,7 @@ L1, L2 weight decay have problems:
-The paper Improved Training of Wasserstein GANs
+The paper Improved Training of Wasserstein GANs
proposes a better way to enforce the Lipschitz constraint, a gradient penalty.
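A hedged sketch of that gradient penalty term (interpolate between real and generated samples, then push the critic's gradient norm towards 1); the names here are hypothetical, not the annotated implementation:

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lambda_gp: float = 10.0):
    # Sample points on straight lines between real and generated samples
    eps = torch.rand(x_real.shape[0], 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    d_hat = critic(x_hat)
    # Gradient of the critic's output with respect to the interpolated input
    grad, = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grad.reshape(grad.shape[0], -1).norm(2, dim=1)
    # Penalize the gradient norm for deviating from 1
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```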
-We use double Q-learning, where
+We use double Q-learning, where
the $\operatorname{argmax}$ is taken from $\color{cyan}{\theta_i}$ and the value is taken from $\color{orange}{\theta_i^{-}}$.
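Before the loss, a small sketch of how that double-Q target could be computed, assuming hypothetical q_online/q_target networks; this is not the repository's code:

```python
import torch

@torch.no_grad()
def double_q_target(q_online, q_target, next_obs, reward, done, gamma: float = 0.99):
    # done: 1.0 where the episode ended, 0.0 otherwise
    # Choose the greedy action with the online parameters (theta_i) ...
    best_action = q_online(next_obs).argmax(dim=-1, keepdim=True)
    # ... but take its value from the target parameters (theta_i^-)
    next_value = q_target(next_obs).gather(-1, best_action).squeeze(-1)
    return reward + gamma * (1.0 - done) * next_value
```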
And the loss function becomes,
diff --git a/docs/rl/dqn/model.html b/docs/rl/dqn/model.html
index 71e92fba..306d814a 100644
--- a/docs/rl/dqn/model.html
+++ b/docs/rl/dqn/model.html
@@ -82,7 +82,7 @@ #
-We are using a dueling network
+We are using a dueling network
to calculate Q-values. The intuition behind the dueling network architecture is that in most states the action doesn’t matter,
diff --git a/docs/rl/dqn/replay_buffer.html b/docs/rl/dqn/replay_buffer.html
index 03693cf8..18c310db 100644
--- a/docs/rl/dqn/replay_buffer.html
+++ b/docs/rl/dqn/replay_buffer.html
@@ -68,7 +68,7 @@ #
-This implements paper Prioritized experience replay,
+This implements paper Prioritized experience replay,
using a binary segment tree.
-Prioritized experience replay
+Prioritized experience replay
samples important transitions more frequently. The transitions are prioritized by the Temporal Difference error (td error), $\delta$.
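For intuition, here is a tiny sketch of the proportional prioritization this implies, assuming $p_i = |\delta_i| + \epsilon$ and exponent $\alpha$; the actual implementation uses a binary segment tree instead of materializing all probabilities:

```python
import numpy as np

def sampling_probabilities(td_errors: np.ndarray, alpha: float = 0.6, eps: float = 1e-6) -> np.ndarray:
    # Proportional prioritization: p_i = |delta_i| + eps and P(i) = p_i^alpha / sum_k p_k^alpha
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

# e.g. indices = np.random.choice(len(td_errors), size=32, p=sampling_probabilities(td_errors))
```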
We sample transition $i$ with probability,
diff --git a/docs/rl/ppo/gae.html b/docs/rl/ppo/gae.html
index db2f96a6..f8f78c27 100644
--- a/docs/rl/ppo/gae.html
+++ b/docs/rl/ppo/gae.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of paper
-Generalized Advantage Estimation.
+Generalized Advantage Estimation.
You can find an experiment that uses it here.
This is a PyTorch implementation of
-Proximal Policy Optimization - PPO.
+Proximal Policy Optimization - PPO.
PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems
@@ -171,7 +171,7 @@ J(\pi_\theta) - J(\pi_{\theta_{OLD}})
The error we introduce to $J(\pi_\theta) - J(\pi_{\theta_{OLD}})$ by this assumption is bound by the KL divergence between $\pi_\theta$ and $\pi_{\theta_{OLD}}$.
-Constrained Policy Optimization
+Constrained Policy Optimization
shows the proof of this. I haven’t read it.
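The clipped surrogate objective PPO uses to keep the new policy close to the old one can be sketched as below; this is a simplified sketch, not the annotated implementation:

```python
import torch

def ppo_clip_loss(log_pi, log_pi_old, advantage, clip_eps: float = 0.2):
    # Probability ratio between the new and old policies for the sampled actions
    ratio = torch.exp(log_pi - log_pi_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) of the clipped and unclipped surrogate objectives
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```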
where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
-It was introduced in paper Gaussian Error Linear Units.
+It was introduced in paper Gaussian Error Linear Units.
@option(FeedForwardConfigs.activation, 'GELU')
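For reference, the activation itself is just $x \Phi(x)$; a one-line sketch using the error function (PyTorch also provides nn.GELU):

```python
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF expressed through erf
    return x * 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))
```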
@@ -294,7 +294,7 @@
These are variants with gated hidden layers for the FFN
-as introduced in paper GLU Variants Improve Transformer.
+as introduced in paper GLU Variants Improve Transformer.
We have omitted the bias terms as specified in the paper.
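One of those gated variants (GEGLU) might look roughly like this; the class and attribute names are hypothetical, not the FeedForward module in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    # GEGLU variant: FFN(x) = (GELU(x W1) * (x V)) W2, with the bias terms omitted
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate path
        self.v = nn.Linear(d_model, d_ff, bias=False)   # linear path
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)) * self.v(x))
```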
The paper
-Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch
+Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch
finds similarities between linear self-attention and fast weight systems and makes modifications to the self-attention update rule based on that. It also introduces a simpler, yet effective kernel function.
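A per-step sketch of the delta-rule update described in the paper, ignoring the kernel feature map and normalization details; names and shapes here are assumptions:

```python
import torch

def fast_weight_step(W, k_phi, q_phi, v, beta):
    # Delta-rule update: read what is currently stored for this key, write back an
    # interpolation towards the new value, then answer the query from the updated memory.
    # W: [d_v, d_k]; k_phi, q_phi: [d_k] (key/query after the kernel feature map); v: [d_v]
    v_old = W @ k_phi
    W = W + beta * torch.outer(v - v_old, k_phi)
    y = W @ q_phi
    return W, y
```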
diff --git a/docs/transformers/fast_weights/readme.html b/docs/transformers/fast_weights/readme.html
index 48763408..271418c9 100644
--- a/docs/transformers/fast_weights/readme.html
+++ b/docs/transformers/fast_weights/readme.html
@@ -69,7 +69,7 @@
This is an annotated implementation of the paper
-Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch.
+Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch.
Here is the annotated implementation. Here are the training code and a notebook for training a fast weights transformer on the Tiny Shakespeare dataset.
diff --git a/docs/transformers/feed_forward.html b/docs/transformers/feed_forward.html
index 392c315d..e3ae3e1e 100644
--- a/docs/transformers/feed_forward.html
+++ b/docs/transformers/feed_forward.html
@@ -84,7 +84,7 @@ GELU (Gaussian Error Linear Unit) activation is also used instead of ReLU.
where $\Phi(x) = P(X \le x), X \sim \mathcal{N}(0,1)$
This is a generic implementation that supports different variants including
-Gated Linear Units (GLU).
+Gated Linear Units (GLU).
We have also implemented experiments on these:
labml.configs
This is a PyTorch implementation of the paper
-Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
+Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. Feedback transformer pays attention to the output of all layers in previous steps.
diff --git a/docs/transformers/feedback/readme.html b/docs/transformers/feedback/readme.html
index 077924c0..996466b1 100644
--- a/docs/transformers/feedback/readme.html
+++ b/docs/transformers/feedback/readme.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
-Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
+Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
Normal transformers process tokens in parallel. Each transformer layer pays attention to the outputs of the previous layer. Feedback transformer pays attention to the output of all layers in previous steps.
diff --git a/docs/transformers/fnet/index.html b/docs/transformers/fnet/index.html
index d39ec5c8..62d5e070 100644
--- a/docs/transformers/fnet/index.html
+++ b/docs/transformers/fnet/index.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
-FNet: Mixing Tokens with Fourier Transforms.
+FNet: Mixing Tokens with Fourier Transforms.
This paper replaces the self-attention layer with two Fourier transforms to mix tokens.
diff --git a/docs/transformers/fnet/readme.html b/docs/transformers/fnet/readme.html
index 40317673..f7c89ad3 100644
--- a/docs/transformers/fnet/readme.html
+++ b/docs/transformers/fnet/readme.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
-FNet: Mixing Tokens with Fourier Transforms.
+FNet: Mixing Tokens with Fourier Transforms.
This paper replaces the self-attention layer with two Fourier transforms to mix tokens.
diff --git a/docs/transformers/index.html b/docs/transformers/index.html
index a6051832..f92cd215 100644
--- a/docs/transformers/index.html
+++ b/docs/transformers/index.html
@@ -69,7 +69,7 @@
This module contains PyTorch implementations and explanations of the original transformer
-from paper Attention Is All You Need,
+from paper Attention Is All You Need,
and derivatives and enhancements of it.
This is an implementation of GPT-2 architecture.
This is an implementation of the paper
-GLU Variants Improve Transformer.
+GLU Variants Improve Transformer.
This is an implementation of the paper
-Generalization through Memorization: Nearest Neighbor Language Models.
+Generalization through Memorization: Nearest Neighbor Language Models.
This is an implementation of the paper
-Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
+Accessing Higher-level Representations in Sequential Transformers with Feedback Memory.
This is a miniature implementation of the paper
-Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
+Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
Our implementation only has a few million parameters and doesn’t do model parallel distributed training. It does single GPU training but we implement the concept of switching as described in the paper.
This is an implementation of the paper
-Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch.
+Linear Transformers Are Secretly Fast Weight Memory Systems in PyTorch.
This is an implementation of the paper
-FNet: Mixing Tokens with Fourier Transforms.
+FNet: Mixing Tokens with Fourier Transforms.
This is an implementation of the paper An Attention Free Transformer.
This is an implementation of the Masked Language Model used for pre-training in the paper
-BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
+BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
This is an implementation of the paper MLP-Mixer: An all-MLP Architecture for Vision.
@@ -119,7 +119,7 @@ It does single GPU training but we implement the concept of switching as describ
Pay Attention to MLPs.
This is an implementation of the paper
-An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
+An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
from .configs import TransformerConfigs
diff --git a/docs/transformers/knn/index.html b/docs/transformers/knn/index.html
index 6db8ee6f..df4ae1e0 100644
--- a/docs/transformers/knn/index.html
+++ b/docs/transformers/knn/index.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
- Generalization through Memorization: Nearest Neighbor Language Models.
+ Generalization through Memorization: Nearest Neighbor Language Models.
It uses k-nearest neighbors to improve perplexity of autoregressive transformer models.
An autoregressive language model estimates $p(w_t | \color{yellowgreen}{c_t})$, where $w_t$ is the token at step $t$
diff --git a/docs/transformers/mha.html b/docs/transformers/mha.html
index 4fe146a6..73c5d512 100644
--- a/docs/transformers/mha.html
+++ b/docs/transformers/mha.html
@@ -68,7 +68,7 @@
This is a tutorial/implementation of multi-headed attention
-from paper Attention Is All You Need
+from paper Attention Is All You Need
in PyTorch. The implementation is inspired by the Annotated Transformer.
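The core of multi-headed attention is scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})V$; a compact sketch (the full module adds the linear projections and head reshaping):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, d_k]; every head attends independently
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```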
Here is the training code that uses a basic transformer
diff --git a/docs/transformers/mlm/index.html b/docs/transformers/mlm/index.html
index 143546de..228ccc13 100644
--- a/docs/transformers/mlm/index.html
+++ b/docs/transformers/mlm/index.html
@@ -70,7 +70,7 @@
This is a PyTorch implementation of the Masked Language Model (MLM) used to pre-train the BERT model introduced in the paper
-BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
+BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
The BERT model is a transformer model. The paper pre-trains the model using MLM and next sentence prediction.
diff --git a/docs/transformers/mlm/readme.html b/docs/transformers/mlm/readme.html
index 76667336..552b5f9b 100644
--- a/docs/transformers/mlm/readme.html
+++ b/docs/transformers/mlm/readme.html
@@ -70,7 +70,7 @@
This is a PyTorch implementation of the Masked Language Model (MLM) used to pre-train the BERT model introduced in the paper
-BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
+BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
The BERT model is a transformer model. The paper pre-trains the model using MLM and next sentence prediction.
diff --git a/docs/transformers/models.html b/docs/transformers/models.html
index edfaf04c..27e2560b 100644
--- a/docs/transformers/models.html
+++ b/docs/transformers/models.html
@@ -179,7 +179,7 @@ and add the original residual vectors.
An alternative is to do a layer normalization after adding the residuals. But we found this to be less stable when training. We found a detailed discussion about this in the paper
- On Layer Normalization in the Transformer Architecture.
+ On Layer Normalization in the Transformer Architecture.
class TransformerLayer(Module):
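A minimal sketch of the pre-layer-norm arrangement discussed above (normalize the sub-layer input, then add the residual); this is a hypothetical wrapper, not the TransformerLayer class itself:

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    # Pre-LN: normalize the input to the sub-layer, then add the residual.
    # The post-LN alternative, norm(x + sublayer(x)), is the one noted above as less stable.
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```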
This is a miniature PyTorch implementation of the paper
-Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
+Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
Our implementation only has a few million parameters and doesn’t do model parallel distributed training. It does single GPU training, but we implement the concept of switching as described in the paper.
The Switch Transformer uses different parameters for each token by switching among parameters
diff --git a/docs/transformers/switch/readme.html b/docs/transformers/switch/readme.html
index 90892c2d..fd5a3384 100644
--- a/docs/transformers/switch/readme.html
+++ b/docs/transformers/switch/readme.html
@@ -69,7 +69,7 @@
This is a miniature PyTorch implementation of the paper
-Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
+Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
Our implementation only has a few million parameters and doesn’t do model parallel distributed training. It does single GPU training, but we implement the concept of switching as described in the paper.
The Switch Transformer uses different parameters for each token by switching among parameters
diff --git a/docs/transformers/vit/index.html b/docs/transformers/vit/index.html
index 2bb20f87..b1ca1939 100644
--- a/docs/transformers/vit/index.html
+++ b/docs/transformers/vit/index.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
-An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
+An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
The vision transformer applies a pure transformer to images without any convolution layers. They split the image into patches and apply a transformer on patch embeddings.
diff --git a/docs/transformers/vit/readme.html b/docs/transformers/vit/readme.html
index dce47850..da2e721c 100644
--- a/docs/transformers/vit/readme.html
+++ b/docs/transformers/vit/readme.html
@@ -69,7 +69,7 @@
This is a PyTorch implementation of the paper
-An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
+An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale.
The vision transformer applies a pure transformer to images without any convolution layers. They split the image into patches and apply a transformer on patch embeddings.
diff --git a/docs/transformers/xl/index.html b/docs/transformers/xl/index.html
index dd4fc7ed..1de9266e 100644
--- a/docs/transformers/xl/index.html
+++ b/docs/transformers/xl/index.html
@@ -69,7 +69,7 @@
This is an implementation of
-Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
+Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
in PyTorch.
The transformer has a limited attention span, equal to the length of the sequence trained in parallel.
diff --git a/docs/transformers/xl/readme.html b/docs/transformers/xl/readme.html
index 0f8aa57c..05a21c90 100644
--- a/docs/transformers/xl/readme.html
+++ b/docs/transformers/xl/readme.html
@@ -69,7 +69,7 @@
This is an implementation of
-Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
+Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
in PyTorch.
The transformer has a limited attention span, equal to the length of the sequence trained in parallel.
diff --git a/docs/transformers/xl/relative_mha.html b/docs/transformers/xl/relative_mha.html
index dcd40d04..cf87181b 100644
--- a/docs/transformers/xl/relative_mha.html
+++ b/docs/transformers/xl/relative_mha.html
@@ -69,7 +69,7 @@
This is an implementation of relative multi-headed attention from the paper
-Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
+Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
in PyTorch.
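To illustrate how Transformer-XL extends that limited span, here is a conceptual sketch where keys and values also cover detached states from the previous segment; the mha call signature is an assumption, and the relative positional encoding that relative_mha implements is omitted here:

```python
import torch

def attend_with_memory(mha, x, mem):
    # x, mem: [seq_len, batch, d_model]. Keys and values also cover the detached states of
    # the previous segment, so the effective attention span grows beyond the segment length.
    ctx = torch.cat([mem.detach(), x], dim=0) if mem is not None else x
    out = mha(query=x, key=ctx, value=ctx)  # assumed signature of a generic attention module
    new_mem = x.detach()                    # cached as the memory for the next segment
    return out, new_mem
```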