📚 ppo intro

This commit is contained in:
Varuna Jayasiri
2021-02-23 17:46:27 +05:30
parent 5442dfb130
commit c1e9b0ce32
6 changed files with 370 additions and 41 deletions


@@ -10,6 +10,15 @@ summary: >
This is a [PyTorch](https://pytorch.org) implementation of
[Proximal Policy Optimization - PPO](https://arxiv.org/abs/1707.06347).
PPO is a policy gradient method for reinforcement learning.
Simple policy gradient methods do a single gradient update per sample (or a set of samples).
Doing multiple gradient steps for a single sample causes problems
because the policy deviates too much, producing a bad policy.
PPO lets us do multiple gradient updates per sample by trying to keep the
policy close to the policy that was used to sample the data.
It does so by clipping the gradient flow if the updated policy
is not close to the policy used to sample the data.
You can find an experiment that uses it [here](experiment.html).
The experiment uses [Generalized Advantage Estimation](gae.html).
"""
@@ -24,6 +33,8 @@ class ClippedPPOLoss(Module):
"""
## PPO Loss
Here's how the PPO update rule is derived.
We want to maximize policy reward
$$\max_\theta J(\pi_\theta) =
\mathop{\mathbb{E}}_{\tau \sim \pi_\theta}\Biggl[\sum_{t=0}^\infty \gamma^t r_t \Biggr]$$
@@ -128,6 +139,8 @@ class ClippedPPOLoss(Module):
# *this is different from rewards* $r_t$.
ratio = torch.exp(log_pi - sampled_log_pi)
# ### Clipping the policy ratio
#
# \begin{align}
# \mathcal{L}^{CLIP}(\theta) =
# \mathbb{E}_{a_t, s_t \sim \pi_{\theta_{OLD}}} \biggl[
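
As a rough sketch of how this clipped objective can be computed once the ratio and advantages are available (the function name and the default clip range below are assumptions, not this module's exact interface):

```python
import torch

def clipped_objective(ratio: torch.Tensor,
                      advantage: torch.Tensor,
                      clip: float = 0.2) -> torch.Tensor:
    # Surrogate objective with the raw policy ratio
    surr = ratio * advantage
    # Surrogate objective with the ratio clamped to [1 - eps, 1 + eps]
    clipped_surr = ratio.clamp(1.0 - clip, 1.0 + clip) * advantage
    # Pessimistic bound: once the ratio has moved past the clip range in the
    # profitable direction, the clamped term wins the min and gradients stop
    return torch.min(surr, clipped_surr).mean()
```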
@@ -167,6 +180,8 @@ class ClippedValueFunctionLoss(Module):
"""
## Clipped Value Function Loss
Similarly, we clip the value function update.
\begin{align}
V^{\pi_\theta}_{CLIP}(s_t)
&= clip\Bigl(V^{\pi_\theta}(s_t) - \hat{V_t}, -\epsilon, +\epsilon\Bigr)
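
A minimal sketch of one common way to implement this clipped value loss (the argument names and the clip range are illustrative assumptions, not necessarily this class's exact interface):

```python
import torch

def clipped_value_loss(value: torch.Tensor,
                       sampled_value: torch.Tensor,
                       sampled_return: torch.Tensor,
                       clip: float = 0.2) -> torch.Tensor:
    # Keep the new value estimate within +/- clip of the value that was
    # predicted when the data was sampled
    clipped_value = sampled_value + (value - sampled_value).clamp(-clip, clip)
    # Take the worse (larger) of the clipped and unclipped squared errors,
    # so clipping never makes the loss smaller
    vf_loss = torch.max((value - sampled_return) ** 2,
                        (clipped_value - sampled_return) ** 2)
    return 0.5 * vf_loss.mean()
```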

labml_nn/rl/ppo/readme.md Normal file

@@ -0,0 +1,16 @@
# [Proximal Policy Optimization (PPO)](https://nn.labml.ai/rl/ppo/index.html)
This is a [PyTorch](https://pytorch.org) implementation of
[Proximal Policy Optimization - PPO](https://arxiv.org/abs/1707.06347).
PPO is a policy gradient method for reinforcement learning.
Simple policy gradient methods do a single gradient update per sample (or a set of samples).
Doing multiple gradient steps for a single sample causes problems
because the policy deviates too much, producing a bad policy.
PPO lets us do multiple gradient updates per sample by trying to keep the
policy close to the policy that was used to sample the data.
It does so by clipping the gradient flow if the updated policy
is not close to the policy used to sample the data.
You can find an experiment that uses it [here](https://nn.labml.ai/rl/ppo/experiment.html).
The experiment uses [Generalized Advantage Estimation](https://nn.labml.ai/rl/ppo/gae.html).
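
To see the clipping in action, here is a small illustrative snippet (not from this repository) showing that once the ratio has drifted outside the clip range in the profitable direction, the gradient is cut off:

```python
import torch

eps = 0.2
log_pi = torch.tensor([0.0], requires_grad=True)   # current policy log-prob
sampled_log_pi = torch.tensor([-0.5])              # log-prob under the sampling policy
advantage = torch.tensor([1.0])

ratio = torch.exp(log_pi - sampled_log_pi)         # ~1.65, outside [0.8, 1.2]
objective = torch.min(ratio * advantage,
                      ratio.clamp(1.0 - eps, 1.0 + eps) * advantage)
objective.sum().backward()
print(log_pi.grad)                                 # tensor([0.]): no further update in this direction
```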