This is a PyTorch implementation of Proximal Policy Optimization (PPO).
PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps on a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample the data. It does so by clipping the gradient flow if the updated policy is not close to the policy used to sample the data.
You can find an experiment that uses it here. The experiment uses Generalized Advantage Estimation.
import torch

from labml_helpers.module import Module
from labml_nn.rl.ppo.gae import GAE

Here's how the PPO update rule is derived.
We want to maximize the policy reward
$$\max_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \Biggl[\sum_{t=0}^\infty \gamma^t r_t \Biggr]$$
where $r$ is the reward, $\pi$ is the policy, $\tau$ is a trajectory sampled from the policy, and $\gamma$ is the discount factor between $[0, 1]$.
So,
$$J(\pi_\theta) - J(\pi_{\theta_{OLD}}) = \mathbb{E}_{\tau \sim \pi_\theta} \Biggl[\sum_{t=0}^\infty \gamma^t A^{\pi_{\theta_{OLD}}}(s_t, a_t) \Biggr]$$
where $A^{\pi_{\theta_{OLD}}}(s_t, a_t)$ is the advantage of taking action $a_t$ in state $s_t$ under the old policy.
Define the discounted-future state distribution,
$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^\infty \gamma^t P(s_t = s \mid \pi)$$
Then,
$$J(\pi_\theta) - J(\pi_{\theta_{OLD}}) = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta} \Bigl[A^{\pi_{\theta_{OLD}}}(s, a)\Bigr]$$
Importance sampling $a$ from $\pi_{\theta_{OLD}}$,
$$J(\pi_\theta) - J(\pi_{\theta_{OLD}}) = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_{\theta_{OLD}}} \Biggl[\frac{\pi_\theta(a|s)}{\pi_{\theta_{OLD}}(a|s)} A^{\pi_{\theta_{OLD}}}(s, a)\Biggr]$$
Then we assume $d^{\pi_\theta}(s)$ and $d^{\pi_{\theta_{OLD}}}(s)$ are similar, which gives the approximation
$$J(\pi_\theta) - J(\pi_{\theta_{OLD}}) \approx \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi_{\theta_{OLD}}},\, a \sim \pi_{\theta_{OLD}}} \Biggl[\frac{\pi_\theta(a|s)}{\pi_{\theta_{OLD}}(a|s)} A^{\pi_{\theta_{OLD}}}(s, a)\Biggr]$$
The error we introduce to $J(\pi_\theta) - J(\pi_{\theta_{OLD}})$ by this assumption is bounded by the KL divergence between $\pi_\theta$ and $\pi_{\theta_{OLD}}$. Constrained Policy Optimization shows the proof of this. I haven't read it.
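The right-hand side of this approximation (up to the constant $\frac{1}{1 - \gamma}$ factor) is the surrogate objective that PPO maximizes. The clipped variant implemented by ClippedPPOLoss below replaces the raw probability ratio with a clamped one:
$$\mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t \Bigl[ \min\bigl( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\bigl(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\bigr)\,\hat{A}_t \bigr) \Bigr], \qquad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{OLD}}(a_t|s_t)}$$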
class ClippedPPOLoss(Module):
    def __init__(self):
        super().__init__()

    def __call__(self, log_pi: torch.Tensor, sampled_log_pi: torch.Tensor,
                 advantage: torch.Tensor, clip: float) -> torch.Tensor:

The ratio $r_t(\theta) = \frac{\pi_\theta (a_t|s_t)}{\pi_{\theta_{OLD}} (a_t|s_t)}$; this is different from the rewards $r_t$.

        ratio = torch.exp(log_pi - sampled_log_pi)
The ratio is clipped to be close to 1. We take the minimum so that the gradient will only pull $\pi_\theta$ towards $\pi_{\theta_{OLD}}$ if the ratio is not between $1 - \epsilon$ and $1 + \epsilon$. This keeps the KL divergence between $\pi_\theta$ and $\pi_{\theta_{OLD}}$ constrained. A large deviation can cause performance collapse, where the policy performance drops and doesn't recover because we are sampling from a bad policy.
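As a toy illustration (not part of the implementation), this is how the min/clamp combination behaves for a few hand-picked ratios with $\epsilon = 0.2$ and unit positive advantages:

```python
import torch

clip = 0.2
ratio = torch.tensor([0.5, 0.9, 1.0, 1.1, 1.5])  # pi_theta / pi_theta_OLD for five samples
advantage = torch.ones(5)                        # positive advantages

clipped_ratio = ratio.clamp(min=1.0 - clip, max=1.0 + clip)
policy_reward = torch.min(ratio * advantage, clipped_ratio * advantage)
print(policy_reward)  # tensor([0.5000, 0.9000, 1.0000, 1.1000, 1.2000])
# The reward for ratio = 1.5 is capped at 1.2; since clamp has zero gradient outside
# [1 - eps, 1 + eps], the update gains nothing by pushing the ratio further above 1 + eps.
# For ratio = 0.5 the unclipped term is the minimum, so the gradient still pulls the ratio up.
```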
Using the normalized advantage $\bar{A_t} = \frac{\hat{A_t} - \mu(\hat{A_t})}{\sigma(\hat{A_t})}$ introduces a bias to the policy gradient estimator, but it reduces variance a lot.
        clipped_ratio = ratio.clamp(min=1.0 - clip,
                                    max=1.0 + clip)
        policy_reward = torch.min(ratio * advantage,
                                  clipped_ratio * advantage)

        self.clip_fraction = (abs((ratio - 1.0)) > clip).to(torch.float).mean()

        return -policy_reward.mean()

Similarly, we clip the value function update.
Clipping makes sure the value function $V_\theta$ doesn’t deviate significantly from $V_{\theta_{OLD}}$.
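In equation form, with $R_t$ denoting the sampled return and $\epsilon$ the clip range, the loss implemented below is
$$\mathcal{L}^{VF}(\theta) = \frac{1}{2}\, \mathbb{E}_t \Bigl[ \max\Bigl( \bigl(V_\theta(s_t) - R_t\bigr)^2,\; \bigl(V_{\theta_{OLD}}(s_t) + \mathrm{clip}\bigl(V_\theta(s_t) - V_{\theta_{OLD}}(s_t),\, -\epsilon,\, \epsilon\bigr) - R_t\bigr)^2 \Bigr) \Bigr]$$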
class ClippedValueFunctionLoss(Module):
    def __call__(self, value: torch.Tensor, sampled_value: torch.Tensor, sampled_return: torch.Tensor, clip: float):
        clipped_value = sampled_value + (value - sampled_value).clamp(min=-clip, max=clip)
        vf_loss = torch.max((value - sampled_return) ** 2, (clipped_value - sampled_return) ** 2)
        return 0.5 * vf_loss.mean()
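To show how the two losses fit together, here is a minimal, self-contained toy sketch: a tiny linear actor-critic on random data, trained with several gradient steps on the same sampled batch (the multiple-updates-per-sample pattern described at the top). The network sizes, the loss coefficient, and the clip value of 0.2 are arbitrary placeholders, not the settings of the linked experiment; a real setup would typically also add an entropy bonus.

```python
import torch
from torch import nn
from torch.distributions import Categorical

# A toy "sampled" batch: random observations, actions, returns and advantages.
obs = torch.randn(64, 8)
actions = torch.randint(0, 4, (64,))
sampled_return = torch.randn(64)
advantage = torch.randn(64)
advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)  # normalized advantage

# A tiny linear actor-critic (placeholder sizes).
policy = nn.Linear(8, 4)      # logits over 4 actions
value_net = nn.Linear(8, 1)
optimizer = torch.optim.Adam(list(policy.parameters()) + list(value_net.parameters()), lr=3e-4)

ppo_loss = ClippedPPOLoss()
value_loss = ClippedValueFunctionLoss()

# Log-probabilities and values of the sampling (old) policy are frozen.
with torch.no_grad():
    sampled_log_pi = Categorical(logits=policy(obs)).log_prob(actions)
    sampled_value = value_net(obs).squeeze(-1)

# Several gradient steps on the same batch of samples.
for epoch in range(4):
    log_pi = Categorical(logits=policy(obs)).log_prob(actions)
    value = value_net(obs).squeeze(-1)
    loss = (ppo_loss(log_pi, sampled_log_pi, advantage, clip=0.2)
            + 0.5 * value_loss(value, sampled_value, sampled_return, clip=0.2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

On the first step the ratio is exactly 1, so clipping has no effect; it only starts to bite on later epochs as $\pi_\theta$ moves away from the sampling policy.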