Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
Synced 2025-08-16 02:41:38 +08:00

Commit message: made a few changes
@@ -11,9 +11,9 @@ This is a [PyTorch](https://pytorch.org) implementation of
 [Proximal Policy Optimization - PPO](https://arxiv.org/abs/1707.06347).
 
 PPO is a policy gradient method for reinforcement learning.
-Simple policy gradient methods one do a single gradient update per sample (or a set of samples).
-Doing multiple gradient steps for a singe sample causes problems
-because the policy deviates too much producing a bad policy.
+Simple policy gradient methods do a single gradient update per sample (or a set of samples).
+Doing multiple gradient steps for a single sample causes problems
+because the policy deviates too much, producing a bad policy.
 PPO lets us do multiple gradient updates per sample by trying to keep the
 policy close to the policy that was used to sample data.
 It does so by clipping gradient flow if the updated policy
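For reference, the docstring edited above describes the clipped surrogate objective. Below is a minimal, illustrative PyTorch sketch of that objective; the function name and signature are assumptions, and it is not necessarily identical to this repository's `ClippedPPOLoss`.

```python
import torch

def clipped_ppo_objective(log_pi, sampled_log_pi, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximized).

    log_pi:         log pi_theta(a|s) under the current policy
    sampled_log_pi: log pi_theta_old(a|s) under the policy that sampled the data
    advantage:      advantage estimates (e.g. from GAE)
    """
    # ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space for stability
    ratio = torch.exp(log_pi - sampled_log_pi)
    # clip the ratio to [1 - eps, 1 + eps]; no gradient flows through the
    # clipped branch once the policy has moved outside this range
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # take the element-wise minimum so the objective is a pessimistic lower bound
    return torch.min(ratio * advantage, clipped_ratio * advantage).mean()
```

Taking the minimum of the unclipped and clipped terms means that once the ratio leaves the clip range in the direction that would increase the objective, the gradient vanishes, which is what keeps the updated policy close to the policy that sampled the data.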
@@ -107,7 +107,7 @@ class ClippedPPOLoss(Module):
 
 Then we assume $d^\pi_\theta(s)$ and $d^\pi_{\theta_{OLD}}(s)$ are similar.
 The error we introduce to $J(\pi_\theta) - J(\pi_{\theta_{OLD}})$
-by this assumtion is bound by the KL divergence between
+by this assumption is bound by the KL divergence between
 $\pi_\theta$ and $\pi_{\theta_{OLD}}$.
 [Constrained Policy Optimization](https://arxiv.org/abs/1705.10528)
 shows the proof of this. I haven't read it.
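As an aside, a common way to keep an eye on this bound during training is to estimate the KL divergence between $\pi_{\theta_{OLD}}$ and $\pi_\theta$ on the sampled actions. A minimal sketch (illustrative only, not taken from this repository; the names are assumptions):

```python
import torch

def approx_kl_divergence(sampled_log_pi, log_pi):
    """Monte-Carlo estimate of KL(pi_theta_old || pi_theta) over sampled actions.

    sampled_log_pi: log pi_theta_old(a|s) recorded when the data was sampled
    log_pi:         log pi_theta(a|s) under the current (updated) policy
    """
    # E_{a ~ pi_theta_old}[log pi_theta_old(a|s) - log pi_theta(a|s)]
    return (sampled_log_pi - log_pi).mean()
```

If this estimate grows large during the inner optimization epochs, the updated policy has drifted far from the sampling policy and the error bound above is no longer tight; many PPO implementations stop the epoch early in that case.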