Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git
Synced 2025-08-14 01:13:00 +08:00

Commit: capsnet readme
@@ -123,7 +123,7 @@
 \hat{A_t^{(\infty)}} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... - V(s)
 \end{align}</script>
 </p>
-<p>$\hat{A_t^{(1)}}$ is high bias, low variance whilst
+<p>$\hat{A_t^{(1)}}$ is high bias, low variance, whilst
 $\hat{A_t^{(\infty)}}$ is unbiased, high variance.</p>
 <p>We take a weighted average of $\hat{A_t^{(k)}}$ to balance bias and variance.
 This is called Generalized Advantage Estimation.
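The weighted average this hunk describes telescopes into a backward recursion: with the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the estimate is $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. Below is a minimal sketch of that recursion; the name compute_gae and the single, non-terminating trajectory (no done masks) are illustrative assumptions, not the repository's actual interface.

import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # values[t] = V(s_t); last_value = V(s_T) bootstraps the final step.
    advantages = np.zeros(len(rewards), dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors:
        # A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages

Setting lam=0 recovers the one-step estimate $\hat{A_t^{(1)}}$ (high bias, low variance); lam=1 recovers $\hat{A_t^{(\infty)}}$ (unbiased, high variance).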
@@ -76,9 +76,9 @@
 <p>This is a <a href="https://pytorch.org">PyTorch</a> implementation of
 <a href="https://arxiv.org/abs/1707.06347">Proximal Policy Optimization - PPO</a>.</p>
 <p>PPO is a policy gradient method for reinforcement learning.
-Simple policy gradient methods one do a single gradient update per sample (or a set of samples).
-Doing multiple gradient steps for a singe sample causes problems
-because the policy deviates too much producing a bad policy.
+Simple policy gradient methods do a single gradient update per sample (or a set of samples).
+Doing multiple gradient steps for a single sample causes problems
+because the policy deviates too much, producing a bad policy.
 PPO lets us do multiple gradient updates per sample by trying to keep the
 policy close to the policy that was used to sample data.
 It does so by clipping gradient flow if the updated policy
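A minimal sketch of the clipped surrogate loss this paragraph describes, assuming log-probabilities of the sampled actions under the current and sampling policies are available as tensors; ppo_clip_loss and its arguments are hypothetical names, not the repository's interface.

import torch

def ppo_clip_loss(log_pi, sampled_log_pi, advantage, clip_eps=0.2):
    # Ratio between the current policy and the policy that sampled the data:
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(log_pi - sampled_log_pi)
    # Taking the min with the clipped ratio removes the incentive (and the
    # gradient) for pushing the ratio outside [1 - eps, 1 + eps], which is
    # what keeps the updated policy close to the sampling policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    objective = torch.min(ratio * advantage, clipped * advantage)
    # Maximize the surrogate objective = minimize its negative mean.
    return -objective.mean()

Because the clipping bounds how far each step can move the policy, this loss can be minimized for several epochs over the same batch of trajectories.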
@@ -172,7 +172,7 @@ J(\pi_\theta) - J(\pi_{\theta_{OLD}})
 </p>
 <p>Then we assume $d^\pi_\theta(s)$ and $d^\pi_{\theta_{OLD}}(s)$ are similar.
 The error we introduce to $J(\pi_\theta) - J(\pi_{\theta_{OLD}})$
-by this assumtion is bound by the KL divergence between
+by this assumption is bound by the KL divergence between
 $\pi_\theta$ and $\pi_{\theta_{OLD}}$.
 <a href="https://arxiv.org/abs/1705.10528">Constrained Policy Optimization</a>
 shows the proof of this. I haven’t read it.</p>
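Since the error bound depends on the KL divergence between $\pi_\theta$ and $\pi_{\theta_{OLD}}$, a common practical safeguard (an assumption here, not something this commit implements) is to estimate that divergence from the sampled log-probabilities and stop updating when it grows too large:

import torch

def kl_too_large(log_pi, sampled_log_pi, max_kl=0.015):
    # Sample estimate of KL(pi_old || pi_theta):
    # E_{a ~ pi_old}[log pi_old(a|s) - log pi_theta(a|s)]
    approx_kl = (sampled_log_pi - log_pi).mean()
    # If the policies have drifted apart, the similarity assumption on the
    # state distributions no longer holds, so further updates are unsafe.
    # The threshold max_kl is a hypothetical tuning choice.
    return approx_kl.item() > max_kl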