Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git, synced 2025-11-03 05:46:16 +08:00
paper url fix
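The diff below applies one mechanical change to the DQN translation files (Japanese, Sinhala, and Chinese): every https://papers.labml.ai/paper/<id> link is replaced by the matching https://arxiv.org/abs/<id> link, both in the English source strings used as JSON keys and in the translated values. A minimal sketch of how such a substitution could be scripted is shown here; the directory name and the script itself are illustrative assumptions, not part of this commit.

# Sketch only: rewrite papers.labml.ai paper links to their arxiv.org
# equivalents inside the translation JSON files. The "translate_cache"
# path is an assumption for illustration, not taken from the repository.
import re
from pathlib import Path

PAPER_LINK = re.compile(r"https://papers\.labml\.ai/paper/([\d.]+)")

for path in Path("translate_cache").rglob("*.json"):
    text = path.read_text(encoding="utf-8")
    fixed = PAPER_LINK.sub(r"https://arxiv.org/abs/\1", text)
    if fixed != text:
        path.write_text(fixed, encoding="utf-8")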
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://papers.labml.ai/paper/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"model.html\">\u3001<a href=\"replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\"><a href=\"model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://papers.labml.ai/paper/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</h2>\n<p>\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u3092\u898b\u3064\u3051\u305f\u3044\u3002</p>\n<span translate=no>_^_0_^_</span><h3>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \ud83c\udfaf</h3>\n<p>\u5b89\u5b9a\u6027\u3092\u5411\u4e0a\u3055\u305b\u308b\u305f\u3081\u306b\u3001\u4ee5\u524d\u306e\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u304b\u3089\u30e9\u30f3\u30c0\u30e0\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u308b\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u306e\u30ea\u30d7\u30ec\u30a4\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u307e\u305f\u3001<span translate=no>_^_2_^_</span>\u5225\u306e\u30d1\u30e9\u30e1\u30fc\u30bf\u30bb\u30c3\u30c8\u3092\u6301\u3064Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u30bf\u30fc\u30b2\u30c3\u30c8\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_3_^_</span>\u5b9a\u671f\u7684\u306b\u66f4\u65b0\u3055\u308c\u307e\u3059\u3002\u3053\u308c\u306f\u3001<a href=\"https://deepmind.com/research/dqn/\">\u6df1\u5c64\u5f37\u5316\u5b66\u7fd2\u306b\u3088\u308b\u30d2\u30e5\u30fc\u30de\u30f3\u30ec\u30d9\u30eb\u5236\u5fa1\u306e\u8ad6\u6587\u306b\u3088\u308b\u3082\u306e\u3067\u3059</a></p>\u3002\n<p>\u3057\u305f\u304c\u3063\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u3001<span translate=no>_^_4_^_</span></p>\n<h3><span translate=no>_^_5_^_</span>\u30c0\u30d6\u30eb\u30e9\u30fc\u30cb\u30f3\u30b0</h3>\n<p>\u4e0a\u306e\u8a08\u7b97\u306e max \u6f14\u7b97\u5b50\u306f\u3001\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u306e\u9078\u629e\u3068\u5024\u306e\u8a55\u4fa1\u306e\u4e21\u65b9\u306b\u540c\u3058\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3064\u307e\u308a<span translate=no>_^_6_^_</span>\u3001<a href=\"https://papers.labml.ai/paper/1509.06461\"><span translate=no>_^_7_^_</span><span translate=no>_^_8_^_</span>\u306e\u53d6\u5f97\u5143\u3068\u5024\u306e\u53d6\u5f97\u5143\u3068\u3044\u3046\u4e8c\u91cdQ\u30e9\u30fc\u30cb\u30f3\u30b0\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002<span 
translate=no>_^_9_^_</span>\n<p>\u305d\u3057\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u6b21\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\u3002</p>\n<span translate=no>_^_10_^_</span>",
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://arxiv.org/abs/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"model.html\">\u3001<a href=\"replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\"><a href=\"model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://arxiv.org/abs/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</h2>\n<p>\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u3092\u898b\u3064\u3051\u305f\u3044\u3002</p>\n<span translate=no>_^_0_^_</span><h3>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \ud83c\udfaf</h3>\n<p>\u5b89\u5b9a\u6027\u3092\u5411\u4e0a\u3055\u305b\u308b\u305f\u3081\u306b\u3001\u4ee5\u524d\u306e\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u304b\u3089\u30e9\u30f3\u30c0\u30e0\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u308b\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u306e\u30ea\u30d7\u30ec\u30a4\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u307e\u305f\u3001<span translate=no>_^_2_^_</span>\u5225\u306e\u30d1\u30e9\u30e1\u30fc\u30bf\u30bb\u30c3\u30c8\u3092\u6301\u3064Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u30bf\u30fc\u30b2\u30c3\u30c8\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_3_^_</span>\u5b9a\u671f\u7684\u306b\u66f4\u65b0\u3055\u308c\u307e\u3059\u3002\u3053\u308c\u306f\u3001<a href=\"https://deepmind.com/research/dqn/\">\u6df1\u5c64\u5f37\u5316\u5b66\u7fd2\u306b\u3088\u308b\u30d2\u30e5\u30fc\u30de\u30f3\u30ec\u30d9\u30eb\u5236\u5fa1\u306e\u8ad6\u6587\u306b\u3088\u308b\u3082\u306e\u3067\u3059</a></p>\u3002\n<p>\u3057\u305f\u304c\u3063\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u3001<span translate=no>_^_4_^_</span></p>\n<h3><span translate=no>_^_5_^_</span>\u30c0\u30d6\u30eb\u30e9\u30fc\u30cb\u30f3\u30b0</h3>\n<p>\u4e0a\u306e\u8a08\u7b97\u306e max \u6f14\u7b97\u5b50\u306f\u3001\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u306e\u9078\u629e\u3068\u5024\u306e\u8a55\u4fa1\u306e\u4e21\u65b9\u306b\u540c\u3058\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3064\u307e\u308a<span translate=no>_^_6_^_</span>\u3001<a href=\"https://arxiv.org/abs/1509.06461\"><span translate=no>_^_7_^_</span><span translate=no>_^_8_^_</span>\u306e\u53d6\u5f97\u5143\u3068\u5024\u306e\u53d6\u5f97\u5143\u3068\u3044\u3046\u4e8c\u91cdQ\u30e9\u30fc\u30cb\u30f3\u30b0\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002<span 
translate=no>_^_9_^_</span>\n<p>\u305d\u3057\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u6b21\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\u3002</p>\n<span translate=no>_^_10_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Calculate the desired Q value. We multiply by <span translate=no>_^_0_^_</span> to zero out the next state Q values if the game ended.</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u76ee\u7684\u306e Q \u5024\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002\u30b2\u30fc\u30e0\u304c\u7d42\u4e86\u3057\u305f\u3089<span translate=no>_^_0_^_</span>\u3001\u3092\u639b\u3051\u3066\u6b21\u306e\u30b9\u30c6\u30fc\u30c8\u306eQ\u5024\u3092\u30bc\u30ed\u306b\u3057\u307e\u3059</p>\u3002\n<p><span translate=no>_^_1_^_</span></p>\n",
"<p>Get the best action at state <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5dde\u3067\u6700\u9ad8\u306e\u30a2\u30af\u30b7\u30e7\u30f3\u3092 <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span></p>\n",
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4Q \u0da2\u0dcf\u0dbd (DQN)</h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2 <a href=\"https://papers.labml.ai/paper/1312.5602\">Atari \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 \u0dc3\u0dd9\u0dbd\u0dca\u0dbd\u0db8\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0dc3\u0dc4 <a href=\"model.html\">\u0da9\u0dd4\u0dc0\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba</a> , <a href=\"replay_buffer.html\">\u0db4\u0dca\u0dbb\u0db8\u0dd4\u0d9b\u0dad\u0dcf \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba</a> \u0dc3\u0dc4 \u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q \u0da2\u0dcf\u0dbd\u0dba \u0dc3\u0db8\u0d9f. </p>\n<p>\u0db8\u0dd9\u0db1\u0dca\u0db1 <a href=\"experiment.html\">\u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf</a> \u0db6\u0dd0\u0dbd\u0dd3\u0db8 \u0dc3\u0dc4 <a href=\"model.html\">\u0d86\u0daf\u0dbb\u0dca\u0dc1</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8. </p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"> <span translate=no>_^_1_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://papers.labml.ai/paper/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba \u0db4\u0dd4\u0dc4\u0dd4\u0dab\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1</h2>\n<p>\u0db4\u0dca\u0dbb\u0dc1\u0dc3\u0dca\u0dad \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0d9a\u0dcf\u0dbb\u0dd3 \u0d85\u0d9c\u0dba \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0dc3\u0ddc\u0dba\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8\u0da7 \u0d85\u0db4\u0da7 \u0d85\u0dc0\u0dc1\u0dca\u0dba\u0dba.</p>\n<span translate=no>_^_0_^_</span><h3>\u0d89\u0dbd\u0d9a\u0dca\u0d9a \u0da2\u0dcf\u0dbd\u0dba \ud83c\udfaf</h3>\n<p>\u0dc3\u0dca\u0dae\u0dcf\u0dc0\u0dbb\u0dad\u0dca\u0dc0\u0dba \u0dc0\u0dd0\u0da9\u0dd2 \u0daf\u0dd2\u0dba\u0dd4\u0dab\u0dd4 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d85\u0db4\u0dd2 \u0db4\u0dd9\u0dbb \u0d85\u0dad\u0dca\u0daf\u0dd0\u0d9a\u0dd3\u0db8\u0dca \u0dc0\u0dbd\u0dd2\u0db1\u0dca \u0d85\u0dc4\u0db9\u0dd4 \u0dbd\u0dd9\u0dc3 \u0db1\u0dd2\u0dba\u0dd0\u0daf\u0dd2\u0dba \u0d85\u0dad\u0dca\u0daf\u0dd0\u0d9a\u0dd3\u0db8\u0dca \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4<span translate=no>_^_1_^_</span>. \u0d89\u0dbd\u0d9a\u0dca\u0d9a\u0dba \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dd9\u0db1\u0db8 \u0db4\u0dbb\u0dcf\u0db8\u0dd2\u0dad\u0dd3\u0db1\u0dca<span translate=no>_^_2_^_</span> \u0dc3\u0db8\u0dd6\u0dc4\u0dba\u0d9a\u0dca \u0dc3\u0dc4\u0dd2\u0dad Q \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca \u0daf \u0d85\u0db4\u0dd2 \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4. <span translate=no>_^_3_^_</span>\u0dc0\u0dbb\u0dd2\u0db1\u0dca \u0dc0\u0dbb \u0dba\u0dcf\u0dc0\u0dad\u0dca\u0d9a\u0dcf\u0dbd\u0dd3\u0db1 \u0dc0\u0dda. 
\u0db8\u0dd9\u0dba \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8 \u0dad\u0dd4\u0dc5\u0dd2\u0db1\u0dca \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://deepmind.com/research/dqn/\">\u0db8\u0dcf\u0db1\u0dc0 \u0db8\u0da7\u0dca\u0da7\u0db8\u0dca \u0db4\u0dcf\u0dbd\u0db1\u0dba\u0da7 \u0d85\u0db1\u0dd4\u0dc0</a> \u0dba.</p>\n<p>\u0d91\u0db6\u0dd0\u0dc0\u0dd2\u0db1\u0dca \u0db4\u0dcf\u0da9\u0dd4 \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0dc0\u0db1\u0dca\u0db1\u0dda,<span translate=no>_^_4_^_</span></p>\n<h3><span translate=no>_^_5_^_</span>\u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0-\u0d89\u0d9c\u0dd9\u0db1\u0dd4\u0db8\u0dca</h3>\n<p>\u0d89\u0dc4\u0dad \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d8b\u0db4\u0dbb\u0dd2\u0db8 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0d9a\u0dbb\u0dd4 \u0dc4\u0ddc\u0db3\u0db8 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dad\u0ddd\u0dbb\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8 \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8 \u0d87\u0d9c\u0dba\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d91\u0d9a\u0db8 \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0dba\u0dd2. \u0d91\u0db1\u0db8\u0dca,<span translate=no>_^_6_^_</span> \u0d85\u0db4\u0dd2 <a href=\"https://papers.labml.ai/paper/1509.06461\">\u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q- \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4,<span translate=no>_^_7_^_</span> \u0d91\u0dba \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1\u0dda \u0d9a\u0ddc\u0dad\u0dd0\u0db1\u0dd2\u0db1\u0dca\u0daf<span translate=no>_^_8_^_</span> \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8 \u0dbd\u0db6\u0dcf<span translate=no>_^_9_^_</span> \u0d9c\u0db1\u0dd3.</p>\n<p>\u0db4\u0dcf\u0da9\u0dd4 \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0db6\u0dc0\u0da7 \u0db4\u0dad\u0dca\u0dc0\u0dda,</p>\n<span translate=no>_^_10_^_</span>",
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4Q \u0da2\u0dcf\u0dbd (DQN)</h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2 <a href=\"https://arxiv.org/abs/1312.5602\">Atari \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 \u0dc3\u0dd9\u0dbd\u0dca\u0dbd\u0db8\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0dc3\u0dc4 <a href=\"model.html\">\u0da9\u0dd4\u0dc0\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba</a> , <a href=\"replay_buffer.html\">\u0db4\u0dca\u0dbb\u0db8\u0dd4\u0d9b\u0dad\u0dcf \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba</a> \u0dc3\u0dc4 \u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q \u0da2\u0dcf\u0dbd\u0dba \u0dc3\u0db8\u0d9f. </p>\n<p>\u0db8\u0dd9\u0db1\u0dca\u0db1 <a href=\"experiment.html\">\u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf</a> \u0db6\u0dd0\u0dbd\u0dd3\u0db8 \u0dc3\u0dc4 <a href=\"model.html\">\u0d86\u0daf\u0dbb\u0dca\u0dc1</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8. </p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"> <span translate=no>_^_1_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://arxiv.org/abs/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba \u0db4\u0dd4\u0dc4\u0dd4\u0dab\u0dd4 \u0d9a\u0dbb\u0db1\u0dca\u0db1</h2>\n<p>\u0db4\u0dca\u0dbb\u0dc1\u0dc3\u0dca\u0dad \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0d9a\u0dcf\u0dbb\u0dd3 \u0d85\u0d9c\u0dba \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0dc3\u0ddc\u0dba\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8\u0da7 \u0d85\u0db4\u0da7 \u0d85\u0dc0\u0dc1\u0dca\u0dba\u0dba.</p>\n<span translate=no>_^_0_^_</span><h3>\u0d89\u0dbd\u0d9a\u0dca\u0d9a \u0da2\u0dcf\u0dbd\u0dba \ud83c\udfaf</h3>\n<p>\u0dc3\u0dca\u0dae\u0dcf\u0dc0\u0dbb\u0dad\u0dca\u0dc0\u0dba \u0dc0\u0dd0\u0da9\u0dd2 \u0daf\u0dd2\u0dba\u0dd4\u0dab\u0dd4 \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d85\u0db4\u0dd2 \u0db4\u0dd9\u0dbb \u0d85\u0dad\u0dca\u0daf\u0dd0\u0d9a\u0dd3\u0db8\u0dca \u0dc0\u0dbd\u0dd2\u0db1\u0dca \u0d85\u0dc4\u0db9\u0dd4 \u0dbd\u0dd9\u0dc3 \u0db1\u0dd2\u0dba\u0dd0\u0daf\u0dd2\u0dba \u0d85\u0dad\u0dca\u0daf\u0dd0\u0d9a\u0dd3\u0db8\u0dca \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4<span translate=no>_^_1_^_</span>. \u0d89\u0dbd\u0d9a\u0dca\u0d9a\u0dba \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0dc0\u0dd9\u0db1\u0db8 \u0db4\u0dbb\u0dcf\u0db8\u0dd2\u0dad\u0dd3\u0db1\u0dca<span translate=no>_^_2_^_</span> \u0dc3\u0db8\u0dd6\u0dc4\u0dba\u0d9a\u0dca \u0dc3\u0dc4\u0dd2\u0dad Q \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca \u0daf \u0d85\u0db4\u0dd2 \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4. <span translate=no>_^_3_^_</span>\u0dc0\u0dbb\u0dd2\u0db1\u0dca \u0dc0\u0dbb \u0dba\u0dcf\u0dc0\u0dad\u0dca\u0d9a\u0dcf\u0dbd\u0dd3\u0db1 \u0dc0\u0dda. 
\u0db8\u0dd9\u0dba \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8 \u0dad\u0dd4\u0dc5\u0dd2\u0db1\u0dca \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://deepmind.com/research/dqn/\">\u0db8\u0dcf\u0db1\u0dc0 \u0db8\u0da7\u0dca\u0da7\u0db8\u0dca \u0db4\u0dcf\u0dbd\u0db1\u0dba\u0da7 \u0d85\u0db1\u0dd4\u0dc0</a> \u0dba.</p>\n<p>\u0d91\u0db6\u0dd0\u0dc0\u0dd2\u0db1\u0dca \u0db4\u0dcf\u0da9\u0dd4 \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0dc0\u0db1\u0dca\u0db1\u0dda,<span translate=no>_^_4_^_</span></p>\n<h3><span translate=no>_^_5_^_</span>\u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0-\u0d89\u0d9c\u0dd9\u0db1\u0dd4\u0db8\u0dca</h3>\n<p>\u0d89\u0dc4\u0dad \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d8b\u0db4\u0dbb\u0dd2\u0db8 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0d9a\u0dbb\u0dd4 \u0dc4\u0ddc\u0db3\u0db8 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dad\u0ddd\u0dbb\u0dcf \u0d9c\u0dd0\u0db1\u0dd3\u0db8 \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8 \u0d87\u0d9c\u0dba\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d91\u0d9a\u0db8 \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0dba\u0dd2. \u0d91\u0db1\u0db8\u0dca,<span translate=no>_^_6_^_</span> \u0d85\u0db4\u0dd2 <a href=\"https://arxiv.org/abs/1509.06461\">\u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q- \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4,<span translate=no>_^_7_^_</span> \u0d91\u0dba \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1\u0dda \u0d9a\u0ddc\u0dad\u0dd0\u0db1\u0dd2\u0db1\u0dca\u0daf<span translate=no>_^_8_^_</span> \u0dc3\u0dc4 \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8 \u0dbd\u0db6\u0dcf<span translate=no>_^_9_^_</span> \u0d9c\u0db1\u0dd3.</p>\n<p>\u0db4\u0dcf\u0da9\u0dd4 \u0dc1\u0dca\u0dbb\u0dd2\u0dad\u0dba \u0db6\u0dc0\u0da7 \u0db4\u0dad\u0dca\u0dc0\u0dda,</p>\n<span translate=no>_^_10_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> </p>\n",
"<p>Calculate the desired Q value. We multiply by <span translate=no>_^_0_^_</span> to zero out the next state Q values if the game ended.</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u0d85\u0db4\u0dda\u0d9a\u0dca\u0dc2\u0dd2\u0dadQ \u0d85\u0d9c\u0dba \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1. \u0d9a\u0dca\u0dbb\u0dd3\u0da9\u0dcf\u0dc0 \u0d85\u0dc0\u0dc3\u0db1\u0dca \u0dc0\u0dd6\u0dba\u0dda \u0db1\u0db8\u0dca \u0d8a\u0dc5\u0d9f \u0dbb\u0dcf\u0da2\u0dca\u0dba Q \u0d85\u0d9c\u0dba\u0db1\u0dca \u0dc1\u0dd4\u0db1\u0dca\u0dba <span translate=no>_^_0_^_</span> \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d85\u0db4\u0dd2 \u0d9c\u0dd4\u0dab \u0d9a\u0dbb\u0db8\u0dd4. </p>\n<p><span translate=no>_^_1_^_</span> </p>\n",
"<p>Get the best action at state <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>\u0dbb\u0dcf\u0da2\u0dca\u0dba\u0dba\u0dd9\u0db1\u0dca\u0dc4\u0ddc\u0db3\u0db8 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dca\u0db1 <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n",
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u6df1\u5ea6 Q \u7f51\u7edc (DQN)</h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u7684 PyTorch <a href=\"https://papers.labml.ai/paper/1312.5602\">\u4f7f\u7528\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u73a9\u96c5</a>\u8fbe\u5229\u4ee5\u53ca<a href=\"model.html\">\u51b3\u6597\u7f51\u7edc</a>\u3001<a href=\"replay_buffer.html\">\u4f18\u5148\u56de\u653e</a>\u548c Double Q Network\u3002</p>\n<p>\u8fd9\u662f<a href=\"experiment.html\">\u5b9e\u9a8c</a>\u548c<a href=\"model.html\">\u6a21\u578b</a>\u5b9e\u73b0\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://papers.labml.ai/paper/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u8bad\u7ec3\u6a21\u578b</h2>\n<p>\u6211\u4eec\u60f3\u627e\u5230\u6700\u4f73\u7684\u52a8\u4f5c\u503c\u51fd\u6570\u3002</p>\n<span translate=no>_^_0_^_</span><h3>\u76ee\u6807\u7f51\u7edc \ud83c\udfaf</h3>\n<p>\u4e3a\u4e86\u63d0\u9ad8\u7a33\u5b9a\u6027\uff0c\u6211\u4eec\u4f7f\u7528\u7ecf\u9a8c\u56de\u653e\uff0c\u4ece\u4ee5\u524d\u7684\u7ecf\u9a8c\u4e2d\u968f\u673a\u62bd\u6837<span translate=no>_^_1_^_</span>\u3002\u6211\u4eec\u8fd8\u4f7f\u7528\u5177\u6709\u4e00\u7ec4\u5355\u72ec\u53c2\u6570\u7684 Q \u7f51\u7edc<span translate=no>_^_2_^_</span>\u6765\u8ba1\u7b97\u76ee\u6807\u3002<span translate=no>_^_3_^_</span>\u5b9a\u671f\u66f4\u65b0\u3002\u8fd9\u662f\u6839\u636e\u8bba\u6587\u300a\u901a\u8fc7\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u8fdb\u884c<a href=\"https://deepmind.com/research/dqn/\">\u4eba\u4f53\u6c34\u5e73\u63a7\u5236</a>\u300b\u5f97\u51fa\u7684\u3002</p>\n<p>\u6240\u4ee5\u635f\u5931\u51fd\u6570\u662f\uff0c<span translate=no>_^_4_^_</span></p>\n<h3>\u53cc<span translate=no>_^_5_^_</span>\u91cd\u5b66\u4e60</h3>\n<p>\u4e0a\u8ff0\u8ba1\u7b97\u4e2d\u7684\u6700\u5927\u503c\u8fd0\u7b97\u7b26\u4f7f\u7528\u76f8\u540c\u7684\u7f51\u7edc\u6765\u9009\u62e9\u6700\u4f73\u52a8\u4f5c\u548c\u8bc4\u4f30\u503c\u3002\u4e5f\u5c31\u662f\u8bf4\uff0c<span translate=no>_^_6_^_</span>\u6211\u4eec\u4f7f\u7528<a href=\"https://papers.labml.ai/paper/1509.06461\">\u53cc\u91cdQ-L</a><span translate=no>_^_7_^_</span> earning<span translate=no>_^_8_^_</span>\uff0c\u5176\u4e2d\u53d6\u81ea\u503c\uff0c\u53d6\u81ea\u503c<span translate=no>_^_9_^_</span>\u3002</p>\n<p>\u635f\u5931\u51fd\u6570\u53d8\u6210\uff0c</p>\n<span translate=no>_^_10_^_</span>",
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u6df1\u5ea6 Q \u7f51\u7edc (DQN)</h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u7684 PyTorch <a href=\"https://arxiv.org/abs/1312.5602\">\u4f7f\u7528\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u73a9\u96c5</a>\u8fbe\u5229\u4ee5\u53ca<a href=\"model.html\">\u51b3\u6597\u7f51\u7edc</a>\u3001<a href=\"replay_buffer.html\">\u4f18\u5148\u56de\u653e</a>\u548c Double Q Network\u3002</p>\n<p>\u8fd9\u662f<a href=\"experiment.html\">\u5b9e\u9a8c</a>\u548c<a href=\"model.html\">\u6a21\u578b</a>\u5b9e\u73b0\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://arxiv.org/abs/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u8bad\u7ec3\u6a21\u578b</h2>\n<p>\u6211\u4eec\u60f3\u627e\u5230\u6700\u4f73\u7684\u52a8\u4f5c\u503c\u51fd\u6570\u3002</p>\n<span translate=no>_^_0_^_</span><h3>\u76ee\u6807\u7f51\u7edc \ud83c\udfaf</h3>\n<p>\u4e3a\u4e86\u63d0\u9ad8\u7a33\u5b9a\u6027\uff0c\u6211\u4eec\u4f7f\u7528\u7ecf\u9a8c\u56de\u653e\uff0c\u4ece\u4ee5\u524d\u7684\u7ecf\u9a8c\u4e2d\u968f\u673a\u62bd\u6837<span translate=no>_^_1_^_</span>\u3002\u6211\u4eec\u8fd8\u4f7f\u7528\u5177\u6709\u4e00\u7ec4\u5355\u72ec\u53c2\u6570\u7684 Q \u7f51\u7edc<span translate=no>_^_2_^_</span>\u6765\u8ba1\u7b97\u76ee\u6807\u3002<span translate=no>_^_3_^_</span>\u5b9a\u671f\u66f4\u65b0\u3002\u8fd9\u662f\u6839\u636e\u8bba\u6587\u300a\u901a\u8fc7\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u8fdb\u884c<a href=\"https://deepmind.com/research/dqn/\">\u4eba\u4f53\u6c34\u5e73\u63a7\u5236</a>\u300b\u5f97\u51fa\u7684\u3002</p>\n<p>\u6240\u4ee5\u635f\u5931\u51fd\u6570\u662f\uff0c<span translate=no>_^_4_^_</span></p>\n<h3>\u53cc<span translate=no>_^_5_^_</span>\u91cd\u5b66\u4e60</h3>\n<p>\u4e0a\u8ff0\u8ba1\u7b97\u4e2d\u7684\u6700\u5927\u503c\u8fd0\u7b97\u7b26\u4f7f\u7528\u76f8\u540c\u7684\u7f51\u7edc\u6765\u9009\u62e9\u6700\u4f73\u52a8\u4f5c\u548c\u8bc4\u4f30\u503c\u3002\u4e5f\u5c31\u662f\u8bf4\uff0c<span translate=no>_^_6_^_</span>\u6211\u4eec\u4f7f\u7528<a href=\"https://arxiv.org/abs/1509.06461\">\u53cc\u91cdQ-L</a><span translate=no>_^_7_^_</span> earning<span translate=no>_^_8_^_</span>\uff0c\u5176\u4e2d\u53d6\u81ea\u503c\uff0c\u53d6\u81ea\u503c<span translate=no>_^_9_^_</span>\u3002</p>\n<p>\u635f\u5931\u51fd\u6570\u53d8\u6210\uff0c</p>\n<span translate=no>_^_10_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Calculate the desired Q value. We multiply by <span translate=no>_^_0_^_</span> to zero out the next state Q values if the game ended.</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u8ba1\u7b97\u6240\u9700\u7684 Q \u503c\u3002\u5982\u679c\u6e38\u620f\u7ed3\u675f\uff0c\u6211\u4eec\u5c06\u4e58<span translate=no>_^_0_^_</span>\u4ee5\u5c06\u4e0b\u4e00\u4e2a\u72b6\u6001 Q \u503c\u5f52\u96f6\u3002</p>\n<p><span translate=no>_^_1_^_</span></p>\n",
"<p>Get the best action at state <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5728\u5dde\u5185\u91c7\u53d6\u6700\u4f73\u884c\u52a8<span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span></p>\n",
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Network (DQN) Model</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN) \u30e2\u30c7\u30eb</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://papers.labml.ai/paper/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \u2694\ufe0f \u4fa1\u5024\u30e2\u30c7\u30eb <span translate=no>_^_0_^_</span></h2>\n<p><a href=\"https://papers.labml.ai/paper/1511.06581\">Q\u5024\u306e\u8a08\u7b97\u306b\u306f\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a>\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3\u306e\u80cc\u5f8c\u306b\u3042\u308b\u76f4\u611f\u306f\u3001\u307b\u3068\u3093\u3069\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u306f\u91cd\u8981\u3067\u306f\u306a\u304f\u3001\u4e00\u90e8\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u304c\u91cd\u8981\u3067\u3042\u308b\u3068\u3044\u3046\u3053\u3068\u3067\u3059\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3067\u306f\u3001\u3053\u308c\u3092\u975e\u5e38\u306b\u3088\u304f\u8868\u73fe\u3067\u304d\u307e\u3059</p>\u3002\n<span translate=no>_^_1_^_</span><p>\u305d\u3053\u3067\u3001<span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u3068\u304b\u3089\u306e 2 \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f5c\u6210\u3057\u3066\u3001\u305d\u306e 2 <span translate=no>_^_4_^_</span> \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u304b\u3089\u53d6\u5f97\u3057\u307e\u3059\u3002<span translate=no>_^_5_^_</span><span translate=no>_^_6_^_</span><span translate=no>_^_7_^_</span>\u3068\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u306e\u521d\u671f\u30ec\u30a4\u30e4\u30fc\u3092\u5171\u6709\u3057\u307e\u3059\u3002</p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://arxiv.org/abs/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \u2694\ufe0f \u4fa1\u5024\u30e2\u30c7\u30eb <span translate=no>_^_0_^_</span></h2>\n<p><a href=\"https://arxiv.org/abs/1511.06581\">Q\u5024\u306e\u8a08\u7b97\u306b\u306f\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a>\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3\u306e\u80cc\u5f8c\u306b\u3042\u308b\u76f4\u611f\u306f\u3001\u307b\u3068\u3093\u3069\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u306f\u91cd\u8981\u3067\u306f\u306a\u304f\u3001\u4e00\u90e8\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u304c\u91cd\u8981\u3067\u3042\u308b\u3068\u3044\u3046\u3053\u3068\u3067\u3059\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3067\u306f\u3001\u3053\u308c\u3092\u975e\u5e38\u306b\u3088\u304f\u8868\u73fe\u3067\u304d\u307e\u3059</p>\u3002\n<span translate=no>_^_1_^_</span><p>\u305d\u3053\u3067\u3001<span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u3068\u304b\u3089\u306e 2 \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f5c\u6210\u3057\u3066\u3001\u305d\u306e 2 <span translate=no>_^_4_^_</span> \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u304b\u3089\u53d6\u5f97\u3057\u307e\u3059\u3002<span translate=no>_^_5_^_</span><span translate=no>_^_6_^_</span><span translate=no>_^_7_^_</span>\u3068\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u306e\u521d\u671f\u30ec\u30a4\u30e4\u30fc\u3092\u5171\u6709\u3057\u307e\u3059\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u5b8c\u5168\u306b\u63a5\u7d9a\u3055\u308c\u305f\u30ec\u30a4\u30e4\u30fc\u306f\u30013 \u756a\u76ee\u306e\u30b3\u30f3\u30dc\u30ea\u30e5\u30fc\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc\u304b\u3089\u30d5\u30e9\u30c3\u30c8\u5316\u3055\u308c\u305f\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u308a\u51fa\u3057\u3001\u30d5\u30a3\u30fc\u30c1\u30e3\u3092\u51fa\u529b\u3057\u307e\u3059\u3002<span translate=no>_^_0_^_</span></p>\n",
"<p>Convolution </p>\n": "<p>\u30b3\u30f3\u30dc\u30ea\u30e5\u30fc\u30b7\u30e7\u30f3</p>\n",
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Network (DQN) Model</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"><span translate=no>_^_1_^_</span></a></p>\n": "<h1>\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4Q \u0da2\u0dcf\u0dbd (DQN) \u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"> <span translate=no>_^_1_^_</span></a></p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://papers.labml.ai/paper/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u0da2\u0dcf\u0dbd\u0dbaDueling \u2694\ufe0f <span translate=no>_^_0_^_</span> \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8\u0dca \u0dc3\u0db3\u0dc4\u0dcf \u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba</h2>\n<p>Q-\u0d85\u0d9c\u0dba\u0db1\u0dca\u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d85\u0db4\u0dd2 <a href=\"https://papers.labml.ai/paper/1511.06581\">\u0da9\u0dd6\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4. \u0da2\u0dcf\u0dbd \u0d9c\u0dd8\u0dc4 \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab \u0dc1\u0dd2\u0dbd\u0dca\u0db4\u0dba dueling \u0db4\u0dd2\u0da7\u0dd4\u0db4\u0dc3 \u0d87\u0dad\u0dd2 \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0db7\u0dcf\u0db1\u0dba \u0db1\u0db8\u0dca, \u0db6\u0ddc\u0dc4\u0ddd \u0db4\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dad\u0dc0\u0dbd \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dc0\u0dd0\u0daf\u0d9c\u0dad\u0dca \u0db1\u0ddc\u0dc0\u0db1 \u0d85\u0dad\u0dbb \u0dc3\u0db8\u0dc4\u0dbb \u0db4\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dad\u0dc0\u0dbd \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dc3\u0dd0\u0dbd\u0d9a\u0dd2\u0dba \u0dba\u0dd4\u0dad\u0dd4 \u0dba. Dueling \u0da2\u0dcf\u0dbd\u0dba \u0db8\u0dd9\u0dba \u0d89\u0dad\u0dcf \u0dc4\u0ddc\u0db3\u0dd2\u0db1\u0dca \u0db1\u0dd2\u0dbb\u0dd6\u0db4\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d89\u0da9 \u0daf\u0dd9\u0dba\u0dd2. </p>\n<span translate=no>_^_1_^_</span><p>\u0d91\u0db6\u0dd0\u0dc0\u0dd2\u0db1\u0dca\u0d85\u0db4\u0dd2 \u0da2\u0dcf\u0dbd \u0daf\u0dd9\u0d9a\u0d9a\u0dca \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab\u0dba <span translate=no>_^_2_^_</span> <span translate=no>_^_3_^_</span> \u0d9a\u0dbb <span translate=no>_^_4_^_</span> \u0d94\u0dc0\u0dd4\u0db1\u0dca\u0d9c\u0dd9\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dd2\u0db8\u0dd4. <span translate=no>_^_5_^_</span> \u0d85\u0db4\u0dd2 <span translate=no>_^_6_^_</span> \u0dc3\u0dc4 <span translate=no>_^_7_^_</span> \u0da2\u0dcf\u0dbd \u0dc0\u0dbd \u0d86\u0dbb\u0db8\u0dca\u0db7\u0d9a \u0dc3\u0dca\u0dae\u0dbb \u0db6\u0dd9\u0daf\u0dcf \u0d9c\u0db1\u0dd2\u0db8\u0dd4. </p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://arxiv.org/abs/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u0da2\u0dcf\u0dbd\u0dbaDueling \u2694\ufe0f <span translate=no>_^_0_^_</span> \u0dc0\u0da7\u0dd2\u0db1\u0dcf\u0d9a\u0db8\u0dca \u0dc3\u0db3\u0dc4\u0dcf \u0d86\u0d9a\u0dd8\u0dad\u0dd2\u0dba</h2>\n<p>Q-\u0d85\u0d9c\u0dba\u0db1\u0dca\u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d85\u0db4\u0dd2 <a href=\"https://arxiv.org/abs/1511.06581\">\u0da9\u0dd6\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba\u0d9a\u0dca</a> \u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db8\u0dd4. \u0da2\u0dcf\u0dbd \u0d9c\u0dd8\u0dc4 \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab \u0dc1\u0dd2\u0dbd\u0dca\u0db4\u0dba dueling \u0db4\u0dd2\u0da7\u0dd4\u0db4\u0dc3 \u0d87\u0dad\u0dd2 \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0db7\u0dcf\u0db1\u0dba \u0db1\u0db8\u0dca, \u0db6\u0ddc\u0dc4\u0ddd \u0db4\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dad\u0dc0\u0dbd \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dc0\u0dd0\u0daf\u0d9c\u0dad\u0dca \u0db1\u0ddc\u0dc0\u0db1 \u0d85\u0dad\u0dbb \u0dc3\u0db8\u0dc4\u0dbb \u0db4\u0dca\u0dbb\u0dcf\u0db1\u0dca\u0dad\u0dc0\u0dbd \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dc0 \u0dc3\u0dd0\u0dbd\u0d9a\u0dd2\u0dba \u0dba\u0dd4\u0dad\u0dd4 \u0dba. Dueling \u0da2\u0dcf\u0dbd\u0dba \u0db8\u0dd9\u0dba \u0d89\u0dad\u0dcf \u0dc4\u0ddc\u0db3\u0dd2\u0db1\u0dca \u0db1\u0dd2\u0dbb\u0dd6\u0db4\u0dab\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0da7 \u0d89\u0da9 \u0daf\u0dd9\u0dba\u0dd2. </p>\n<span translate=no>_^_1_^_</span><p>\u0d91\u0db6\u0dd0\u0dc0\u0dd2\u0db1\u0dca\u0d85\u0db4\u0dd2 \u0da2\u0dcf\u0dbd \u0daf\u0dd9\u0d9a\u0d9a\u0dca \u0db1\u0dd2\u0dbb\u0dca\u0db8\u0dcf\u0dab\u0dba <span translate=no>_^_2_^_</span> <span translate=no>_^_3_^_</span> \u0d9a\u0dbb <span translate=no>_^_4_^_</span> \u0d94\u0dc0\u0dd4\u0db1\u0dca\u0d9c\u0dd9\u0db1\u0dca \u0dbd\u0db6\u0dcf \u0d9c\u0db1\u0dd2\u0db8\u0dd4. <span translate=no>_^_5_^_</span> \u0d85\u0db4\u0dd2 <span translate=no>_^_6_^_</span> \u0dc3\u0dc4 <span translate=no>_^_7_^_</span> \u0da2\u0dcf\u0dbd \u0dc0\u0dbd \u0d86\u0dbb\u0db8\u0dca\u0db7\u0d9a \u0dc3\u0dca\u0dae\u0dbb \u0db6\u0dd9\u0daf\u0dcf \u0d9c\u0db1\u0dd2\u0db8\u0dd4. </p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> </p>\n",
"<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u0dc3\u0db8\u0dca\u0db4\u0dd4\u0dbb\u0dca\u0dab\u0dba\u0dd9\u0db1\u0dca\u0db8\u0dc3\u0db8\u0dca\u0db6\u0db1\u0dca\u0db0\u0dd2\u0dad \u0dad\u0da7\u0dca\u0da7\u0dd4\u0dc0\u0d9a\u0dca \u0db4\u0dd0\u0dad\u0dbd\u0dd2 \u0dbb\u0dcf\u0db8\u0dd4\u0dc0 \u0dad\u0dd9\u0dc0\u0db1 \u0d9a\u0dd0\u0da7\u0dd2 \u0d9c\u0dd0\u0dc3\u0dd4\u0dab\u0dd4 \u0dc3\u0dca\u0dae\u0dbb\u0dba\u0dd9\u0db1\u0dca \u0d9c\u0db1\u0dca\u0db1\u0dcf \u0d85\u0dad\u0dbb <span translate=no>_^_0_^_</span> \u0dc0\u0dd2\u0dc1\u0dda\u0dc2\u0dcf\u0d82\u0d9c \u0db4\u0dca\u0dbb\u0dad\u0dd2\u0daf\u0dcf\u0db1\u0dba \u0d9a\u0dbb\u0dba\u0dd2 </p>\n",
"<p>Convolution </p>\n": "<p>\u0dc3\u0d82\u0dc0\u0dbd\u0dd2\u0dad </p>\n",
@@ -1,6 +1,6 @@
{
"<h1>Deep Q Network (DQN) Model</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u6df1\u5ea6 Q \u7f51\u7edc (DQN) \u6a21\u578b</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://papers.labml.ai/paper/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u51b3\u6597\u7f51\u7edc \u2694\ufe0f<span translate=no>_^_0_^_</span> \u4ef7\u503c\u89c2\u6a21\u578b</h2>\n<p>\u6211\u4eec\u6b63\u5728\u4f7f\u7528\u51b3<a href=\"https://papers.labml.ai/paper/1511.06581\">\u6597\u7f51\u7edc</a>\u6765\u8ba1\u7b97 Q \u503c\u3002\u51b3\u6597\u7f51\u7edc\u67b6\u6784\u80cc\u540e\u7684\u76f4\u89c9\u662f\uff0c\u5728\u5927\u591a\u6570\u5dde\uff0c\u884c\u52a8\u65e0\u5173\u7d27\u8981\uff0c\u800c\u5728\u67d0\u4e9b\u5dde\uff0c\u884c\u52a8\u610f\u4e49\u91cd\u5927\u3002\u51b3\u6597\u7f51\u7edc\u53ef\u4ee5\u5f88\u597d\u5730\u4f53\u73b0\u8fd9\u4e00\u70b9\u3002</p>\n<span translate=no>_^_1_^_</span><p>\u56e0\u6b64\uff0c\u6211\u4eec\u4e3a<span translate=no>_^_2_^_</span>\u548c\u521b\u5efa\u4e86\u4e24\u4e2a\u7f51\u7edc\uff0c<span translate=no>_^_3_^_</span>\u7136\u540e<span translate=no>_^_4_^_</span>\u4ece\u4e2d\u83b7\u53d6\u3002<span translate=no>_^_5_^_</span>\u6211\u4eec\u5171\u4eab<span translate=no>_^_6_^_</span>\u548c<span translate=no>_^_7_^_</span>\u7f51\u7edc\u7684\u521d\u59cb\u5c42\u3002</p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://arxiv.org/abs/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u51b3\u6597\u7f51\u7edc \u2694\ufe0f<span translate=no>_^_0_^_</span> \u4ef7\u503c\u89c2\u6a21\u578b</h2>\n<p>\u6211\u4eec\u6b63\u5728\u4f7f\u7528\u51b3<a href=\"https://arxiv.org/abs/1511.06581\">\u6597\u7f51\u7edc</a>\u6765\u8ba1\u7b97 Q \u503c\u3002\u51b3\u6597\u7f51\u7edc\u67b6\u6784\u80cc\u540e\u7684\u76f4\u89c9\u662f\uff0c\u5728\u5927\u591a\u6570\u5dde\uff0c\u884c\u52a8\u65e0\u5173\u7d27\u8981\uff0c\u800c\u5728\u67d0\u4e9b\u5dde\uff0c\u884c\u52a8\u610f\u4e49\u91cd\u5927\u3002\u51b3\u6597\u7f51\u7edc\u53ef\u4ee5\u5f88\u597d\u5730\u4f53\u73b0\u8fd9\u4e00\u70b9\u3002</p>\n<span translate=no>_^_1_^_</span><p>\u56e0\u6b64\uff0c\u6211\u4eec\u4e3a<span translate=no>_^_2_^_</span>\u548c\u521b\u5efa\u4e86\u4e24\u4e2a\u7f51\u7edc\uff0c<span translate=no>_^_3_^_</span>\u7136\u540e<span translate=no>_^_4_^_</span>\u4ece\u4e2d\u83b7\u53d6\u3002<span translate=no>_^_5_^_</span>\u6211\u4eec\u5171\u4eab<span translate=no>_^_6_^_</span>\u548c<span translate=no>_^_7_^_</span>\u7f51\u7edc\u7684\u521d\u59cb\u5c42\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u5b8c\u5168\u8fde\u63a5\u7684\u56fe\u5c42\u4ece\u7b2c\u4e09\u4e2a\u5377\u79ef\u56fe\u5c42\u83b7\u53d6\u5c55\u5e73\u7684\u5e27\uff0c\u5e76\u8f93\u51fa<span translate=no>_^_0_^_</span>\u8981\u7d20</p>\n",
"<p>Convolution </p>\n": "<p>\u5377\u79ef</p>\n",
@@ -1,4 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</a></h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://papers.labml.ai/paper/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3001<a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://nn.labml.ai/rl/dqn/experiment.html\"><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</a></h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://arxiv.org/abs/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3001<a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://nn.labml.ai/rl/dqn/experiment.html\"><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"Deep Q Networks (DQN)": "\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)"
}
@@ -1,4 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"><span translate=no>_^_1_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 Q \u0da2\u0dcf\u0dbd (DQN)</a></h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2 \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://papers.labml.ai/paper/1312.5602\">\u0dc3\u0dd9\u0dbd\u0dca\u0dbd\u0db8\u0dca \u0d85\u0da7\u0dcf\u0dbb\u0dd2 \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0dc3\u0dc4 <a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u0da9\u0dd4\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba</a> \u0dc3\u0db8\u0d9f, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u0db4\u0dca\u0dbb\u0db8\u0dd4\u0d9b\u0dad\u0dcf \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba</a> \u0dc3\u0dc4 \u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q \u0da2\u0dcf\u0dbd\u0dba. </p>\n<p>\u0db8\u0dd9\u0db1\u0dca\u0db1 <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">\u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf</a> \u0db6\u0dd0\u0dbd\u0dd3\u0db8 \u0dc3\u0dc4 <a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u0d86\u0daf\u0dbb\u0dca\u0dc1</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8. </p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"> <span translate=no>_^_1_^_</span></a> </p>\n",
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"><span translate=no>_^_1_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 Q \u0da2\u0dcf\u0dbd (DQN)</a></h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2 \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://arxiv.org/abs/1312.5602\">\u0dc3\u0dd9\u0dbd\u0dca\u0dbd\u0db8\u0dca \u0d85\u0da7\u0dcf\u0dbb\u0dd2 \u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 \u0dc1\u0d9a\u0dca\u0dad\u0dd2\u0db8\u0dad\u0dca \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0dda \u0d89\u0d9c\u0dd9\u0db1\u0dd3\u0db8</a> \u0dc3\u0dc4 <a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u0da9\u0dd4\u0dbd\u0dd2\u0d82 \u0da2\u0dcf\u0dbd\u0dba</a> \u0dc3\u0db8\u0d9f, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u0db4\u0dca\u0dbb\u0db8\u0dd4\u0d9b\u0dad\u0dcf \u0db1\u0dd0\u0dc0\u0dad \u0db0\u0dcf\u0dc0\u0db1\u0dba</a> \u0dc3\u0dc4 \u0daf\u0dca\u0dc0\u0dd2\u0dad\u0dca\u0dc0 Q \u0da2\u0dcf\u0dbd\u0dba. </p>\n<p>\u0db8\u0dd9\u0db1\u0dca\u0db1 <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">\u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf</a> \u0db6\u0dd0\u0dbd\u0dd3\u0db8 \u0dc3\u0dc4 <a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u0d86\u0daf\u0dbb\u0dca\u0dc1</a> \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8. </p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> <a href=\"https://app.labml.ai/run/fe1ad986237511ec86e8b763a2d3f710\"> <span translate=no>_^_1_^_</span></a> </p>\n",
"Deep Q Networks (DQN)": "\u0d9c\u0dd0\u0db9\u0dd4\u0dbb\u0dd4 Q \u0da2\u0dcf\u0dbd (DQN)"
}
@@ -1,4 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u6df1\u5ea6 Q \u7f51\u7edc (DQN)</a></h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u7684 PyTorch <a href=\"https://papers.labml.ai/paper/1312.5602\">\u4f7f\u7528\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u73a9\u96c5</a>\u8fbe\u5229\u4ee5\u53ca<a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u51b3\u6597\u7f51\u7edc</a>\u3001<a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u4f18\u5148\u56de\u653e</a>\u548c Double Q Network\u3002</p>\n<p>\u8fd9\u662f<a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">\u5b9e\u9a8c</a>\u548c<a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u6a21\u578b</a>\u5b9e\u73b0\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u6df1\u5ea6 Q \u7f51\u7edc (DQN)</a></h1>\n<p>\u8fd9\u662f <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u7684 PyTorch <a href=\"https://arxiv.org/abs/1312.5602\">\u4f7f\u7528\u6df1\u5ea6\u5f3a\u5316\u5b66\u4e60\u73a9\u96c5</a>\u8fbe\u5229\u4ee5\u53ca<a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u51b3\u6597\u7f51\u7edc</a>\u3001<a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u4f18\u5148\u56de\u653e</a>\u548c Double Q Network\u3002</p>\n<p>\u8fd9\u662f<a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">\u5b9e\u9a8c</a>\u548c<a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u6a21\u578b</a>\u5b9e\u73b0\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"Deep Q Networks (DQN)": "\u6df1\u5ea6\u95ee\u7b54\u7f51\u7edc (DQN)"
}
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
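The suppressed diffs above contain the same change as the visible ones: links of the form https://papers.labml.ai/paper/<id> are rewritten to https://arxiv.org/abs/<id>. As a minimal sketch (not the repository's actual tooling), such a bulk URL fix over the translation JSON files could be scripted as follows; the docs directory and the *.json glob are assumptions for illustration.

import pathlib
import re

# Assumed location of the translation JSON files (hypothetical; adjust to the real layout).
DOCS_ROOT = pathlib.Path("docs")

# Match paper links such as https://papers.labml.ai/paper/1312.5602
# and capture the arXiv identifier.
PAPER_URL = re.compile(r"https://papers\.labml\.ai/paper/(\d{4}\.\d{4,5})")

for path in DOCS_ROOT.rglob("*.json"):
    text = path.read_text(encoding="utf-8")
    fixed = PAPER_URL.sub(r"https://arxiv.org/abs/\1", text)
    if fixed != text:
        # Only rewrite files that actually contained a papers.labml.ai link.
        path.write_text(fixed, encoding="utf-8")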
@@ -1,7 +1,7 @@
{
"<h1>Proximal Policy Optimization - PPO</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>. The experiment uses <a href=\"gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</h1>\n<p><a href=\"https://papers.labml.ai/paper/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u30b7\u30f3\u30d7\u30eb\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb (\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u30bb\u30c3\u30c8) \u3054\u3068\u306b 1 \u56de\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u306b\u306a\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a href=\"gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a 
href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1>Proximal Policy Optimization - PPO</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://arxiv.org/abs/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>. The experiment uses <a href=\"gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</h1>\n<p><a href=\"https://arxiv.org/abs/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u30b7\u30f3\u30d7\u30eb\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb (\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u30bb\u30c3\u30c8) \u3054\u3068\u306b 1 \u56de\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u306b\u306a\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a href=\"gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a 
href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Clipped Value Function Loss</h2>\n<p>Similarly we clip the value function update also.</p>\n<span translate=no>_^_0_^_</span><p>Clipping makes sure the value function <span translate=no>_^_1_^_</span> doesn't deviate significantly from <span translate=no>_^_2_^_</span>.</p>\n": "<h2>\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u306e\u640d\u5931</h2>\n<p>\u540c\u69d8\u306b\u3001\u5024\u95a2\u6570\u306e\u66f4\u65b0\u3082\u30af\u30ea\u30c3\u30d7\u3057\u307e\u3059\u3002</p>\n<span translate=no>_^_0_^_</span><p>\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u306b\u3088\u308a\u3001<span translate=no>_^_1_^_</span>\u5024\u95a2\u6570\u304c\u5927\u304d\u304f\u305a\u308c\u306a\u3044\u3088\u3046\u306b\u3057\u307e\u3059\u3002<span translate=no>_^_2_^_</span></p>\n",
"<h2>PPO Loss</h2>\n<p>Here's how the PPO update rule is derived.</p>\n<p>We want to maximize policy reward <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the reward, <span translate=no>_^_2_^_</span> is the policy, <span translate=no>_^_3_^_</span> is a trajectory sampled from policy, and <span translate=no>_^_4_^_</span> is the discount factor between <span translate=no>_^_5_^_</span>.</p>\n<span translate=no>_^_6_^_</span><p>So, <span translate=no>_^_7_^_</span></p>\n<p>Define discounted-future state distribution, <span translate=no>_^_8_^_</span></p>\n<p>Then,</p>\n<span translate=no>_^_9_^_</span><p>Importance sampling <span translate=no>_^_10_^_</span> from <span translate=no>_^_11_^_</span>,</p>\n<span translate=no>_^_12_^_</span><p>Then we assume <span translate=no>_^_13_^_</span> and <span translate=no>_^_14_^_</span> are similar. The error we introduce to <span translate=no>_^_15_^_</span> by this assumption is bound by the KL divergence between <span translate=no>_^_16_^_</span> and <span translate=no>_^_17_^_</span>. <a href=\"https://papers.labml.ai/paper/1705.10528\">Constrained Policy Optimization</a> shows the proof of this. I haven't read it.</p>\n<span translate=no>_^_18_^_</span>": "<h2>PPO \u30ed\u30b9</h2>\n<p>PPO \u66f4\u65b0\u30eb\u30fc\u30eb\u306f\u6b21\u306e\u65b9\u6cd5\u3067\u5c0e\u304d\u51fa\u3055\u308c\u307e\u3059\u3002</p>\n<p><span translate=no>_^_0_^_</span>\u3053\u3053\u3067\u3001<span translate=no>_^_1_^_</span>\u304c\u5831\u916c\u3001\u304c\u30dd\u30ea\u30b7\u30fc\u3001<span translate=no>_^_2_^_</span>\u304c\u30dd\u30ea\u30b7\u30fc\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u305f\u8ecc\u8de1\u3001<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u305d\u3057\u3066\u305d\u306e\u9593\u306e\u5272\u5f15\u4fc2\u6570\u3067\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u5831\u916c\u3092\u6700\u5927\u5316\u3057\u305f\u3044\u3068\u8003\u3048\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_5_^_</span></p>\n<span translate=no>_^_6_^_</span><p>\u3060\u304b\u3089\u3001<span translate=no>_^_7_^_</span></p>\n<p>\u5272\u5f15\u5f8c\u306e\u5c06\u6765\u306e\u72b6\u614b\u5206\u5e03\u3092\u5b9a\u7fa9\u3057\u3001<span translate=no>_^_8_^_</span></p>\n<p>\u6b21\u306b\u3001</p>\n<span translate=no>_^_9_^_</span><p><span translate=no>_^_10_^_</span><span translate=no>_^_11_^_</span>\u304b\u3089\u306e\u91cd\u8981\u5ea6\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0</p>\n<span translate=no>_^_12_^_</span><p>\u305d\u3046\u3059\u308b\u3068\u3001<span translate=no>_^_13_^_</span><span translate=no>_^_14_^_</span>\u4f3c\u305f\u3088\u3046\u306a\u3082\u306e\u3060\u3068\u4eee\u5b9a\u3057\u307e\u3059\u3002<span translate=no>_^_15_^_</span>\u3053\u306e\u4eee\u5b9a\u306b\u3088\u3063\u3066\u751f\u3058\u308b\u8aa4\u5dee\u306f\u3001<span translate=no>_^_16_^_</span>\u3068\u306e\u9593\u306e KL \u306e\u76f8\u9055\u306b\u3088\u3063\u3066\u6c7a\u307e\u308a\u307e\u3059\u3002<span translate=no>_^_17_^_</span><a href=\"https://papers.labml.ai/paper/1705.10528\">\u5236\u7d04\u4ed8\u304d\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316\u306f\u305d\u306e\u8a3c\u62e0\u3067\u3059</a>\u3002\u307e\u3060\u8aad\u3093\u3067\u306a\u3044\u3088\u3002</p>\n<span translate=no>_^_18_^_</span>",
"<h2>PPO Loss</h2>\n<p>Here's how the PPO update rule is derived.</p>\n<p>We want to maximize policy reward <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the reward, <span translate=no>_^_2_^_</span> is the policy, <span translate=no>_^_3_^_</span> is a trajectory sampled from policy, and <span translate=no>_^_4_^_</span> is the discount factor between <span translate=no>_^_5_^_</span>.</p>\n<span translate=no>_^_6_^_</span><p>So, <span translate=no>_^_7_^_</span></p>\n<p>Define discounted-future state distribution, <span translate=no>_^_8_^_</span></p>\n<p>Then,</p>\n<span translate=no>_^_9_^_</span><p>Importance sampling <span translate=no>_^_10_^_</span> from <span translate=no>_^_11_^_</span>,</p>\n<span translate=no>_^_12_^_</span><p>Then we assume <span translate=no>_^_13_^_</span> and <span translate=no>_^_14_^_</span> are similar. The error we introduce to <span translate=no>_^_15_^_</span> by this assumption is bound by the KL divergence between <span translate=no>_^_16_^_</span> and <span translate=no>_^_17_^_</span>. <a href=\"https://arxiv.org/abs/1705.10528\">Constrained Policy Optimization</a> shows the proof of this. I haven't read it.</p>\n<span translate=no>_^_18_^_</span>": "<h2>PPO \u30ed\u30b9</h2>\n<p>PPO \u66f4\u65b0\u30eb\u30fc\u30eb\u306f\u6b21\u306e\u65b9\u6cd5\u3067\u5c0e\u304d\u51fa\u3055\u308c\u307e\u3059\u3002</p>\n<p><span translate=no>_^_0_^_</span>\u3053\u3053\u3067\u3001<span translate=no>_^_1_^_</span>\u304c\u5831\u916c\u3001\u304c\u30dd\u30ea\u30b7\u30fc\u3001<span translate=no>_^_2_^_</span>\u304c\u30dd\u30ea\u30b7\u30fc\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u305f\u8ecc\u8de1\u3001<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u305d\u3057\u3066\u305d\u306e\u9593\u306e\u5272\u5f15\u4fc2\u6570\u3067\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u5831\u916c\u3092\u6700\u5927\u5316\u3057\u305f\u3044\u3068\u8003\u3048\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_5_^_</span></p>\n<span translate=no>_^_6_^_</span><p>\u3060\u304b\u3089\u3001<span translate=no>_^_7_^_</span></p>\n<p>\u5272\u5f15\u5f8c\u306e\u5c06\u6765\u306e\u72b6\u614b\u5206\u5e03\u3092\u5b9a\u7fa9\u3057\u3001<span translate=no>_^_8_^_</span></p>\n<p>\u6b21\u306b\u3001</p>\n<span translate=no>_^_9_^_</span><p><span translate=no>_^_10_^_</span><span translate=no>_^_11_^_</span>\u304b\u3089\u306e\u91cd\u8981\u5ea6\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0</p>\n<span translate=no>_^_12_^_</span><p>\u305d\u3046\u3059\u308b\u3068\u3001<span translate=no>_^_13_^_</span><span translate=no>_^_14_^_</span>\u4f3c\u305f\u3088\u3046\u306a\u3082\u306e\u3060\u3068\u4eee\u5b9a\u3057\u307e\u3059\u3002<span translate=no>_^_15_^_</span>\u3053\u306e\u4eee\u5b9a\u306b\u3088\u3063\u3066\u751f\u3058\u308b\u8aa4\u5dee\u306f\u3001<span translate=no>_^_16_^_</span>\u3068\u306e\u9593\u306e KL \u306e\u76f8\u9055\u306b\u3088\u3063\u3066\u6c7a\u307e\u308a\u307e\u3059\u3002<span translate=no>_^_17_^_</span><a href=\"https://arxiv.org/abs/1705.10528\">\u5236\u7d04\u4ed8\u304d\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316\u306f\u305d\u306e\u8a3c\u62e0\u3067\u3059</a>\u3002\u307e\u3060\u8aad\u3093\u3067\u306a\u3044\u3088\u3002</p>\n<span translate=no>_^_18_^_</span>",
"<h3>Cliping the policy ratio</h3>\n<span translate=no>_^_0_^_</span><p>The ratio is clipped to be close to 1. We take the minimum so that the gradient will only pull <span translate=no>_^_1_^_</span> towards <span translate=no>_^_2_^_</span> if the ratio is not between <span translate=no>_^_3_^_</span> and <span translate=no>_^_4_^_</span>. This keeps the KL divergence between <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> constrained. Large deviation can cause performance collapse; where the policy performance drops and doesn't recover because we are sampling from a bad policy.</p>\n<p>Using the normalized advantage <span translate=no>_^_7_^_</span> introduces a bias to the policy gradient estimator, but it reduces variance a lot. </p>\n": "<h3>\u30dd\u30ea\u30b7\u30fc\u6bd4\u7387\u306e\u30af\u30ea\u30c3\u30d4\u30f3\u30b0</h3>\n<span translate=no>_^_0_^_</span><p>\u6bd4\u7387\u306f 1 \u306b\u8fd1\u3065\u304f\u3088\u3046\u306b\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3055\u308c\u307e\u3059\u3002<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u6bd4\u7387\u304c\u3068\u306e\u9593\u3067\u306a\u3044\u5834\u5408\u306b\u306e\u307f\u52fe\u914d\u304c\u50be\u304f\u3088\u3046\u306b\u6700\u5c0f\u5316\u3057\u3066\u3044\u307e\u3059<span translate=no>_^_4_^_</span>\u3002\u3053\u308c\u306b\u3088\u308a\u3001\u3068\u306e\u9593\u306e KL <span translate=no>_^_5_^_</span> \u306e\u76f8\u9055\u304c\u6291\u3048\u3089\u308c\u307e\u3059<span translate=no>_^_6_^_</span>\u3002\u5927\u304d\u306a\u504f\u5dee\u304c\u3042\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u304c\u4f4e\u4e0b\u3057\u3001\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u3066\u3044\u308b\u305f\u3081\u306b\u30dd\u30ea\u30b7\u30fc\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u304c\u4f4e\u4e0b\u3057\u3001\u56de\u5fa9\u3057\u306a\u3044\u5834\u5408\u304c\u3042\u308a\u307e\u3059\u3002</p>\n<p>\u6b63\u898f\u5316\u3055\u308c\u305f\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u3092\u4f7f\u7528\u3059\u308b\u3068\u3001<span translate=no>_^_7_^_</span>\u30dd\u30ea\u30b7\u30fc\u52fe\u914d\u63a8\u5b9a\u91cf\u306b\u504f\u308a\u304c\u751f\u3058\u307e\u3059\u304c\u3001\u5206\u6563\u306f\u5927\u5e45\u306b\u6e1b\u5c11\u3057\u307e\u3059\u3002</p>\n",
"<p>ratio <span translate=no>_^_0_^_</span>; <em>this is different from rewards</em> <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u6bd4\u7387<span translate=no>_^_0_^_</span>\u3002<em>\u3053\u308c\u306f\u5831\u916c\u3068\u306f\u7570\u306a\u308a\u307e\u3059</em><span translate=no>_^_1_^_</span>\u3002</p>\n",
"An annotated implementation of Proximal Policy Optimization - PPO algorithm in PyTorch.": "PyTorch\u306e\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO\u30a2\u30eb\u30b4\u30ea\u30ba\u30e0\u306e\u6ce8\u91c8\u4ed8\u304d\u5b9f\u88c5\u3002",
File diff suppressed because one or more lines are too long
@@ -1,7 +1,7 @@
{
"<h1>Proximal Policy Optimization - PPO</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>. The experiment uses <a href=\"gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</h1>\n<p>\u8fd9\u662f P <a href=\"https://pytorch.org\">yTorch</a> \u5b9e\u73b0\u7684<a href=\"https://papers.labml.ai/paper/1707.06347\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a>\u3002</p>\n<p>PPO \u662f\u4e00\u79cd\u7528\u4e8e\u5f3a\u5316\u5b66\u4e60\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u3002\u7b80\u5355\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u5bf9\u6bcf\u4e2a\u6837\u672c\uff08\u6216\u4e00\u7ec4\u6837\u672c\uff09\u8fdb\u884c\u4e00\u6b21\u68af\u5ea6\u66f4\u65b0\u3002\u5bf9\u5355\u4e2a\u6837\u672c\u6267\u884c\u591a\u4e2a\u68af\u5ea6\u6b65\u9aa4\u4f1a\u5bfc\u81f4\u95ee\u9898\uff0c\u56e0\u4e3a\u7b56\u7565\u504f\u5dee\u592a\u5927\uff0c\u4ece\u800c\u4ea7\u751f\u9519\u8bef\u7684\u7b56\u7565\u3002PPO \u5141\u8bb8\u6211\u4eec\u5728\u6bcf\u4e2a\u6837\u672c\u4e2d\u8fdb\u884c\u591a\u6b21\u68af\u5ea6\u66f4\u65b0\uff0c\u65b9\u6cd5\u662f\u5c3d\u91cf\u4f7f\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4fdd\u6301\u4e00\u81f4\u3002\u5982\u679c\u66f4\u65b0\u540e\u7684\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4e0d\u63a5\u8fd1\uff0c\u5219\u901a\u8fc7\u524a\u51cf\u68af\u5ea6\u6d41\u6765\u5b9e\u73b0\u6b64\u76ee\u7684\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002\u8be5\u5b9e\u9a8c\u4f7f\u7528<a href=\"gae.html\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1>Proximal Policy Optimization - PPO</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://arxiv.org/abs/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>. The experiment uses <a href=\"gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</h1>\n<p>\u8fd9\u662f P <a href=\"https://pytorch.org\">yTorch</a> \u5b9e\u73b0\u7684<a href=\"https://arxiv.org/abs/1707.06347\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a>\u3002</p>\n<p>PPO \u662f\u4e00\u79cd\u7528\u4e8e\u5f3a\u5316\u5b66\u4e60\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u3002\u7b80\u5355\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u5bf9\u6bcf\u4e2a\u6837\u672c\uff08\u6216\u4e00\u7ec4\u6837\u672c\uff09\u8fdb\u884c\u4e00\u6b21\u68af\u5ea6\u66f4\u65b0\u3002\u5bf9\u5355\u4e2a\u6837\u672c\u6267\u884c\u591a\u4e2a\u68af\u5ea6\u6b65\u9aa4\u4f1a\u5bfc\u81f4\u95ee\u9898\uff0c\u56e0\u4e3a\u7b56\u7565\u504f\u5dee\u592a\u5927\uff0c\u4ece\u800c\u4ea7\u751f\u9519\u8bef\u7684\u7b56\u7565\u3002PPO \u5141\u8bb8\u6211\u4eec\u5728\u6bcf\u4e2a\u6837\u672c\u4e2d\u8fdb\u884c\u591a\u6b21\u68af\u5ea6\u66f4\u65b0\uff0c\u65b9\u6cd5\u662f\u5c3d\u91cf\u4f7f\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4fdd\u6301\u4e00\u81f4\u3002\u5982\u679c\u66f4\u65b0\u540e\u7684\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4e0d\u63a5\u8fd1\uff0c\u5219\u901a\u8fc7\u524a\u51cf\u68af\u5ea6\u6d41\u6765\u5b9e\u73b0\u6b64\u76ee\u7684\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002\u8be5\u5b9e\u9a8c\u4f7f\u7528<a href=\"gae.html\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Clipped Value Function Loss</h2>\n<p>Similarly we clip the value function update also.</p>\n<span translate=no>_^_0_^_</span><p>Clipping makes sure the value function <span translate=no>_^_1_^_</span> doesn't deviate significantly from <span translate=no>_^_2_^_</span>.</p>\n": "<h2>\u524a\u51cf\u503c\u51fd\u6570\u635f\u5931</h2>\n<p>\u540c\u6837\uff0c\u6211\u4eec\u4e5f\u88c1\u526a\u503c\u51fd\u6570\u7684\u66f4\u65b0\u3002</p>\n<span translate=no>_^_0_^_</span><p>\u88c1\u526a\u53ef\u786e\u4fdd\u503c\u51fd\u6570<span translate=no>_^_1_^_</span>\u4e0d\u4f1a\u660e\u663e\u504f\u79bb<span translate=no>_^_2_^_</span>\u3002</p>\n",
"<h2>PPO Loss</h2>\n<p>Here's how the PPO update rule is derived.</p>\n<p>We want to maximize policy reward <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the reward, <span translate=no>_^_2_^_</span> is the policy, <span translate=no>_^_3_^_</span> is a trajectory sampled from policy, and <span translate=no>_^_4_^_</span> is the discount factor between <span translate=no>_^_5_^_</span>.</p>\n<span translate=no>_^_6_^_</span><p>So, <span translate=no>_^_7_^_</span></p>\n<p>Define discounted-future state distribution, <span translate=no>_^_8_^_</span></p>\n<p>Then,</p>\n<span translate=no>_^_9_^_</span><p>Importance sampling <span translate=no>_^_10_^_</span> from <span translate=no>_^_11_^_</span>,</p>\n<span translate=no>_^_12_^_</span><p>Then we assume <span translate=no>_^_13_^_</span> and <span translate=no>_^_14_^_</span> are similar. The error we introduce to <span translate=no>_^_15_^_</span> by this assumption is bound by the KL divergence between <span translate=no>_^_16_^_</span> and <span translate=no>_^_17_^_</span>. <a href=\"https://papers.labml.ai/paper/1705.10528\">Constrained Policy Optimization</a> shows the proof of this. I haven't read it.</p>\n<span translate=no>_^_18_^_</span>": "<h2>PPO \u635f\u5931</h2>\n<p>\u4ee5\u4e0b\u662f PPO \u66f4\u65b0\u89c4\u5219\u7684\u6d3e\u751f\u65b9\u5f0f\u3002</p>\n<p>\u6211\u4eec\u5e0c\u671b\u6700\u5927\u9650\u5ea6\u5730\u63d0\u9ad8\u4fdd\u5355\u5956\u52b1<span translate=no>_^_0_^_</span>\u5728\u54ea\u91cc<span translate=no>_^_1_^_</span>\uff0c<span translate=no>_^_2_^_</span>\u5956\u52b1\u5728\u54ea\u91cc\uff0c<span translate=no>_^_3_^_</span>\u662f\u4fdd\u5355\uff0c\u662f\u4ece\u4fdd\u5355\u4e2d\u62bd\u6837\u7684\u8f68\u8ff9\uff0c<span translate=no>_^_4_^_</span>\u662f\u4ecb\u4e8e\u4e24\u8005\u4e4b\u95f4\u7684\u6298\u6263\u7cfb\u6570<span translate=no>_^_5_^_</span>\u3002</p>\n<span translate=no>_^_6_^_</span><p>\u6240\u4ee5\uff0c<span translate=no>_^_7_^_</span></p>\n<p>\u5b9a\u4e49\u6298\u6263\u672a\u6765\u72b6\u6001\u5206\u914d\uff0c<span translate=no>_^_8_^_</span></p>\n<p>\u90a3\u4e48\uff0c</p>\n<span translate=no>_^_9_^_</span><p>\u91cd\u8981\u6027\u62bd\u6837<span translate=no>_^_10_^_</span>\u6765\u81ea<span translate=no>_^_11_^_</span></p>\n<span translate=no>_^_12_^_</span><p>\u7136\u540e\u6211\u4eec\u5047\u8bbe<span translate=no>_^_13_^_</span>\u548c<span translate=no>_^_14_^_</span>\u662f\u76f8\u4f3c\u7684\u3002\u6211\u4eec<span translate=no>_^_15_^_</span>\u901a\u8fc7\u8fd9\u4e2a\u5047\u8bbe\u5f15\u5165\u7684\u8bef\u5dee\u53d7<span translate=no>_^_16_^_</span>\u548c\u4e4b\u95f4\u7684 KL \u5dee\u5f02\u7684\u7ea6\u675f<span translate=no>_^_17_^_</span>\u3002<a href=\"https://papers.labml.ai/paper/1705.10528\">\u7ea6\u675f\u7b56\u7565\u4f18\u5316</a>\u8bc1\u660e\u4e86\u8fd9\u4e00\u70b9\u3002\u6211\u8fd8\u6ca1\u770b\u8fc7\u3002</p>\n<span translate=no>_^_18_^_</span>",
"<h2>PPO Loss</h2>\n<p>Here's how the PPO update rule is derived.</p>\n<p>We want to maximize policy reward <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the reward, <span translate=no>_^_2_^_</span> is the policy, <span translate=no>_^_3_^_</span> is a trajectory sampled from policy, and <span translate=no>_^_4_^_</span> is the discount factor between <span translate=no>_^_5_^_</span>.</p>\n<span translate=no>_^_6_^_</span><p>So, <span translate=no>_^_7_^_</span></p>\n<p>Define discounted-future state distribution, <span translate=no>_^_8_^_</span></p>\n<p>Then,</p>\n<span translate=no>_^_9_^_</span><p>Importance sampling <span translate=no>_^_10_^_</span> from <span translate=no>_^_11_^_</span>,</p>\n<span translate=no>_^_12_^_</span><p>Then we assume <span translate=no>_^_13_^_</span> and <span translate=no>_^_14_^_</span> are similar. The error we introduce to <span translate=no>_^_15_^_</span> by this assumption is bound by the KL divergence between <span translate=no>_^_16_^_</span> and <span translate=no>_^_17_^_</span>. <a href=\"https://arxiv.org/abs/1705.10528\">Constrained Policy Optimization</a> shows the proof of this. I haven't read it.</p>\n<span translate=no>_^_18_^_</span>": "<h2>PPO \u635f\u5931</h2>\n<p>\u4ee5\u4e0b\u662f PPO \u66f4\u65b0\u89c4\u5219\u7684\u6d3e\u751f\u65b9\u5f0f\u3002</p>\n<p>\u6211\u4eec\u5e0c\u671b\u6700\u5927\u9650\u5ea6\u5730\u63d0\u9ad8\u4fdd\u5355\u5956\u52b1<span translate=no>_^_0_^_</span>\u5728\u54ea\u91cc<span translate=no>_^_1_^_</span>\uff0c<span translate=no>_^_2_^_</span>\u5956\u52b1\u5728\u54ea\u91cc\uff0c<span translate=no>_^_3_^_</span>\u662f\u4fdd\u5355\uff0c\u662f\u4ece\u4fdd\u5355\u4e2d\u62bd\u6837\u7684\u8f68\u8ff9\uff0c<span translate=no>_^_4_^_</span>\u662f\u4ecb\u4e8e\u4e24\u8005\u4e4b\u95f4\u7684\u6298\u6263\u7cfb\u6570<span translate=no>_^_5_^_</span>\u3002</p>\n<span translate=no>_^_6_^_</span><p>\u6240\u4ee5\uff0c<span translate=no>_^_7_^_</span></p>\n<p>\u5b9a\u4e49\u6298\u6263\u672a\u6765\u72b6\u6001\u5206\u914d\uff0c<span translate=no>_^_8_^_</span></p>\n<p>\u90a3\u4e48\uff0c</p>\n<span translate=no>_^_9_^_</span><p>\u91cd\u8981\u6027\u62bd\u6837<span translate=no>_^_10_^_</span>\u6765\u81ea<span translate=no>_^_11_^_</span></p>\n<span translate=no>_^_12_^_</span><p>\u7136\u540e\u6211\u4eec\u5047\u8bbe<span translate=no>_^_13_^_</span>\u548c<span translate=no>_^_14_^_</span>\u662f\u76f8\u4f3c\u7684\u3002\u6211\u4eec<span translate=no>_^_15_^_</span>\u901a\u8fc7\u8fd9\u4e2a\u5047\u8bbe\u5f15\u5165\u7684\u8bef\u5dee\u53d7<span translate=no>_^_16_^_</span>\u548c\u4e4b\u95f4\u7684 KL \u5dee\u5f02\u7684\u7ea6\u675f<span translate=no>_^_17_^_</span>\u3002<a href=\"https://arxiv.org/abs/1705.10528\">\u7ea6\u675f\u7b56\u7565\u4f18\u5316</a>\u8bc1\u660e\u4e86\u8fd9\u4e00\u70b9\u3002\u6211\u8fd8\u6ca1\u770b\u8fc7\u3002</p>\n<span translate=no>_^_18_^_</span>",
"<h3>Cliping the policy ratio</h3>\n<span translate=no>_^_0_^_</span><p>The ratio is clipped to be close to 1. We take the minimum so that the gradient will only pull <span translate=no>_^_1_^_</span> towards <span translate=no>_^_2_^_</span> if the ratio is not between <span translate=no>_^_3_^_</span> and <span translate=no>_^_4_^_</span>. This keeps the KL divergence between <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> constrained. Large deviation can cause performance collapse; where the policy performance drops and doesn't recover because we are sampling from a bad policy.</p>\n<p>Using the normalized advantage <span translate=no>_^_7_^_</span> introduces a bias to the policy gradient estimator, but it reduces variance a lot. </p>\n": "<h3>\u524a\u51cf\u4fdd\u5355\u6bd4\u7387</h3>\n<span translate=no>_^_0_^_</span><p>\u8be5\u6bd4\u7387\u88ab\u88c1\u526a\u4e3a\u63a5\u8fd1 1\u3002\u6211\u4eec\u53d6\u6700\u5c0f\u503c\uff0c\u4ee5\u4fbf\u53ea\u6709\u5f53\u6bd4\u7387\u4e0d\u5728<span translate=no>_^_3_^_</span>\u548c\u4e4b\u95f4\u65f6\uff0c\u68af\u5ea6\u624d\u4f1a\u62c9<span translate=no>_^_1_^_</span>\u5411<span translate=no>_^_2_^_</span><span translate=no>_^_4_^_</span>\u3002\u8fd9\u4fdd\u6301\u4e86 KL \u4e4b\u95f4\u7684\u5dee\u5f02<span translate=no>_^_5_^_</span>\u548c<span translate=no>_^_6_^_</span>\u9650\u5236\u3002\u8f83\u5927\u7684\u504f\u5dee\u53ef\u80fd\u5bfc\u81f4\u6027\u80fd\u4e0b\u964d\uff1b\u5728\u8fd9\u79cd\u60c5\u51b5\u4e0b\uff0c\u7b56\u7565\u6027\u80fd\u4f1a\u4e0b\u964d\u4e14\u65e0\u6cd5\u6062\u590d\uff0c\u56e0\u4e3a\u6211\u4eec\u6b63\u5728\u4ece\u4e0d\u826f\u7b56\u7565\u4e2d\u62bd\u6837\u3002</p>\n<p>\u4f7f\u7528\u5f52\u4e00\u5316\u4f18\u52bf\u4f1a\u7ed9\u653f\u7b56\u68af\u5ea6\u4f30\u8ba1\u5668<span translate=no>_^_7_^_</span>\u5e26\u6765\u504f\u5dee\uff0c\u4f46\u5b83\u5927\u5927\u51cf\u5c11\u4e86\u65b9\u5dee\u3002</p>\n",
"<p>ratio <span translate=no>_^_0_^_</span>; <em>this is different from rewards</em> <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u6bd4\u4f8b<span translate=no>_^_0_^_</span>\uff1b<em>\u8fd9\u4e0e\u5956\u52b1\u4e0d\u540c</em><span translate=no>_^_1_^_</span>\u3002</p>\n",
"An annotated implementation of Proximal Policy Optimization - PPO algorithm in PyTorch.": "PyTorch \u4e2d\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO \u7b97\u6cd5\u7684\u5e26\u6ce8\u91ca\u5b9e\u73b0\u3002",
@@ -1,5 +1,5 @@
{
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u4e00\u822c\u5316\u512a\u4f4d\u6027\u63a8\u5b9a (GAE)</h1>\n<p><a href=\"https://pytorch.org\"><a href=\"https://papers.labml.ai/paper/1506.02438\">\u3053\u308c\u306f\u7d19\u306e\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002</p>\n",
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u4e00\u822c\u5316\u512a\u4f4d\u6027\u63a8\u5b9a (GAE)</h1>\n<p><a href=\"https://pytorch.org\"><a href=\"https://arxiv.org/abs/1506.02438\">\u3053\u308c\u306f\u7d19\u306e\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002</p>\n",
"<h3>Calculate advantages</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> is high bias, low variance, whilst <span translate=no>_^_2_^_</span> is unbiased, high variance.</p>\n<p>We take a weighted average of <span translate=no>_^_3_^_</span> to balance bias and variance. This is called Generalized Advantage Estimation. <span translate=no>_^_4_^_</span> We set <span translate=no>_^_5_^_</span>, this gives clean calculation for <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>": "<h3>\u5229\u70b9\u3092\u8a08\u7b97</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span>\u30d0\u30a4\u30a2\u30b9\u304c\u9ad8\u304f\u5206\u6563\u304c\u5c0f\u3055\u304f\u3001\u504f\u308a\u304c\u306a\u304f\u3001<span translate=no>_^_2_^_</span>\u5206\u6563\u304c\u5927\u304d\u3044\u3002</p>\n<p><span translate=no>_^_3_^_</span>\u30d0\u30a4\u30a2\u30b9\u3068\u5206\u6563\u306e\u30d0\u30e9\u30f3\u30b9\u3092\u53d6\u308b\u305f\u3081\u306b\u3001\u52a0\u91cd\u5e73\u5747\u3092\u53d6\u308a\u307e\u3059\u3002\u3053\u308c\u306f\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3068\u547c\u3070\u308c\u307e\u3059\u3002<span translate=no>_^_4_^_</span>\u8a2d\u5b9a\u3057\u307e\u3057\u305f\u3002\u3053\u308c\u306b\u3088\u308a<span translate=no>_^_5_^_</span>\u3001\u8a08\u7b97\u304c\u304d\u308c\u3044\u306b\u306a\u308a\u307e\u3059 <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>advantages table </p>\n": "<p>\u5229\u70b9\u8868</p>\n",
@@ -1,5 +1,5 @@
{
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba\u0d9a\u0dc5 \u0dc0\u0dcf\u0dc3\u0dd2 \u0d87\u0dc3\u0dca\u0dad\u0db8\u0dda\u0db1\u0dca\u0dad\u0dd4\u0dc0 (GAE)</h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://papers.labml.ai/paper/1506.02438\">\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba \u0d9a\u0dbb\u0db1 \u0dbd\u0daf \u0dc0\u0dcf\u0dc3\u0dd2 \u0d87\u0dc3\u0dca\u0dad\u0db8\u0dda\u0db1\u0dca\u0dad\u0dd4 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2</a> . </p>\n<p>\u0d91\u0dba\u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1 \u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf \u0db6\u0dd0\u0dbd\u0dd3\u0db8\u0d9a\u0dca \u0d94\u0db6\u0da7 \u0dc3\u0ddc\u0dba\u0dcf\u0d9c\u0dad \u0dc4\u0dd0\u0d9a\u0dd2\u0dba <a href=\"experiment.html\">\u0db8\u0dd9\u0dc4\u0dd2</a>. </p>\n",
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba\u0d9a\u0dc5 \u0dc0\u0dcf\u0dc3\u0dd2 \u0d87\u0dc3\u0dca\u0dad\u0db8\u0dda\u0db1\u0dca\u0dad\u0dd4\u0dc0 (GAE)</h1>\n<p>\u0db8\u0dd9\u0dba <a href=\"https://pytorch.org\">PyTorch</a> \u0d9a\u0da9\u0daf\u0dcf\u0dc3\u0dd2 <a href=\"https://arxiv.org/abs/1506.02438\">\u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba \u0d9a\u0dbb\u0db1 \u0dbd\u0daf \u0dc0\u0dcf\u0dc3\u0dd2 \u0d87\u0dc3\u0dca\u0dad\u0db8\u0dda\u0db1\u0dca\u0dad\u0dd4 \u0d9a\u0dca\u0dbb\u0dd2\u0dba\u0dcf\u0dad\u0dca\u0db8\u0d9a \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dd2</a> . </p>\n<p>\u0d91\u0dba\u0db7\u0dcf\u0dc0\u0dd2\u0dad\u0dcf \u0d9a\u0dbb\u0db1 \u0d85\u0dad\u0dca\u0dc4\u0daf\u0dcf \u0db6\u0dd0\u0dbd\u0dd3\u0db8\u0d9a\u0dca \u0d94\u0db6\u0da7 \u0dc3\u0ddc\u0dba\u0dcf\u0d9c\u0dad \u0dc4\u0dd0\u0d9a\u0dd2\u0dba <a href=\"experiment.html\">\u0db8\u0dd9\u0dc4\u0dd2</a>. </p>\n",
"<h3>Calculate advantages</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> is high bias, low variance, whilst <span translate=no>_^_2_^_</span> is unbiased, high variance.</p>\n<p>We take a weighted average of <span translate=no>_^_3_^_</span> to balance bias and variance. This is called Generalized Advantage Estimation. <span translate=no>_^_4_^_</span> We set <span translate=no>_^_5_^_</span>, this gives clean calculation for <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>": "<h3>\u0dc0\u0dcf\u0dc3\u0dd2\u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dbb\u0db1\u0dca\u0db1</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> \u0d89\u0dc4\u0dc5 \u0db1\u0dd0\u0db9\u0dd4\u0dbb\u0dd4\u0dc0, \u0d85\u0da9\u0dd4 \u0dc0\u0dd2\u0da0\u0dbd\u0dad\u0dcf\u0dc0, \u0d85\u0db4\u0d9a\u0dca\u0dc2\u0db4\u0dcf\u0dad\u0dd3 <span translate=no>_^_2_^_</span> \u0dc0\u0db1 \u0d85\u0dad\u0dbb \u0d89\u0dc4\u0dc5 \u0dc0\u0dd2\u0da0\u0dbd\u0dad\u0dcf\u0dc0. </p>\n<p>\u0db1\u0dd0\u0db9\u0dd4\u0dbb\u0dd4\u0dc0\u0dc3\u0dc4 \u0dc0\u0dd2\u0da0\u0dbd\u0dad\u0dcf\u0dc0 \u0dc3\u0db8\u0dad\u0dd4\u0dbd\u0dd2\u0dad <span translate=no>_^_3_^_</span> \u0d9a\u0dd2\u0dbb\u0dd3\u0db8 \u0dc3\u0db3\u0dc4\u0dcf \u0d85\u0db4\u0dd2 \u0db6\u0dbb \u0dad\u0dd0\u0db6\u0dd6 \u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0dba\u0d9a\u0dca \u0d9c\u0db1\u0dd2\u0db8\u0dd4. \u0db8\u0dd9\u0dba \u0dc3\u0dcf\u0db8\u0dcf\u0db1\u0dca\u0dba\u0d9a\u0dbb\u0dab\u0dba \u0d9a\u0dc5 \u0dc0\u0dcf\u0dc3\u0dd2 \u0d87\u0dc3\u0dca\u0dad\u0db8\u0dda\u0db1\u0dca\u0dad\u0dd4\u0dc0 \u0dbd\u0dd9\u0dc3 \u0dc4\u0dd0\u0db3\u0dd2\u0db1\u0dca\u0dc0\u0dda. <span translate=no>_^_4_^_</span> \u0d85\u0db4\u0dd2 \u0dc3\u0d9a\u0dc3\u0dca \u0d9a\u0dc5\u0dd9\u0db8\u0dd4 <span translate=no>_^_5_^_</span>, \u0db8\u0dd9\u0dba \u0db4\u0dd2\u0dbb\u0dd2\u0dc3\u0dd2\u0daf\u0dd4 \u0d9c\u0dab\u0db1\u0dba \u0d9a\u0dd2\u0dbb\u0dd3\u0db8\u0d9a\u0dca \u0dbd\u0db6\u0dcf \u0daf\u0dd9\u0dba\u0dd2 <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span> </p>\n",
"<p>advantages table </p>\n": "<p>\u0dc0\u0dcf\u0dc3\u0dd2\u0dc0\u0d9c\u0dd4\u0dc0 </p>\n",
@@ -1,5 +1,5 @@
{
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1 (GAE)</h1>\n<p>\u8fd9\u662f\u8bba\u6587<a href=\"https://papers.labml.ai/paper/1506.02438\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u7684 <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002</p>\n",
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://arxiv.org/abs/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1 (GAE)</h1>\n<p>\u8fd9\u662f\u8bba\u6587<a href=\"https://arxiv.org/abs/1506.02438\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u7684 <a href=\"https://pytorch.org\">PyTorch</a> \u5b9e\u73b0\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002</p>\n",
"<h3>Calculate advantages</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> is high bias, low variance, whilst <span translate=no>_^_2_^_</span> is unbiased, high variance.</p>\n<p>We take a weighted average of <span translate=no>_^_3_^_</span> to balance bias and variance. This is called Generalized Advantage Estimation. <span translate=no>_^_4_^_</span> We set <span translate=no>_^_5_^_</span>, this gives clean calculation for <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>": "<h3>\u8ba1\u7b97\u4f18\u52bf</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span>\u662f\u9ad8\u504f\u5dee\uff0c\u4f4e\u65b9\u5dee\uff0c\u800c<span translate=no>_^_2_^_</span>\u65e0\u504f\u5dee\uff0c\u9ad8\u65b9\u5dee\u3002</p>\n<p>\u6211\u4eec\u91c7\u7528\u52a0\u6743\u5e73\u5747\u503c<span translate=no>_^_3_^_</span>\u6765\u5e73\u8861\u504f\u5dee\u548c\u65b9\u5dee\u3002\u8fd9\u79f0\u4e3a\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1\u3002<span translate=no>_^_4_^_</span>\u6211\u4eec\u8bbe\u7f6e<span translate=no>_^_5_^_</span>\uff0c\u8fd9\u7ed9\u51fa\u4e86\u5e72\u51c0\u7684\u8ba1\u7b97<span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>advantages table </p>\n": "<p>\u4f18\u52bf\u8868</p>\n",
@@ -1,4 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">Proximal Policy Optimization - PPO</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods one do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a singe sample causes problems because the policy deviates too much producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">here</a>. The experiment uses <a href=\"https://nn.labml.ai/rl/ppo/gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</a></h1>\n<p><a href=\"https://papers.labml.ai/paper/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u5358\u7d14\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb\uff08\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u306e\u30bb\u30c3\u30c8\uff09\u3054\u3068\u306b1\u3064\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u304c\u751f\u6210\u3055\u308c\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a 
href=\"https://nn.labml.ai/rl/ppo/gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">Proximal Policy Optimization - PPO</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://arxiv.org/abs/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods one do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a singe sample causes problems because the policy deviates too much producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">here</a>. The experiment uses <a href=\"https://nn.labml.ai/rl/ppo/gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</a></h1>\n<p><a href=\"https://arxiv.org/abs/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u5358\u7d14\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb\uff08\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u306e\u30bb\u30c3\u30c8\uff09\u3054\u3068\u306b1\u3064\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u304c\u751f\u6210\u3055\u308c\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a 
href=\"https://nn.labml.ai/rl/ppo/gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"Proximal Policy Optimization - PPO": "\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO"
}
File diff suppressed because one or more lines are too long
@@ -1,4 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">Proximal Policy Optimization - PPO</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods one do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a singe sample causes problems because the policy deviates too much producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">here</a>. The experiment uses <a href=\"https://nn.labml.ai/rl/ppo/gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a></h1>\n<p>\u8fd9\u662f P <a href=\"https://pytorch.org\">yTorch</a> \u5b9e\u73b0\u7684<a href=\"https://papers.labml.ai/paper/1707.06347\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a>\u3002</p>\n<p>PPO \u662f\u4e00\u79cd\u7528\u4e8e\u5f3a\u5316\u5b66\u4e60\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u3002\u7b80\u5355\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u53ef\u4ee5\u5bf9\u6bcf\u4e2a\u6837\u672c\uff08\u6216\u4e00\u7ec4\u6837\u672c\uff09\u8fdb\u884c\u4e00\u6b21\u68af\u5ea6\u66f4\u65b0\u3002\u5bf9\u5355\u4e2a\u6837\u672c\u6267\u884c\u591a\u4e2a\u68af\u5ea6\u6b65\u9aa4\u4f1a\u5bfc\u81f4\u95ee\u9898\uff0c\u56e0\u4e3a\u8be5\u7b56\u7565\u504f\u79bb\u5f97\u592a\u5927\uff0c\u4ece\u800c\u4ea7\u751f\u4e86\u9519\u8bef\u7684\u7b56\u7565\u3002PPO \u5141\u8bb8\u6211\u4eec\u5728\u6bcf\u4e2a\u6837\u672c\u4e2d\u8fdb\u884c\u591a\u6b21\u68af\u5ea6\u66f4\u65b0\uff0c\u65b9\u6cd5\u662f\u5c3d\u91cf\u4f7f\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4fdd\u6301\u4e00\u81f4\u3002\u5982\u679c\u66f4\u65b0\u540e\u7684\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4e0d\u63a5\u8fd1\uff0c\u5219\u901a\u8fc7\u524a\u51cf\u68af\u5ea6\u6d41\u6765\u5b9e\u73b0\u6b64\u76ee\u7684\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002\u8be5\u5b9e\u9a8c\u4f7f\u7528<a href=\"https://nn.labml.ai/rl/ppo/gae.html\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">Proximal Policy Optimization - PPO</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://arxiv.org/abs/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods one do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a singe sample causes problems because the policy deviates too much producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">here</a>. The experiment uses <a href=\"https://nn.labml.ai/rl/ppo/gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a></h1>\n<p>\u8fd9\u662f P <a href=\"https://pytorch.org\">yTorch</a> \u5b9e\u73b0\u7684<a href=\"https://arxiv.org/abs/1707.06347\">\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO</a>\u3002</p>\n<p>PPO \u662f\u4e00\u79cd\u7528\u4e8e\u5f3a\u5316\u5b66\u4e60\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u3002\u7b80\u5355\u7684\u7b56\u7565\u68af\u5ea6\u65b9\u6cd5\u53ef\u4ee5\u5bf9\u6bcf\u4e2a\u6837\u672c\uff08\u6216\u4e00\u7ec4\u6837\u672c\uff09\u8fdb\u884c\u4e00\u6b21\u68af\u5ea6\u66f4\u65b0\u3002\u5bf9\u5355\u4e2a\u6837\u672c\u6267\u884c\u591a\u4e2a\u68af\u5ea6\u6b65\u9aa4\u4f1a\u5bfc\u81f4\u95ee\u9898\uff0c\u56e0\u4e3a\u8be5\u7b56\u7565\u504f\u79bb\u5f97\u592a\u5927\uff0c\u4ece\u800c\u4ea7\u751f\u4e86\u9519\u8bef\u7684\u7b56\u7565\u3002PPO \u5141\u8bb8\u6211\u4eec\u5728\u6bcf\u4e2a\u6837\u672c\u4e2d\u8fdb\u884c\u591a\u6b21\u68af\u5ea6\u66f4\u65b0\uff0c\u65b9\u6cd5\u662f\u5c3d\u91cf\u4f7f\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4fdd\u6301\u4e00\u81f4\u3002\u5982\u679c\u66f4\u65b0\u540e\u7684\u7b56\u7565\u4e0e\u7528\u4e8e\u91c7\u6837\u6570\u636e\u7684\u7b56\u7565\u4e0d\u63a5\u8fd1\uff0c\u5219\u901a\u8fc7\u524a\u51cf\u68af\u5ea6\u6d41\u6765\u5b9e\u73b0\u6b64\u76ee\u7684\u3002</p>\n<p>\u4f60\u53ef\u4ee5<a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">\u5728\u8fd9\u91cc</a>\u627e\u5230\u4e00\u4e2a\u4f7f\u7528\u5b83\u7684\u5b9e\u9a8c\u3002\u8be5\u5b9e\u9a8c\u4f7f\u7528<a href=\"https://nn.labml.ai/rl/ppo/gae.html\">\u5e7f\u4e49\u4f18\u52bf\u4f30\u8ba1</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"Proximal Policy Optimization - PPO": "\u8fd1\u7aef\u7b56\u7565\u4f18\u5316-PPO"
}
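Note: the PPO documentation entries diffed above describe clipping the objective when the updated policy drifts too far from the policy that sampled the data. As a rough sketch only (not the repository's implementation), that clipped surrogate loss could be written in PyTorch as below; the names log_pi, sampled_log_pi, advantage, and clip_eps are assumptions introduced for this illustration.

import torch

def ppo_clipped_loss(log_pi: torch.Tensor,
                     sampled_log_pi: torch.Tensor,
                     advantage: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that sampled
    # the data, computed in log space for numerical stability.
    ratio = torch.exp(log_pi - sampled_log_pi)
    # Unclipped surrogate, and the surrogate with the ratio clipped to [1 - eps, 1 + eps].
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Take the pessimistic minimum and negate it to obtain a loss to minimize.
    return -torch.min(unclipped, clipped).mean()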