Mirror of https://github.com/labmlai/annotated_deep_learning_paper_implementations.git (synced 2025-11-02 21:40:15 +08:00)

Commit: ja translation
translate_cache/rl/__init__.ja.json (new file, 5 lines)
@@ -0,0 +1,5 @@
{
"<h1>Reinforcement Learning Algorithms</h1>\n<ul><li><a href=\"ppo\">Proximal Policy Optimization</a> </li>\n<li><a href=\"ppo/experiment.html\">This is an experiment</a> that runs a PPO agent on Atari Breakout. </li>\n<li><a href=\"ppo/gae.html\">Generalized advantage estimation</a> </li>\n<li><a href=\"dqn\">Deep Q Networks</a> </li>\n<li><a href=\"dqn/experiment.html\">This is an experiment</a> that runs a DQN agent on Atari Breakout. </li>\n<li><a href=\"dqn/model.html\">Model</a> with dueling network </li>\n<li><a href=\"dqn/replay_buffer.html\">Prioritized Experience Replay Buffer</a></li></ul>\n<p><a href=\"game.html\">This is the implementation for OpenAI game wrapper</a> using <span translate=no>_^_0_^_</span>.</p>\n": "<h1>\u5f37\u5316\u5b66\u7fd2\u30a2\u30eb\u30b4\u30ea\u30ba\u30e0</h1>\n<ul><li><a href=\"ppo\">\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a></li>\n<li><a href=\"ppo/experiment.html\">\u3053\u308c\u306f\u3001\u30a2\u30bf\u30ea\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u3067 PPO \u30a8\u30fc\u30b8\u30a7\u30f3\u30c8\u3092\u5b9f\u884c\u3059\u308b\u5b9f\u9a13\u3067\u3059</a>\u3002</li>\n<li><a href=\"ppo/gae.html\">\u4e00\u822c\u5316\u3055\u308c\u305f\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a</a></li>\n<li><a href=\"dqn\">\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a></li>\n<li><a href=\"dqn/experiment.html\">\u3053\u308c\u306f\u3001Atari \u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u3067 DQN \u30a8\u30fc\u30b8\u30a7\u30f3\u30c8\u3092\u5b9f\u884c\u3059\u308b\u5b9f\u9a13\u3067\u3059</a>\u3002</li>\n<li><a href=\"dqn/model.html\">\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u642d\u8f09\u30e2\u30c7\u30eb</a></li>\n<li><a href=\"dqn/replay_buffer.html\">\u512a\u5148\u4f53\u9a13\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1</a></li></ul>\n<p><a href=\"game.html\">OpenAI \u30b2\u30fc\u30e0\u30e9\u30c3\u30d1\u30fc\u3092\u4f7f\u7528\u3059\u308b\u5834\u5408\u306e\u5b9f\u88c5\u3067\u3059</a>\u3002<span translate=no>_^_0_^_</span></p>\n",
"Reinforcement Learning Algorithms": "\u5f37\u5316\u5b66\u7fd2\u30a2\u30eb\u30b4\u30ea\u30ba\u30e0",
"This is a collection of PyTorch implementations/tutorials of reinforcement learning algorithms. It currently includes Proximal Policy Optimization, Generalized Advantage Estimation, and Deep Q Networks.": "\u3053\u308c\u306f\u3001\u5f37\u5316\u5b66\u7fd2\u30a2\u30eb\u30b4\u30ea\u30ba\u30e0\u306ePyTorch\u5b9f\u88c5/\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u306e\u30b3\u30ec\u30af\u30b7\u30e7\u30f3\u3067\u3059\u3002\u73fe\u5728\u3001\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316\u3001\u6c4e\u7528\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3001\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u304c\u542b\u307e\u308c\u3066\u3044\u307e\u3059\u3002"
}
translate_cache/rl/dqn/__init__.ja.json (new file, 15 lines)
@@ -0,0 +1,15 @@
{
"<h1>Deep Q Networks (DQN)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"model.html\">Dueling Network</a>, <a href=\"replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"experiment.html\">experiment</a> and <a href=\"model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://papers.labml.ai/paper/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"model.html\">\u3001<a href=\"replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\"><a href=\"model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Train the model</h2>\n<p>We want to find optimal action-value function.</p>\n<span translate=no>_^_0_^_</span><h3>Target network \ud83c\udfaf</h3>\n<p>In order to improve stability we use experience replay that randomly sample from previous experience <span translate=no>_^_1_^_</span>. We also use a Q network with a separate set of parameters <span translate=no>_^_2_^_</span> to calculate the target. <span translate=no>_^_3_^_</span> is updated periodically. This is according to paper <a href=\"https://deepmind.com/research/dqn/\">Human Level Control Through Deep Reinforcement Learning</a>.</p>\n<p>So the loss function is, <span translate=no>_^_4_^_</span></p>\n<h3>Double <span translate=no>_^_5_^_</span>-Learning</h3>\n<p>The max operator in the above calculation uses same network for both selecting the best action and for evaluating the value. That is, <span translate=no>_^_6_^_</span> We use <a href=\"https://papers.labml.ai/paper/1509.06461\">double Q-learning</a>, where the <span translate=no>_^_7_^_</span> is taken from <span translate=no>_^_8_^_</span> and the value is taken from <span translate=no>_^_9_^_</span>.</p>\n<p>And the loss function becomes,</p>\n<span translate=no>_^_10_^_</span>": "<h2>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</h2>\n<p>\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u3092\u898b\u3064\u3051\u305f\u3044\u3002</p>\n<span translate=no>_^_0_^_</span><h3>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \ud83c\udfaf</h3>\n<p>\u5b89\u5b9a\u6027\u3092\u5411\u4e0a\u3055\u305b\u308b\u305f\u3081\u306b\u3001\u4ee5\u524d\u306e\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u304b\u3089\u30e9\u30f3\u30c0\u30e0\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u308b\u30a8\u30af\u30b9\u30da\u30ea\u30a8\u30f3\u30b9\u306e\u30ea\u30d7\u30ec\u30a4\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_1_^_</span>\u307e\u305f\u3001<span translate=no>_^_2_^_</span>\u5225\u306e\u30d1\u30e9\u30e1\u30fc\u30bf\u30bb\u30c3\u30c8\u3092\u6301\u3064Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u30bf\u30fc\u30b2\u30c3\u30c8\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_3_^_</span>\u5b9a\u671f\u7684\u306b\u66f4\u65b0\u3055\u308c\u307e\u3059\u3002\u3053\u308c\u306f\u3001<a href=\"https://deepmind.com/research/dqn/\">\u6df1\u5c64\u5f37\u5316\u5b66\u7fd2\u306b\u3088\u308b\u30d2\u30e5\u30fc\u30de\u30f3\u30ec\u30d9\u30eb\u5236\u5fa1\u306e\u8ad6\u6587\u306b\u3088\u308b\u3082\u306e\u3067\u3059</a></p>\u3002\n<p>\u3057\u305f\u304c\u3063\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u3001<span translate=no>_^_4_^_</span></p>\n<h3><span translate=no>_^_5_^_</span>\u30c0\u30d6\u30eb\u30e9\u30fc\u30cb\u30f3\u30b0</h3>\n<p>\u4e0a\u306e\u8a08\u7b97\u306e max \u6f14\u7b97\u5b50\u306f\u3001\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u306e\u9078\u629e\u3068\u5024\u306e\u8a55\u4fa1\u306e\u4e21\u65b9\u306b\u540c\u3058\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3064\u307e\u308a<span translate=no>_^_6_^_</span>\u3001<a href=\"https://papers.labml.ai/paper/1509.06461\"><span translate=no>_^_7_^_</span><span translate=no>_^_8_^_</span>\u306e\u53d6\u5f97\u5143\u3068\u5024\u306e\u53d6\u5f97\u5143\u3068\u3044\u3046\u4e8c\u91cdQ\u30e9\u30fc\u30cb\u30f3\u30b0\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002<span 
translate=no>_^_9_^_</span>\n<p>\u305d\u3057\u3066\u3001\u640d\u5931\u95a2\u6570\u306f\u6b21\u306e\u3088\u3046\u306b\u306a\u308a\u307e\u3059\u3002</p>\n<span translate=no>_^_10_^_</span>",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>Calculate the desired Q value. We multiply by <span translate=no>_^_0_^_</span> to zero out the next state Q values if the game ended.</p>\n<p><span translate=no>_^_1_^_</span> </p>\n": "<p>\u76ee\u7684\u306e Q \u5024\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002\u30b2\u30fc\u30e0\u304c\u7d42\u4e86\u3057\u305f\u3089<span translate=no>_^_0_^_</span>\u3001\u3092\u639b\u3051\u3066\u6b21\u306e\u30b9\u30c6\u30fc\u30c8\u306eQ\u5024\u3092\u30bc\u30ed\u306b\u3057\u307e\u3059</p>\u3002\n<p><span translate=no>_^_1_^_</span></p>\n",
"<p>Get the best action at state <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>\u5dde\u3067\u6700\u9ad8\u306e\u30a2\u30af\u30b7\u30e7\u30f3\u3092 <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span></p>\n",
"<p>Get the q value from the target network for the best action at state <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span> </p>\n": "<p>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u304b\u3089 q \u5024\u3092\u53d6\u5f97\u3057\u3066\u3001\u72b6\u614b\u3067\u306e\u6700\u9069\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u73fe\u3059\u308b <span translate=no>_^_0_^_</span> <span translate=no>_^_1_^_</span></p>\n",
"<p>Get weighted means </p>\n": "<p>\u52a0\u91cd\u5e73\u5747\u3092\u53d6\u5f97</p>\n",
"<p>Gradients shouldn't propagate gradients <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u3092\u4f1d\u64ad\u3057\u3066\u306f\u3044\u3051\u307e\u305b\u3093 <span translate=no>_^_0_^_</span></p>\n",
"<p>Temporal difference error <span translate=no>_^_0_^_</span> is used to weigh samples in replay buffer </p>\n": "<p><span translate=no>_^_0_^_</span>\u6642\u9593\u5dee\u30a8\u30e9\u30fc\u306f\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u5185\u306e\u30b5\u30f3\u30d7\u30eb\u306e\u91cd\u307f\u4ed8\u3051\u306b\u4f7f\u7528\u3055\u308c\u307e\u3059</p>\n",
"<p>We take <a href=\"https://en.wikipedia.org/wiki/Huber_loss\">Huber loss</a> instead of mean squared error loss because it is less sensitive to outliers </p>\n": "<p>\u5916\u308c\u5024\u306e\u5f71\u97ff\u3092\u53d7\u3051\u306b\u304f\u3044\u306e\u3067\u3001<a href=\"https://en.wikipedia.org/wiki/Huber_loss\">\u5e73\u5747\u4e8c\u4e57\u8aa4\u5dee\u640d\u5931\u306e\u4ee3\u308f\u308a\u306b\u30d5\u30fc\u30d0\u30fc\u640d\u5931\u3092\u4f7f\u7528\u3057\u307e\u3059</a>\u3002</p>\n",
"<ul><li><span translate=no>_^_0_^_</span> - <span translate=no>_^_1_^_</span> </li>\n<li><span translate=no>_^_2_^_</span> - <span translate=no>_^_3_^_</span> </li>\n<li><span translate=no>_^_4_^_</span> - <span translate=no>_^_5_^_</span> </li>\n<li><span translate=no>_^_6_^_</span> - <span translate=no>_^_7_^_</span> </li>\n<li><span translate=no>_^_8_^_</span> - whether the game ended after taking the action </li>\n<li><span translate=no>_^_9_^_</span> - <span translate=no>_^_10_^_</span> </li>\n<li><span translate=no>_^_11_^_</span> - weights of the samples from prioritized experienced replay</li></ul>\n": "<ul><li><span translate=no>_^_0_^_</span>-<span translate=no>_^_1_^_</span></li>\n<li><span translate=no>_^_2_^_</span>-<span translate=no>_^_3_^_</span></li>\n<li><span translate=no>_^_4_^_</span>-<span translate=no>_^_5_^_</span></li>\n<li><span translate=no>_^_6_^_</span>-<span translate=no>_^_7_^_</span></li>\n<li><span translate=no>_^_8_^_</span>-\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u305f\u5f8c\u306b\u30b2\u30fc\u30e0\u304c\u7d42\u4e86\u3057\u305f\u304b\u3069\u3046\u304b</li>\n<li><span translate=no>_^_9_^_</span>-<span translate=no>_^_10_^_</span></li>\n<li><span translate=no>_^_11_^_</span>-\u7d4c\u9a13\u8c4a\u304b\u306a\u30ea\u30d7\u30ec\u30a4\u3092\u512a\u5148\u3057\u3066\u62bd\u51fa\u3057\u305f\u30b5\u30f3\u30d7\u30eb\u306e\u91cd\u307f</li></ul>\n",
"Deep Q Networks (DQN)": "\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)",
"This is a PyTorch implementation/tutorial of Deep Q Networks (DQN) from paper Playing Atari with Deep Reinforcement Learning. This includes dueling network architecture, a prioritized replay buffer and double-Q-network training.": "\u3053\u308c\u306f\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\uff08DQN\uff09\u306ePyTorch\u5b9f\u88c5/\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3067\u3001\u300c\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3067\u30a2\u30bf\u30ea\u3092\u30d7\u30ec\u30a4\u300d\u3068\u3044\u3046\u8ad6\u6587\u304b\u3089\u5f15\u7528\u3057\u3066\u3044\u307e\u3059\u3002\u3053\u308c\u306b\u306f\u3001\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3\u3001\u512a\u5148\u9806\u4f4d\u4ed8\u3051\u3055\u308c\u305f\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u3001\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u304c\u542b\u307e\u308c\u307e\u3059\u3002"
}
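
The entries above describe how the DQN loss is put together: a periodically-updated target network, double Q-learning for the bootstrap target, Huber loss instead of mean squared error, and TD errors that set priorities in the replay buffer. Below is a minimal PyTorch sketch of that loss computation; the tensor names (`q`, `q_next`, `q_next_target`, `weights`) are illustrative, not the repository's exact API.

```python
import torch
import torch.nn.functional as F


def double_q_loss(q, action, reward, done, q_next, q_next_target, weights, gamma=0.99):
    # Q(s, a) predicted by the online network for the actions actually taken
    q_sampled = q.gather(-1, action.unsqueeze(-1)).squeeze(-1)

    with torch.no_grad():  # gradients shouldn't propagate through the target
        # Double Q-learning: select the best next action with the online network...
        best_next_action = q_next.argmax(dim=-1)
        # ...but evaluate it with the target network
        q_next_best = q_next_target.gather(-1, best_next_action.unsqueeze(-1)).squeeze(-1)
        # Zero out the next-state value where the episode ended
        q_target = reward + gamma * q_next_best * (1.0 - done.float())

    # TD error; its magnitude is used to weigh samples in the prioritized replay buffer
    td_error = q_sampled - q_target
    # Huber (smooth L1) loss is less sensitive to outliers than mean squared error
    loss = F.smooth_l1_loss(q_sampled, q_target, reduction='none')
    # Weighted mean using importance-sampling weights from prioritized replay
    return td_error, (weights * loss).mean()
```
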
translate_cache/rl/dqn/experiment.ja.json (new file, 81 lines)
@@ -0,0 +1,81 @@
{
"<h1>DQN Experiment with Atari Breakout</h1>\n<p>This experiment trains a Deep Q Network (DQN) to play Atari Breakout game on OpenAI Gym. It runs the <a href=\"../game.html\">game environments on multiple processes</a> to sample efficiently.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13</h1>\n<p>\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\uff08DQN\uff09\u306bOpenAI Gym\u3067\u30a2\u30bf\u30ea\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u30b2\u30fc\u30e0\u3092\u30d7\u30ec\u30a4\u3059\u308b\u3088\u3046\u306b\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3057\u307e\u3059\u3002<a href=\"../game.html\">\u30b2\u30fc\u30e0\u74b0\u5883\u3092\u8907\u6570\u306e\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3057\u3066\u52b9\u7387\u7684\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u307e\u3059</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Run it</h2>\n": "<h2>\u5b9f\u884c\u3057\u3066\u304f\u3060\u3055\u3044</h2>\n",
"<h2>Trainer</h2>\n": "<h2>\u30c8\u30ec\u30fc\u30ca\u30fc</h2>\n",
"<h3>Destroy</h3>\n<p>Stop the workers</p>\n": "<h3>\u7834\u58ca</h3>\n<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",
"<h3>Run training loop</h3>\n": "<h3>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30eb\u30fc\u30d7\u3092\u5b9f\u884c</h3>\n",
"<h3>Sample data</h3>\n": "<h3>\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf</h3>\n",
"<h3>Train the model</h3>\n": "<h3>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</h3>\n",
"<h4><span translate=no>_^_0_^_</span>-greedy Sampling</h4>\n<p>When sampling actions we use a <span translate=no>_^_1_^_</span>-greedy strategy, where we take a greedy action with probabiliy <span translate=no>_^_2_^_</span> and take a random action with probability <span translate=no>_^_3_^_</span>. We refer to <span translate=no>_^_4_^_</span> as <span translate=no>_^_5_^_</span>.</p>\n": "<h4><span translate=no>_^_0_^_</span>-\u8caa\u6b32\u306a\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0</h4>\n<p>\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3059\u308b\u3068\u304d\u306f\u3001<span translate=no>_^_1_^_</span>-greedy \u30b9\u30c8\u30e9\u30c6\u30b8\u30fc\u3092\u4f7f\u7528\u3057\u307e\u3059\u3002\u3064\u307e\u308a\u3001<span translate=no>_^_2_^_</span>\u78ba\u7387\u306e\u3042\u308b\u8caa\u6b32\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u3001\u78ba\u7387\u306e\u3042\u308b\u30e9\u30f3\u30c0\u30e0\u306a\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u307e\u3059\u3002<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u3068\u547c\u3073\u307e\u3059<span translate=no>_^_5_^_</span>\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> for prioritized replay </p>\n": "<p><span translate=no>_^_0_^_</span>\u512a\u5148\u518d\u751f\u7528</p>\n",
"<p><span translate=no>_^_0_^_</span> for replay buffer as a function of updates </p>\n": "<p><span translate=no>_^_0_^_</span>\u66f4\u65b0\u6a5f\u80fd\u3068\u3057\u3066\u306e\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u7528</p>\n",
"<p><span translate=no>_^_0_^_</span>, exploration fraction </p>\n": "<p><span translate=no>_^_0_^_</span>\u3001\u63a2\u67fb\u30d5\u30e9\u30af\u30b7\u30e7\u30f3</p>\n",
"<p>Add a new line to the screen periodically </p>\n": "<p>\u753b\u9762\u306b\u5b9a\u671f\u7684\u306b\u65b0\u3057\u3044\u884c\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044</p>\n",
"<p>Add transition to replay buffer </p>\n": "<p>\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306b\u30c8\u30e9\u30f3\u30b8\u30b7\u30e7\u30f3\u3092\u8ffd\u52a0</p>\n",
"<p>Calculate gradients </p>\n": "<p>\u52fe\u914d\u306e\u8a08\u7b97</p>\n",
"<p>Calculate priorities for replay buffer <span translate=no>_^_0_^_</span> </p>\n": "<p>\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306e\u512a\u5148\u5ea6\u3092\u8a08\u7b97 <span translate=no>_^_0_^_</span></p>\n",
"<p>Clip gradients </p>\n": "<p>\u30af\u30ea\u30c3\u30d7\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3</p>\n",
"<p>Collect information from each worker </p>\n": "<p>\u5404\u4f5c\u696d\u8005\u304b\u3089\u60c5\u5831\u3092\u53ce\u96c6\u3059\u308b</p>\n",
"<p>Compute Temporal Difference (TD) errors, <span translate=no>_^_0_^_</span>, and the loss, <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u6642\u5dee (TD) \u8aa4\u5dee<span translate=no>_^_0_^_</span>\u3001\u304a\u3088\u3073\u640d\u5931\u3092\u8a08\u7b97\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",
"<p>Configurations </p>\n": "<p>\u30b3\u30f3\u30d5\u30a3\u30ae\u30e5\u30ec\u30fc\u30b7\u30e7\u30f3</p>\n",
"<p>Copy to target network initially </p>\n": "<p>\u6700\u521d\u306b\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u306b\u30b3\u30d4\u30fc</p>\n",
"<p>Create the experiment </p>\n": "<p>\u5b9f\u9a13\u3092\u4f5c\u6210</p>\n",
"<p>Get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u53d6\u5f97 <span translate=no>_^_0_^_</span></p>\n",
"<p>Get Q_values for the current observation </p>\n": "<p>\u73fe\u5728\u306e\u89b3\u6e2c\u5024\u306e Q_value \u3092\u53d6\u5f97</p>\n",
"<p>Get results after executing the actions </p>\n": "<p>\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u305f\u5f8c\u306b\u7d50\u679c\u3092\u53d6\u5f97</p>\n",
"<p>Get the Q-values of the next state for <a href=\"index.html\">Double Q-learning</a>. Gradients shouldn't propagate for these </p>\n": "<p><a href=\"index.html\">\u4e8c\u91cdQ\u5b66\u7fd2\u306e\u6b21\u306e\u72b6\u614b\u306eQ\u5024\u3092\u53d6\u5f97\u3057\u307e\u3059</a>\u3002\u3053\u308c\u3089\u306e\u5834\u5408\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u4f1d\u64ad\u3057\u306a\u3044\u306f\u305a\u3067\u3059</p>\n",
"<p>Get the predicted Q-value </p>\n": "<p>\u4e88\u6e2c\u3055\u308c\u305f Q \u5024\u306e\u53d6\u5f97</p>\n",
"<p>Initialize the trainer </p>\n": "<p>\u30c8\u30ec\u30fc\u30ca\u30fc\u3092\u521d\u671f\u5316</p>\n",
"<p>Last 100 episode information </p>\n": "<p>\u6700\u65b0100\u8a71\u306e\u60c5\u5831</p>\n",
"<p>Learning rate. </p>\n": "<p>\u5b66\u7fd2\u7387\u3002</p>\n",
"<p>Mini batch size </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u30b5\u30a4\u30ba</p>\n",
"<p>Model for sampling and training </p>\n": "<p>\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3068\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u7528\u306e\u30e2\u30c7\u30eb</p>\n",
"<p>Number of epochs to train the model with sampled data. </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf\u3092\u4f7f\u7528\u3057\u3066\u30e2\u30c7\u30eb\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u30a8\u30dd\u30c3\u30af\u306e\u6570\u3002</p>\n",
"<p>Number of steps to run on each process for a single update </p>\n": "<p>1 \u56de\u306e\u66f4\u65b0\u3067\u5404\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3059\u308b\u30b9\u30c6\u30c3\u30d7\u306e\u6570</p>\n",
"<p>Number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",
"<p>Number of worker processes </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9\u306e\u6570</p>\n",
"<p>Periodically update target network </p>\n": "<p>\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u5b9a\u671f\u7684\u306b\u66f4\u65b0</p>\n",
"<p>Pick the action based on <span translate=no>_^_0_^_</span> </p>\n": "<p>\u4ee5\u4e0b\u306b\u57fa\u3065\u3044\u3066\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u9078\u629e\u3057\u3066\u304f\u3060\u3055\u3044 <span translate=no>_^_0_^_</span></p>\n",
"<p>Replay buffer with <span translate=no>_^_0_^_</span>. Capacity of the replay buffer must be a power of 2. </p>\n": "<p>\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u306f<span translate=no>_^_0_^_</span>.\u518d\u751f\u30d0\u30c3\u30d5\u30a1\u306e\u5bb9\u91cf\u306f 2 \u306e\u7d2f\u4e57\u3067\u306a\u3051\u308c\u3070\u306a\u308a\u307e\u305b\u3093</p>\u3002\n",
"<p>Run and monitor the experiment </p>\n": "<p>\u5b9f\u9a13\u306e\u5b9f\u884c\u3068\u76e3\u8996</p>\n",
"<p>Run sampled actions on each worker </p>\n": "<p>\u5404\u30ef\u30fc\u30ab\u30fc\u3067\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c</p>\n",
"<p>Sample <span translate=no>_^_0_^_</span> </p>\n": "<p>[\u30b5\u30f3\u30d7\u30eb] <span translate=no>_^_0_^_</span></p>\n",
"<p>Sample actions </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3</p>\n",
"<p>Sample from priority replay buffer </p>\n": "<p>\u30d7\u30e9\u30a4\u30aa\u30ea\u30c6\u30a3\u30fb\u30ea\u30d7\u30ec\u30a4\u30fb\u30d0\u30c3\u30d5\u30a1\u304b\u3089\u306e\u30b5\u30f3\u30d7\u30eb</p>\n",
"<p>Sample the action with highest Q-value. This is the greedy action. </p>\n": "<p>Q\u5024\u304c\u6700\u3082\u9ad8\u3044\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u307e\u3059\u3002\u3053\u308c\u306f\u8caa\u6b32\u306a\u884c\u52d5\u3067\u3059</p>\u3002\n",
"<p>Sample with current policy </p>\n": "<p>\u73fe\u5728\u306e\u30dd\u30ea\u30b7\u30fc\u3092\u542b\u3080\u30b5\u30f3\u30d7\u30eb</p>\n",
"<p>Sampling doesn't need gradients </p>\n": "<p>\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u306f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u5fc5\u8981\u3042\u308a\u307e\u305b\u3093</p>\n",
"<p>Save tracked indicators. </p>\n": "<p>\u8ffd\u8de1\u6307\u6a19\u3092\u4fdd\u5b58\u3057\u307e\u3059\u3002</p>\n",
"<p>Scale observations from <span translate=no>_^_0_^_</span> to <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u89b3\u6e2c\u5024\u3092\u304b\u3089\u306b\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0 <span translate=no>_^_1_^_</span></p>\n",
"<p>Select device </p>\n": "<p>\u30c7\u30d0\u30a4\u30b9\u3092\u9078\u629e</p>\n",
"<p>Set learning rate </p>\n": "<p>\u5b66\u7fd2\u7387\u3092\u8a2d\u5b9a</p>\n",
"<p>Start training after the buffer is full </p>\n": "<p>\u30d0\u30c3\u30d5\u30a1\u30fc\u304c\u3044\u3063\u3071\u3044\u306b\u306a\u3063\u305f\u3089\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3092\u958b\u59cb\u3059\u308b</p>\n",
"<p>Stop the workers </p>\n": "<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",
"<p>Target model updating interval </p>\n": "<p>\u5bfe\u8c61\u30e2\u30c7\u30eb\u306e\u66f4\u65b0\u9593\u9694</p>\n",
"<p>This doesn't need gradients </p>\n": "<p>\u3053\u308c\u306b\u306f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306f\u5fc5\u8981\u3042\u308a\u307e\u305b\u3093</p>\n",
"<p>Train the model </p>\n": "<p>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</p>\n",
"<p>Uniformly sample and action </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u3068\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5747\u4e00\u306b</p>\n",
"<p>Update parameters based on gradients </p>\n": "<p>\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306b\u57fa\u3065\u3044\u3066\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u66f4\u65b0</p>\n",
"<p>Update replay buffer priorities </p>\n": "<p>\u30ea\u30d7\u30ec\u30a4\u30d0\u30c3\u30d5\u30a1\u306e\u512a\u5148\u9806\u4f4d\u3092\u66f4\u65b0</p>\n",
"<p>Whether to chose greedy action or the random action </p>\n": "<p>\u6b32\u5f35\u308a\u30a2\u30af\u30b7\u30e7\u30f3\u3068\u30e9\u30f3\u30c0\u30e0\u30a2\u30af\u30b7\u30e7\u30f3\u306e\u3069\u3061\u3089\u3092\u9078\u3076\u304b</p>\n",
"<p>Zero out the previously calculated gradients </p>\n": "<p>\u4ee5\u524d\u306b\u8a08\u7b97\u3057\u305f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u3092\u30bc\u30ed\u306b\u3057\u307e\u3059</p>\n",
"<p>create workers </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u3092\u4f5c\u6210</p>\n",
"<p>exploration as a function of updates </p>\n": "<p>\u66f4\u65b0\u6a5f\u80fd\u3068\u3057\u3066\u306e\u63a2\u7d22</p>\n",
"<p>get the initial observations </p>\n": "<p>\u521d\u671f\u89b3\u6e2c\u5024\u3092\u53d6\u5f97</p>\n",
"<p>initialize tensors for observations </p>\n": "<p>\u89b3\u6e2c\u7528\u306e\u30c6\u30f3\u30bd\u30eb\u3092\u521d\u671f\u5316</p>\n",
"<p>learning rate </p>\n": "<p>\u5b66\u7fd2\u7387</p>\n",
"<p>loss function </p>\n": "<p>\u640d\u5931\u95a2\u6570</p>\n",
"<p>number of training iterations </p>\n": "<p>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u306e\u53cd\u5fa9\u56de\u6570</p>\n",
"<p>number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",
"<p>number of workers </p>\n": "<p>\u52b4\u50cd\u8005\u306e\u6570</p>\n",
"<p>optimizer </p>\n": "<p>\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc</p>\n",
"<p>reset the workers </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u3092\u30ea\u30bb\u30c3\u30c8</p>\n",
"<p>size of mini batch for training </p>\n": "<p>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u7528\u30df\u30cb\u30d0\u30c3\u30c1\u306e\u30b5\u30a4\u30ba</p>\n",
"<p>steps sampled on each update </p>\n": "<p>\u66f4\u65b0\u306e\u305f\u3073\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u308b\u30b9\u30c6\u30c3\u30d7</p>\n",
"<p>target model to get <span translate=no>_^_0_^_</span> </p>\n": "<p>\u53d6\u5f97\u3059\u308b\u5bfe\u8c61\u30e2\u30c7\u30eb <span translate=no>_^_0_^_</span></p>\n",
"<p>update current observation </p>\n": "<p>\u73fe\u5728\u306e\u89b3\u6e2c\u5024\u3092\u66f4\u65b0</p>\n",
"<p>update episode information. collect episode info, which is available if an episode finished; this includes total reward and length of the episode - look at <span translate=no>_^_0_^_</span> to see how it works. </p>\n": "<p>\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u3092\u66f4\u65b0\u3057\u307e\u3059\u3002\u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u7d42\u4e86\u3057\u305f\u5834\u5408\u306b\u5229\u7528\u3067\u304d\u308b\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u3092\u53ce\u96c6\u3057\u307e\u3059\u3002\u3053\u308c\u306b\u306f\u3001\u5408\u8a08\u5831\u916c\u3068\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u9577\u3055\u304c\u542b\u307e\u308c\u307e\u3059\u3002\u4ed5\u7d44\u307f\u3092\u78ba\u8a8d\u3057\u3066\u307f\u3066\u304f\u3060\u3055\u3044\u3002<span translate=no>_^_0_^_</span></p>\n",
"<p>update target network every 250 update </p>\n": "<p>250 \u56de\u306e\u66f4\u65b0\u3054\u3068\u306b\u30bf\u30fc\u30b2\u30c3\u30c8\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u66f4\u65b0</p>\n",
"DQN Experiment with Atari Breakout": "\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13",
"Implementation of DQN experiment with Atari Breakout": "\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bDQN\u5b9f\u9a13\u306e\u5b9f\u65bd"
}
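
Several experiment entries above refer to ε-greedy sampling: take the greedy action with probability 1 − ε and a random action with probability ε, with ε annealed over an exploration fraction of training. A small sketch of that sampling step, with assumed tensor shapes and a hypothetical function name:

```python
import torch


def e_greedy_sample(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Pick one action per worker from a [n_workers, n_actions] Q-value tensor."""
    # Greedy action: the action with the highest Q-value
    greedy = q_values.argmax(dim=-1)
    # Uniformly sampled random action
    random_action = torch.randint(q_values.shape[-1], greedy.shape)
    # Per worker, decide whether to explore (probability epsilon) or exploit
    explore = torch.rand(greedy.shape) < epsilon
    return torch.where(explore, random_action, greedy)
```
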
translate_cache/rl/dqn/model.ja.json (new file, 16 lines)
@@ -0,0 +1,16 @@
{
"<h1>Deep Q Network (DQN) Model</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN) \u30e2\u30c7\u30eb</h1>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Dueling Network \u2694\ufe0f Model for <span translate=no>_^_0_^_</span> Values</h2>\n<p>We are using a <a href=\"https://papers.labml.ai/paper/1511.06581\">dueling network</a> to calculate Q-values. Intuition behind dueling network architecture is that in most states the action doesn't matter, and in some states the action is significant. Dueling network allows this to be represented very well.</p>\n<span translate=no>_^_1_^_</span><p>So we create two networks for <span translate=no>_^_2_^_</span> and <span translate=no>_^_3_^_</span> and get <span translate=no>_^_4_^_</span> from them. <span translate=no>_^_5_^_</span> We share the initial layers of the <span translate=no>_^_6_^_</span> and <span translate=no>_^_7_^_</span> networks.</p>\n": "<h2>\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af \u2694\ufe0f \u4fa1\u5024\u30e2\u30c7\u30eb <span translate=no>_^_0_^_</span></h2>\n<p><a href=\"https://papers.labml.ai/paper/1511.06581\">Q\u5024\u306e\u8a08\u7b97\u306b\u306f\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a>\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3\u306e\u80cc\u5f8c\u306b\u3042\u308b\u76f4\u611f\u306f\u3001\u307b\u3068\u3093\u3069\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u306f\u91cd\u8981\u3067\u306f\u306a\u304f\u3001\u4e00\u90e8\u306e\u5dde\u3067\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u304c\u91cd\u8981\u3067\u3042\u308b\u3068\u3044\u3046\u3053\u3068\u3067\u3059\u3002\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3067\u306f\u3001\u3053\u308c\u3092\u975e\u5e38\u306b\u3088\u304f\u8868\u73fe\u3067\u304d\u307e\u3059</p>\u3002\n<span translate=no>_^_1_^_</span><p>\u305d\u3053\u3067\u3001<span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u3068\u304b\u3089\u306e 2 \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092\u4f5c\u6210\u3057\u3066\u3001\u305d\u306e 2 <span translate=no>_^_4_^_</span> \u3064\u306e\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u304b\u3089\u53d6\u5f97\u3057\u307e\u3059\u3002<span translate=no>_^_5_^_</span><span translate=no>_^_6_^_</span><span translate=no>_^_7_^_</span>\u3068\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u306e\u521d\u671f\u30ec\u30a4\u30e4\u30fc\u3092\u5171\u6709\u3057\u307e\u3059\u3002</p>\n",
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
"<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs <span translate=no>_^_0_^_</span> features </p>\n": "<p>\u5b8c\u5168\u306b\u63a5\u7d9a\u3055\u308c\u305f\u30ec\u30a4\u30e4\u30fc\u306f\u30013 \u756a\u76ee\u306e\u30b3\u30f3\u30dc\u30ea\u30e5\u30fc\u30b7\u30e7\u30f3\u30ec\u30a4\u30e4\u30fc\u304b\u3089\u30d5\u30e9\u30c3\u30c8\u5316\u3055\u308c\u305f\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u308a\u51fa\u3057\u3001\u30d5\u30a3\u30fc\u30c1\u30e3\u3092\u51fa\u529b\u3057\u307e\u3059\u3002<span translate=no>_^_0_^_</span></p>\n",
"<p>Convolution </p>\n": "<p>\u30b3\u30f3\u30dc\u30ea\u30e5\u30fc\u30b7\u30e7\u30f3</p>\n",
"<p>Linear layer </p>\n": "<p>\u30ea\u30cb\u30a2\u30ec\u30a4\u30e4\u30fc</p>\n",
"<p>Reshape for linear layers </p>\n": "<p>\u7dda\u5f62\u30ec\u30a4\u30e4\u30fc\u306e\u5f62\u72b6\u3092\u5909\u66f4</p>\n",
"<p>The first convolution layer takes a <span translate=no>_^_0_^_</span> frame and produces a <span translate=no>_^_1_^_</span> frame </p>\n": "<p><span translate=no>_^_0_^_</span>\u6700\u521d\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u308a\u3001\u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",
"<p>The second convolution layer takes a <span translate=no>_^_0_^_</span> frame and produces a <span translate=no>_^_1_^_</span> frame </p>\n": "<p>2 \u756a\u76ee\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f\u3001<span translate=no>_^_0_^_</span>\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u5f97\u3057\u3066\u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",
"<p>The third convolution layer takes a <span translate=no>_^_0_^_</span> frame and produces a <span translate=no>_^_1_^_</span> frame </p>\n": "<p>3 \u756a\u76ee\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f\u3001<span translate=no>_^_0_^_</span>\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u5f97\u3057\u3066\u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002<span translate=no>_^_1_^_</span></p>\n",
"<p>This head gives the action value <span translate=no>_^_0_^_</span> </p>\n": "<p>\u3053\u306e\u30d8\u30c3\u30c9\u306f\u30a2\u30af\u30b7\u30e7\u30f3\u5024\u3092\u4e0e\u3048\u307e\u3059 <span translate=no>_^_0_^_</span></p>\n",
"<p>This head gives the state value <span translate=no>_^_0_^_</span> </p>\n": "<p>\u3053\u306e\u30d8\u30c3\u30c9\u306f\u72b6\u614b\u5024\u3092\u4e0e\u3048\u307e\u3059 <span translate=no>_^_0_^_</span></p>\n",
"Deep Q Network (DQN) Model": "\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN) \u30e2\u30c7\u30eb",
"Implementation of neural network model for Deep Q Network (DQN).": "\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN) \u7528\u306e\u30cb\u30e5\u30fc\u30e9\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3002"
}
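
The model entries above describe the dueling architecture: shared convolutional features feed a state-value head V(s) and an advantage head A(s, a), combined as Q = V + A − mean(A). A minimal sketch of that aggregation, with made-up layer sizes:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Combine a state-value head and an advantage head into Q-values."""

    def __init__(self, n_features: int = 512, n_actions: int = 4):
        super().__init__()
        self.value = nn.Linear(n_features, 1)               # V(s)
        self.advantage = nn.Linear(n_features, n_actions)   # A(s, a)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        v = self.value(h)        # [batch, 1]
        a = self.advantage(h)    # [batch, n_actions]
        # Subtracting the mean advantage keeps V and A identifiable
        return v + a - a.mean(dim=-1, keepdim=True)
```
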
translate_cache/rl/dqn/readme.ja.json (new file, 4 lines)
@@ -0,0 +1,4 @@
{
"<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">Deep Q Networks (DQN)</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1312.5602\">Playing Atari with Deep Reinforcement Learning</a> along with <a href=\"https://nn.labml.ai/rl/dqn/model.html\">Dueling Network</a>, <a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">Prioritized Replay</a> and Double Q Network.</p>\n<p>Here is the <a href=\"https://nn.labml.ai/rl/dqn/experiment.html\">experiment</a> and <a href=\"https://nn.labml.ai/rl/dqn/model.html\">model</a> implementation.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/dqn/index.html\">\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)</a></h1>\n<p>\u3053\u308c\u306f\u3001<a href=\"https://papers.labml.ai/paper/1312.5602\">\u30c7\u30a3\u30fc\u30d7\u5f37\u5316\u5b66\u7fd2\u3092\u4f7f\u3063\u305f\u30a2\u30bf\u30ea\u30d7\u30ec\u30a4\u3068\u30c7\u30e5\u30a8\u30eb\u30cd\u30c3\u30c8\u30ef\u30fc\u30af</a><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3001<a href=\"https://nn.labml.ai/rl/dqn/replay_buffer.html\">\u512a\u5148\u30ea\u30d7\u30ec\u30a4</a>\u3001<a href=\"https://pytorch.org\">\u30c0\u30d6\u30ebQ\u30cd\u30c3\u30c8\u30ef\u30fc\u30af\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://nn.labml.ai/rl/dqn/experiment.html\"><a href=\"https://nn.labml.ai/rl/dqn/model.html\">\u3053\u308c\u304c\u5b9f\u9a13\u3068\u30e2\u30c7\u30eb\u306e\u5b9f\u88c5\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/dqn/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"Deep Q Networks (DQN)": "\u30c7\u30a3\u30fc\u30d7Q\u30cd\u30c3\u30c8\u30ef\u30fc\u30af (DQN)"
}
translate_cache/rl/dqn/replay_buffer.ja.json (new file, 48 lines)
File diff suppressed because one or more lines are too long

translate_cache/rl/game.ja.json (new file, 28 lines)
@@ -0,0 +1,28 @@
{
"<h1>Atari wrapper with multi-processing</h1>\n": "<h1>\u30de\u30eb\u30c1\u30d7\u30ed\u30bb\u30c3\u30b7\u30f3\u30b0\u6a5f\u80fd\u3092\u5099\u3048\u305f Atari \u30e9\u30c3\u30d1\u30fc</h1>\n",
"<h2>Worker Process</h2>\n<p>Each worker process runs this method</p>\n": "<h2>\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9</h2>\n<p>\u5404\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9\u306f\u3053\u306e\u30e1\u30bd\u30c3\u30c9\u3092\u5b9f\u884c\u3057\u307e\u3059</p>\n",
"<h3>Reset environment</h3>\n<p>Clean up episode info and 4 frame stack</p>\n": "<h3>\u74b0\u5883\u3092\u30ea\u30bb\u30c3\u30c8</h3>\n<p>\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u30684\u30d5\u30ec\u30fc\u30e0\u30b9\u30bf\u30c3\u30af\u306e\u30af\u30ea\u30fc\u30f3\u30a2\u30c3\u30d7</p>\n",
"<h3>Step</h3>\n<p>Executes <span translate=no>_^_0_^_</span> for 4 time steps and returns a tuple of (observation, reward, done, episode_info).</p>\n<ul><li><span translate=no>_^_1_^_</span>: stacked 4 frames (this frame and frames for last 3 actions) </li>\n<li><span translate=no>_^_2_^_</span>: total reward while the action was executed </li>\n<li><span translate=no>_^_3_^_</span>: whether the episode finished (a life lost) </li>\n<li><span translate=no>_^_4_^_</span>: episode information if completed</li></ul>\n": "<h3>\u30b9\u30c6\u30c3\u30d7</h3>\n<p><span translate=no>_^_0_^_</span>4\u3064\u306e\u30bf\u30a4\u30e0\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3057\u3001(\u89b3\u6e2c\u3001\u5831\u916c\u3001\u5b8c\u4e86\u3001\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831) \u306e\u30bf\u30d7\u30eb\u3092\u8fd4\u3057\u307e\u3059\u3002</p>\n<ul><li><span translate=no>_^_1_^_</span>: 4 \u3064\u306e\u30d5\u30ec\u30fc\u30e0\u3092\u7a4d\u307f\u91cd\u306d\u305f (\u3053\u306e\u30d5\u30ec\u30fc\u30e0\u3068\u6700\u5f8c\u306e 3 \u30a2\u30af\u30b7\u30e7\u30f3\u306e\u30d5\u30ec\u30fc\u30e0)</li>\n<li><span translate=no>_^_2_^_</span>: \u30a2\u30af\u30b7\u30e7\u30f3\u5b9f\u884c\u4e2d\u306e\u5831\u916c\u306e\u5408\u8a08</li>\n<li><span translate=no>_^_3_^_</span>: \u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u7d42\u308f\u3063\u305f\u304b\u3069\u3046\u304b (\u547d\u304c\u5931\u308f\u308c\u305f)</li>\n</ul><li><span translate=no>_^_4_^_</span>: \u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831 (\u5b8c\u4e86\u3057\u305f\u5834\u5408)</li>\n",
"<h4>Process game frames</h4>\n<p>Convert game frames to gray and rescale to 84x84</p>\n": "<h4>\u30b2\u30fc\u30e0\u30d5\u30ec\u30fc\u30e0\u306e\u51e6\u7406</h4>\n<p>\u30b2\u30fc\u30e0\u30d5\u30ec\u30fc\u30e0\u3092\u30b0\u30ec\u30fc\u306b\u5909\u63db\u3057\u300184x84\u306b\u518d\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0</p>\n",
"<p> <a id=\"GameEnvironment\"></a></p>\n<h2>Game environment</h2>\n<p>This is a wrapper for OpenAI gym game environment. We do a few things here:</p>\n<p>1. Apply the same action on four frames and get the last frame 2. Convert observation frames to gray and scale it to (84, 84) 3. Stack four frames of the last four actions 4. Add episode information (total reward for the entire episode) for monitoring 5. Restrict an episode to a single life (game has 5 lives, we reset after every single life)</p>\n<h4>Observation format</h4>\n<p>Observation is tensor of size (4, 84, 84). It is four frames (images of the game screen) stacked on first axis. i.e, each channel is a frame.</p>\n": "<p><a id=\"GameEnvironment\"></a></p>\n<h2>\u30b2\u30fc\u30e0\u74b0\u5883</h2>\n<p>\u3053\u308c\u306fOpenAI\u30b8\u30e0\u30b2\u30fc\u30e0\u74b0\u5883\u306e\u30e9\u30c3\u30d1\u30fc\u3067\u3059\u3002\u3053\u3053\u3067\u306f\u3044\u304f\u3064\u304b\u306e\u3053\u3068\u3092\u884c\u3044\u307e\u3059\u3002</p>\n<p>1\u30024 \u3064\u306e\u30d5\u30ec\u30fc\u30e0\u306b\u540c\u3058\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u9069\u7528\u3057\u3001\u6700\u5f8c\u306e\u30d5\u30ec\u30fc\u30e0 2 \u3092\u53d6\u5f97\u3057\u307e\u3059\u3002\u89b3\u6e2c\u30d5\u30ec\u30fc\u30e0\u3092\u30b0\u30ec\u30fc\u306b\u5909\u63db\u3057\u3001(84\u300184) 3 \u306b\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0\u3057\u307e\u3059\u3002\u6700\u5f8c\u306e4\u3064\u306e\u30a2\u30af\u30b7\u30e7\u30f3\u30924\u30d5\u30ec\u30fc\u30e0\u91cd\u306d\u308b 4.\u30e2\u30cb\u30bf\u30ea\u30f3\u30b0\u7528\u306e\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831 (\u30a8\u30d4\u30bd\u30fc\u30c9\u5168\u4f53\u306e\u5831\u916c\u7dcf\u984d) \u3092\u8ffd\u52a0 5.\u30a8\u30d4\u30bd\u30fc\u30c9\u30921\u3064\u306e\u30e9\u30a4\u30d5\u306b\u5236\u9650\u3057\u307e\u3059\uff08\u30b2\u30fc\u30e0\u306b\u306f\u30e9\u30a4\u30d5\u304c5\u3064\u3042\u308a\u3001\u30e9\u30a4\u30d5\u304c1\u3064\u5897\u3048\u308b\u305f\u3073\u306b\u30ea\u30bb\u30c3\u30c8\u3055\u308c\u307e\u3059</p>\uff09\n<h4>\u89b3\u6e2c\u30d5\u30a9\u30fc\u30de\u30c3\u30c8</h4>\n<p>\u89b3\u6e2c\u5024\u306f\u30b5\u30a4\u30ba (4, 84, 84) \u306e\u30c6\u30f3\u30bd\u30eb\u3067\u3059\u3002\u6700\u521d\u306e\u8ef8\u306b\u7a4d\u307f\u91cd\u306d\u3089\u308c\u305f4\u3064\u306e\u30d5\u30ec\u30fc\u30e0\uff08\u30b2\u30fc\u30e0\u753b\u9762\u306e\u753b\u50cf\uff09\u3067\u3059\u3002\u3064\u307e\u308a\u3001\u5404\u30c1\u30e3\u30f3\u30cd\u30eb\u306f\u30d5\u30ec\u30fc\u30e0\u3067\u3059</p>\u3002\n",
"<p> Creates a new worker and runs it in a separate process.</p>\n": "<p>\u65b0\u3057\u3044\u30ef\u30fc\u30ab\u30fc\u3092\u4f5c\u6210\u3057\u3001\u5225\u306e\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3057\u307e\u3059\u3002</p>\n",
"<p>and number of lives left </p>\n": "<p>\u305d\u3057\u3066\u6b8b\u3055\u308c\u305f\u547d\u306e\u6570</p>\n",
"<p>buffer to keep the maximum of last 2 frames </p>\n": "<p>\u6700\u5f8c\u306e 2 \u30d5\u30ec\u30fc\u30e0\u307e\u3067\u4fdd\u5b58\u3059\u308b\u30d0\u30c3\u30d5\u30a1</p>\n",
"<p>create environment </p>\n": "<p>\u74b0\u5883\u3092\u4f5c\u6210</p>\n",
"<p>create game </p>\n": "<p>\u30b2\u30fc\u30e0\u4f5c\u6210</p>\n",
"<p>execute the action in the OpenAI Gym environment </p>\n": "<p>OpenAI \u30b8\u30e0\u74b0\u5883\u3067\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3059\u308b</p>\n",
"<p>get number of lives left </p>\n": "<p>\u6b8b\u308a\u30e9\u30a4\u30d5\u6570\u3092\u53d6\u5f97</p>\n",
"<p>get the max of last two frames </p>\n": "<p>\u6700\u5f8c\u306e 2 \u30d5\u30ec\u30fc\u30e0\u306e\u6700\u5927\u5024\u3092\u53d6\u5f97</p>\n",
"<p>if finished, set episode information if episode is over, and reset </p>\n": "<p>\u7d42\u4e86\u3057\u305f\u3089\u3001\u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u7d42\u4e86\u3057\u305f\u3089\u30a8\u30d4\u30bd\u30fc\u30c9\u60c5\u5831\u3092\u8a2d\u5b9a\u3057\u3001\u30ea\u30bb\u30c3\u30c8\u3057\u307e\u3059</p>\n",
"<p>keep track of the episode rewards </p>\n": "<p>\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u5831\u916c\u3092\u628a\u63e1\u3057\u3066\u304a\u3051</p>\n",
"<p>maintain rewards for each step </p>\n": "<p>\u5404\u30b9\u30c6\u30c3\u30d7\u306e\u5831\u916c\u3092\u7dad\u6301</p>\n",
"<p>push it to the stack of 4 frames </p>\n": "<p>4\u30d5\u30ec\u30fc\u30e0\u306e\u30b9\u30bf\u30c3\u30af\u306b\u30d7\u30c3\u30b7\u30e5</p>\n",
"<p>reset OpenAI Gym environment </p>\n": "<p>OpenAI \u30b8\u30e0\u74b0\u5883\u3092\u30ea\u30bb\u30c3\u30c8</p>\n",
"<p>reset caches </p>\n": "<p>\u30ad\u30e3\u30c3\u30b7\u30e5\u3092\u30ea\u30bb\u30c3\u30c8</p>\n",
"<p>reset if a life is lost </p>\n": "<p>\u547d\u304c\u5931\u308f\u308c\u305f\u3089\u30ea\u30bb\u30c3\u30c8</p>\n",
"<p>run for 4 steps </p>\n": "<p>4 \u30b9\u30c6\u30c3\u30d7\u5b9f\u884c</p>\n",
"<p>tensor for a stack of 4 frames </p>\n": "<p>4\u30d5\u30ec\u30fc\u30e0\u306e\u30b9\u30bf\u30c3\u30af\u306e\u30c6\u30f3\u30bd\u30eb</p>\n",
"<p>wait for instructions from the connection and execute them </p>\n": "<p>\u63a5\u7d9a\u304b\u3089\u306e\u6307\u793a\u3092\u5f85\u3063\u3066\u5b9f\u884c\u3059\u308b</p>\n",
"Atari wrapper with multi-processing": "\u30de\u30eb\u30c1\u30d7\u30ed\u30bb\u30c3\u30b7\u30f3\u30b0\u6a5f\u80fd\u3092\u5099\u3048\u305f Atari \u30e9\u30c3\u30d1\u30fc",
"This implements the Atari games with multi-processing.": "\u3053\u308c\u306b\u3088\u308a\u3001Atari\u306e\u30b2\u30fc\u30e0\u304c\u30de\u30eb\u30c1\u30d7\u30ed\u30bb\u30c3\u30b7\u30f3\u30b0\u3067\u5b9f\u88c5\u3055\u308c\u307e\u3059\u3002"
}
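
The wrapper entries above describe the observation pipeline: take the max over the last two raw frames, convert to grayscale, rescale to 84×84, and keep a stack of the last four processed frames as the (4, 84, 84) observation. A rough sketch using OpenCV and NumPy; the function name and the caller-managed frame stack are illustrative, not the wrapper's actual interface:

```python
import cv2
import numpy as np


def process_frame(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Turn the last two raw RGB frames into a single 84x84 grayscale frame."""
    # Max over the last two frames smooths out Atari sprite flicker
    frame = np.maximum(frame_a, frame_b)
    # Convert to grayscale and rescale to 84x84
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)


# Observation is a stack of the 4 most recent processed frames: shape (4, 84, 84)
stack = np.zeros((4, 84, 84), dtype=np.uint8)
new_frame = process_frame(np.zeros((210, 160, 3), np.uint8), np.zeros((210, 160, 3), np.uint8))
stack = np.concatenate([stack[1:], new_frame[None]], axis=0)  # push the newest frame
```
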
translate_cache/rl/ppo/__init__.ja.json (new file, 9 lines)
@@ -0,0 +1,9 @@
{
"<h1>Proximal Policy Optimization - PPO</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a single sample causes problems because the policy deviates too much, producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>. The experiment uses <a href=\"gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</h1>\n<p><a href=\"https://papers.labml.ai/paper/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u30b7\u30f3\u30d7\u30eb\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb (\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u30bb\u30c3\u30c8) \u3054\u3068\u306b 1 \u56de\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u306b\u306a\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a href=\"gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a 
href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
"<h2>Clipped Value Function Loss</h2>\n<p>Similarly we clip the value function update also.</p>\n<span translate=no>_^_0_^_</span><p>Clipping makes sure the value function <span translate=no>_^_1_^_</span> doesn't deviate significantly from <span translate=no>_^_2_^_</span>.</p>\n": "<h2>\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u306e\u640d\u5931</h2>\n<p>\u540c\u69d8\u306b\u3001\u5024\u95a2\u6570\u306e\u66f4\u65b0\u3082\u30af\u30ea\u30c3\u30d7\u3057\u307e\u3059\u3002</p>\n<span translate=no>_^_0_^_</span><p>\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u306b\u3088\u308a\u3001<span translate=no>_^_1_^_</span>\u5024\u95a2\u6570\u304c\u5927\u304d\u304f\u305a\u308c\u306a\u3044\u3088\u3046\u306b\u3057\u307e\u3059\u3002<span translate=no>_^_2_^_</span></p>\n",
"<h2>PPO Loss</h2>\n<p>Here's how the PPO update rule is derived.</p>\n<p>We want to maximize policy reward <span translate=no>_^_0_^_</span> where <span translate=no>_^_1_^_</span> is the reward, <span translate=no>_^_2_^_</span> is the policy, <span translate=no>_^_3_^_</span> is a trajectory sampled from policy, and <span translate=no>_^_4_^_</span> is the discount factor between <span translate=no>_^_5_^_</span>.</p>\n<span translate=no>_^_6_^_</span><p>So, <span translate=no>_^_7_^_</span></p>\n<p>Define discounted-future state distribution, <span translate=no>_^_8_^_</span></p>\n<p>Then,</p>\n<span translate=no>_^_9_^_</span><p>Importance sampling <span translate=no>_^_10_^_</span> from <span translate=no>_^_11_^_</span>,</p>\n<span translate=no>_^_12_^_</span><p>Then we assume <span translate=no>_^_13_^_</span> and <span translate=no>_^_14_^_</span> are similar. The error we introduce to <span translate=no>_^_15_^_</span> by this assumption is bound by the KL divergence between <span translate=no>_^_16_^_</span> and <span translate=no>_^_17_^_</span>. <a href=\"https://papers.labml.ai/paper/1705.10528\">Constrained Policy Optimization</a> shows the proof of this. I haven't read it.</p>\n<span translate=no>_^_18_^_</span>": "<h2>PPO \u30ed\u30b9</h2>\n<p>PPO \u66f4\u65b0\u30eb\u30fc\u30eb\u306f\u6b21\u306e\u65b9\u6cd5\u3067\u5c0e\u304d\u51fa\u3055\u308c\u307e\u3059\u3002</p>\n<p><span translate=no>_^_0_^_</span>\u3053\u3053\u3067\u3001<span translate=no>_^_1_^_</span>\u304c\u5831\u916c\u3001\u304c\u30dd\u30ea\u30b7\u30fc\u3001<span translate=no>_^_2_^_</span>\u304c\u30dd\u30ea\u30b7\u30fc\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u305f\u8ecc\u8de1\u3001<span translate=no>_^_3_^_</span><span translate=no>_^_4_^_</span>\u305d\u3057\u3066\u305d\u306e\u9593\u306e\u5272\u5f15\u4fc2\u6570\u3067\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u5831\u916c\u3092\u6700\u5927\u5316\u3057\u305f\u3044\u3068\u8003\u3048\u3066\u3044\u307e\u3059\u3002<span translate=no>_^_5_^_</span></p>\n<span translate=no>_^_6_^_</span><p>\u3060\u304b\u3089\u3001<span translate=no>_^_7_^_</span></p>\n<p>\u5272\u5f15\u5f8c\u306e\u5c06\u6765\u306e\u72b6\u614b\u5206\u5e03\u3092\u5b9a\u7fa9\u3057\u3001<span translate=no>_^_8_^_</span></p>\n<p>\u6b21\u306b\u3001</p>\n<span translate=no>_^_9_^_</span><p><span translate=no>_^_10_^_</span><span translate=no>_^_11_^_</span>\u304b\u3089\u306e\u91cd\u8981\u5ea6\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0</p>\n<span translate=no>_^_12_^_</span><p>\u305d\u3046\u3059\u308b\u3068\u3001<span translate=no>_^_13_^_</span><span translate=no>_^_14_^_</span>\u4f3c\u305f\u3088\u3046\u306a\u3082\u306e\u3060\u3068\u4eee\u5b9a\u3057\u307e\u3059\u3002<span translate=no>_^_15_^_</span>\u3053\u306e\u4eee\u5b9a\u306b\u3088\u3063\u3066\u751f\u3058\u308b\u8aa4\u5dee\u306f\u3001<span translate=no>_^_16_^_</span>\u3068\u306e\u9593\u306e KL \u306e\u76f8\u9055\u306b\u3088\u3063\u3066\u6c7a\u307e\u308a\u307e\u3059\u3002<span translate=no>_^_17_^_</span><a href=\"https://papers.labml.ai/paper/1705.10528\">\u5236\u7d04\u4ed8\u304d\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316\u306f\u305d\u306e\u8a3c\u62e0\u3067\u3059</a>\u3002\u307e\u3060\u8aad\u3093\u3067\u306a\u3044\u3088\u3002</p>\n<span translate=no>_^_18_^_</span>",
"<h3>Cliping the policy ratio</h3>\n<span translate=no>_^_0_^_</span><p>The ratio is clipped to be close to 1. We take the minimum so that the gradient will only pull <span translate=no>_^_1_^_</span> towards <span translate=no>_^_2_^_</span> if the ratio is not between <span translate=no>_^_3_^_</span> and <span translate=no>_^_4_^_</span>. This keeps the KL divergence between <span translate=no>_^_5_^_</span> and <span translate=no>_^_6_^_</span> constrained. Large deviation can cause performance collapse; where the policy performance drops and doesn't recover because we are sampling from a bad policy.</p>\n<p>Using the normalized advantage <span translate=no>_^_7_^_</span> introduces a bias to the policy gradient estimator, but it reduces variance a lot. </p>\n": "<h3>\u30dd\u30ea\u30b7\u30fc\u6bd4\u7387\u306e\u30af\u30ea\u30c3\u30d4\u30f3\u30b0</h3>\n<span translate=no>_^_0_^_</span><p>\u6bd4\u7387\u306f 1 \u306b\u8fd1\u3065\u304f\u3088\u3046\u306b\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3055\u308c\u307e\u3059\u3002<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span><span translate=no>_^_3_^_</span>\u6bd4\u7387\u304c\u3068\u306e\u9593\u3067\u306a\u3044\u5834\u5408\u306b\u306e\u307f\u52fe\u914d\u304c\u50be\u304f\u3088\u3046\u306b\u6700\u5c0f\u5316\u3057\u3066\u3044\u307e\u3059<span translate=no>_^_4_^_</span>\u3002\u3053\u308c\u306b\u3088\u308a\u3001\u3068\u306e\u9593\u306e KL <span translate=no>_^_5_^_</span> \u306e\u76f8\u9055\u304c\u6291\u3048\u3089\u308c\u307e\u3059<span translate=no>_^_6_^_</span>\u3002\u5927\u304d\u306a\u504f\u5dee\u304c\u3042\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u304c\u4f4e\u4e0b\u3057\u3001\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u3066\u3044\u308b\u305f\u3081\u306b\u30dd\u30ea\u30b7\u30fc\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u304c\u4f4e\u4e0b\u3057\u3001\u56de\u5fa9\u3057\u306a\u3044\u5834\u5408\u304c\u3042\u308a\u307e\u3059\u3002</p>\n<p>\u6b63\u898f\u5316\u3055\u308c\u305f\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u3092\u4f7f\u7528\u3059\u308b\u3068\u3001<span translate=no>_^_7_^_</span>\u30dd\u30ea\u30b7\u30fc\u52fe\u914d\u63a8\u5b9a\u91cf\u306b\u504f\u308a\u304c\u751f\u3058\u307e\u3059\u304c\u3001\u5206\u6563\u306f\u5927\u5e45\u306b\u6e1b\u5c11\u3057\u307e\u3059\u3002</p>\n",
"<p>ratio <span translate=no>_^_0_^_</span>; <em>this is different from rewards</em> <span translate=no>_^_1_^_</span>. </p>\n": "<p>\u6bd4\u7387<span translate=no>_^_0_^_</span>\u3002<em>\u3053\u308c\u306f\u5831\u916c\u3068\u306f\u7570\u306a\u308a\u307e\u3059</em><span translate=no>_^_1_^_</span>\u3002</p>\n",
"An annotated implementation of Proximal Policy Optimization - PPO algorithm in PyTorch.": "PyTorch\u306e\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO\u30a2\u30eb\u30b4\u30ea\u30ba\u30e0\u306e\u6ce8\u91c8\u4ed8\u304d\u5b9f\u88c5\u3002",
"Proximal Policy Optimization - PPO": "\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO"
}
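
The PPO entries above describe the clipped surrogate objective, which keeps the updated policy close to the policy that sampled the data so that several gradient steps can be taken per sample. A compact sketch of that loss, assuming `log_pi`, `sampled_log_pi`, and a normalized `advantage` tensor (names are illustrative):

```python
import torch


def clipped_ppo_loss(log_pi: torch.Tensor,
                     sampled_log_pi: torch.Tensor,
                     advantage: torch.Tensor,
                     clip: float = 0.1) -> torch.Tensor:
    # Importance-sampling ratio r = pi(a|s) / pi_old(a|s), computed in log space
    ratio = torch.exp(log_pi - sampled_log_pi)
    # Clip the ratio to [1 - clip, 1 + clip]; taking the minimum means the gradient
    # only pulls pi towards pi_old when the ratio leaves that range
    clipped = ratio.clamp(1.0 - clip, 1.0 + clip)
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    # Maximize the surrogate reward -> minimize its negative mean
    return -surrogate.mean()
```
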
translate_cache/rl/ppo/experiment.ja.json (new file, 91 lines)
@@ -0,0 +1,91 @@
{
"<h1>PPO Experiment with Atari Breakout</h1>\n<p>This experiment trains Proximal Policy Optimization (PPO) agent Atari Breakout game on OpenAI Gym. It runs the <a href=\"../game.html\">game environments on multiple processes</a> to sample efficiently.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n": "<h1>\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bPPO\u5b9f\u9a13</h1>\n<p>\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001OpenAI Gym\u3067\u30d7\u30ed\u30ad\u30b7\u30de\u30eb\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316\uff08PPO\uff09\u30a8\u30fc\u30b8\u30a7\u30f3\u30c8\u306eAtari\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u30b2\u30fc\u30e0\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3057\u307e\u3059\u3002<a href=\"../game.html\">\u30b2\u30fc\u30e0\u74b0\u5883\u3092\u8907\u6570\u306e\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3057\u3066\u52b9\u7387\u7684\u306b\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3057\u307e\u3059</a>\u3002</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
|
||||
"<h2>Model</h2>\n": "<h2>\u30e2\u30c7\u30eb</h2>\n",
|
||||
"<h2>Run it</h2>\n": "<h2>\u5b9f\u884c\u3057\u3066\u304f\u3060\u3055\u3044</h2>\n",
|
||||
"<h2>Trainer</h2>\n": "<h2>\u30c8\u30ec\u30fc\u30ca\u30fc</h2>\n",
|
||||
"<h3>Calculate total loss</h3>\n": "<h3>\u7dcf\u640d\u5931\u306e\u8a08\u7b97</h3>\n",
|
||||
"<h3>Destroy</h3>\n<p>Stop the workers</p>\n": "<h3>\u7834\u58ca</h3>\n<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",
|
||||
"<h3>Run training loop</h3>\n": "<h3>\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u30eb\u30fc\u30d7\u3092\u5b9f\u884c</h3>\n",
|
||||
"<h3>Sample data with current policy</h3>\n": "<h3>\u73fe\u5728\u306e\u30dd\u30ea\u30b7\u30fc\u3092\u542b\u3080\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf</h3>\n",
|
||||
"<h3>Train the model based on samples</h3>\n": "<h3>\u30b5\u30f3\u30d7\u30eb\u306b\u57fa\u3065\u3044\u3066\u30e2\u30c7\u30eb\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b</h3>\n",
|
||||
"<h4>Configurations</h4>\n": "<h4>\u30b3\u30f3\u30d5\u30a3\u30ae\u30e5\u30ec\u30fc\u30b7\u30e7\u30f3</h4>\n",
|
||||
"<h4>Initialize</h4>\n": "<h4>[\u521d\u671f\u5316]</h4>\n",
|
||||
"<h4>Normalize advantage function</h4>\n": "<h4>\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u95a2\u6570\u306e\u6b63\u898f\u5316</h4>\n",
|
||||
"<p> </p>\n": "<p></p>\n",
|
||||
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
|
||||
"<p><span translate=no>_^_0_^_</span> keeps track of the last observation from each worker, which is the input for the model to sample the next action </p>\n": "<p><span translate=no>_^_0_^_</span>\u5404\u30ef\u30fc\u30ab\u30fc\u304b\u3089\u306e\u6700\u5f8c\u306e\u89b3\u6e2c\u5024\u3092\u8ffd\u8de1\u3057\u307e\u3059\u3002\u3053\u308c\u306f\u3001\u30e2\u30c7\u30eb\u304c\u6b21\u306e\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3059\u308b\u305f\u3081\u306e\u5165\u529b\u3067\u3059</p>\n",
|
||||
"<p><span translate=no>_^_0_^_</span> returns sampled from <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u305f\u30ea\u30bf\u30fc\u30f3 <span translate=no>_^_1_^_</span></p>\n",
|
||||
"<p><span translate=no>_^_0_^_</span>, <span translate=no>_^_1_^_</span> are actions sampled from <span translate=no>_^_2_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span><span translate=no>_^_1_^_</span>\u30a2\u30af\u30b7\u30e7\u30f3\u306f\u4ee5\u4e0b\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u307e\u3059 <span translate=no>_^_2_^_</span></p>\n",
|
||||
"<p><span translate=no>_^_0_^_</span>, where <span translate=no>_^_1_^_</span> is advantages sampled from <span translate=no>_^_2_^_</span>. Refer to sampling function in <a href=\"#main\">Main class</a> below for the calculation of <span translate=no>_^_3_^_</span>. </p>\n": "<p><span translate=no>_^_0_^_</span>\u3001<span translate=no>_^_1_^_</span><span translate=no>_^_2_^_</span>\u5229\u70b9\u306f\u3069\u3053\u304b\u3089\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u3066\u3044\u308b\u306e\u304b\u3002\u306e\u8a08\u7b97\u306b\u3064\u3044\u3066\u306f\u3001<a href=\"#main\">\u4e0b\u8a18\u306e\u30e1\u30a4\u30f3\u30af\u30e9\u30b9\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u95a2\u6570\u3092\u53c2\u7167\u3057\u3066\u304f\u3060\u3055\u3044</a><span translate=no>_^_3_^_</span>\u3002</p>\n",
|
||||
"<p>A fully connected layer takes the flattened frame from third convolution layer, and outputs 512 features </p>\n": "<p>\u5b8c\u5168\u7d50\u5408\u5c64\u306f\u30013 \u756a\u76ee\u306e\u7573\u307f\u8fbc\u307f\u5c64\u304b\u3089\u5e73\u5766\u5316\u3055\u308c\u305f\u30d5\u30ec\u30fc\u30e0\u3092\u53d6\u308a\u51fa\u3057\u3001512 \u500b\u306e\u7279\u5fb4\u3092\u51fa\u529b\u3057\u307e\u3059\u3002</p>\n",
|
||||
"<p>A fully connected layer to get logits for <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30ed\u30b8\u30c3\u30c8\u3092\u53d6\u5f97\u3059\u308b\u305f\u3081\u306e\u5b8c\u5168\u63a5\u7d9a\u30ec\u30a4\u30e4\u30fc <span translate=no>_^_0_^_</span></p>\n",
|
||||
"<p>A fully connected layer to get value function </p>\n": "<p>\u30d0\u30ea\u30e5\u30fc\u95a2\u6570\u3092\u5f97\u308b\u305f\u3081\u306e\u5b8c\u5168\u9023\u7d50\u30ec\u30a4\u30e4\u30fc</p>\n",
|
||||
"<p>Add a new line to the screen periodically </p>\n": "<p>\u753b\u9762\u306b\u5b9a\u671f\u7684\u306b\u65b0\u3057\u3044\u884c\u3092\u8ffd\u52a0\u3057\u3066\u304f\u3060\u3055\u3044</p>\n",
|
||||
"<p>Add to tracker </p>\n": "<p>\u30c8\u30e9\u30c3\u30ab\u30fc\u306b\u8ffd\u52a0</p>\n",
|
||||
"<p>Calculate Entropy Bonus</p>\n<p><span translate=no>_^_0_^_</span> </p>\n": "<p>\u30a8\u30f3\u30c8\u30ed\u30d4\u30fc\u30dc\u30fc\u30ca\u30b9\u306e\u8a08\u7b97</p>\n<p><span translate=no>_^_0_^_</span></p>\n",
|
||||
"<p>Calculate gradients </p>\n": "<p>\u52fe\u914d\u306e\u8a08\u7b97</p>\n",
|
||||
"<p>Calculate policy loss </p>\n": "<p>\u4fdd\u967a\u5951\u7d04\u640d\u5931\u306e\u8a08\u7b97</p>\n",
|
||||
"<p>Calculate value function loss </p>\n": "<p>\u5024\u95a2\u6570\u640d\u5931\u306e\u8a08\u7b97</p>\n",
|
||||
"<p>Clip gradients </p>\n": "<p>\u30af\u30ea\u30c3\u30d7\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3</p>\n",
|
||||
"<p>Clipping range </p>\n": "<p>\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u7bc4\u56f2</p>\n",
|
||||
"<p>Configurations </p>\n": "<p>\u30b3\u30f3\u30d5\u30a3\u30ae\u30e5\u30ec\u30fc\u30b7\u30e7\u30f3</p>\n",
|
||||
"<p>Create the experiment </p>\n": "<p>\u5b9f\u9a13\u3092\u4f5c\u6210</p>\n",
|
||||
"<p>Entropy bonus coefficient </p>\n": "<p>\u30a8\u30f3\u30c8\u30ed\u30d4\u30fc\u30dc\u30fc\u30ca\u30b9\u4fc2\u6570</p>\n",
|
||||
"<p>GAE with <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span> </p>\n": "<p>GATE (<span translate=no>_^_0_^_</span>\u304a\u3088\u3073\u4ed8\u304d) <span translate=no>_^_1_^_</span></p>\n",
|
||||
"<p>Get value of after the final step </p>\n": "<p>\u6700\u5f8c\u306e\u30b9\u30c6\u30c3\u30d7\u306e\u5f8c\u306b\u5024\u3092\u53d6\u5f97</p>\n",
|
||||
"<p>Initialize the trainer </p>\n": "<p>\u30c8\u30ec\u30fc\u30ca\u30fc\u3092\u521d\u671f\u5316</p>\n",
|
||||
"<p>It learns faster with a higher number of epochs, but becomes a little unstable; that is, the average episode reward does not monotonically increase over time. May be reducing the clipping range might solve it. </p>\n": "<p>\u30a8\u30dd\u30c3\u30af\u6570\u304c\u591a\u3044\u307b\u3069\u5b66\u7fd2\u306f\u901f\u304f\u306a\u308a\u307e\u3059\u304c\u3001\u5c11\u3057\u4e0d\u5b89\u5b9a\u306b\u306a\u308a\u307e\u3059\u3002\u3064\u307e\u308a\u3001\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u5e73\u5747\u5831\u916c\u306f\u6642\u9593\u306e\u7d4c\u904e\u3068\u3068\u3082\u306b\u5358\u8abf\u306b\u5897\u52a0\u3057\u307e\u305b\u3093\u3002\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u7bc4\u56f2\u3092\u72ed\u304f\u3059\u308b\u3053\u3068\u3067\u89e3\u6c7a\u3059\u308b\u53ef\u80fd\u6027\u304c\u3042\u308a\u307e\u3059\u3002</p>\n",
|
||||
"<p>Learning rate </p>\n": "<p>\u5b66\u7fd2\u7387</p>\n",
|
||||
"<p>Number of mini batches </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u6570</p>\n",
|
||||
"<p>Number of steps to run on each process for a single update </p>\n": "<p>1 \u56de\u306e\u66f4\u65b0\u3067\u5404\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3059\u308b\u30b9\u30c6\u30c3\u30d7\u306e\u6570</p>\n",
|
||||
"<p>Number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",
|
||||
"<p>Number of worker processes </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9\u306e\u6570</p>\n",
|
||||
"<p>PPO Loss </p>\n": "<p>PPO \u30ed\u30b9</p>\n",
|
||||
"<p>Run and monitor the experiment </p>\n": "<p>\u5b9f\u9a13\u306e\u5b9f\u884c\u3068\u76e3\u8996</p>\n",
|
||||
"<p>Sampled observations are fed into the model to get <span translate=no>_^_0_^_</span> and <span translate=no>_^_1_^_</span>; we are treating observations as state </p>\n": "<p><span translate=no>_^_0_^_</span>\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u3055\u308c\u305f\u89b3\u6e2c\u5024\u306f\u30e2\u30c7\u30eb\u306b\u5165\u529b\u3055\u308c\u3001\u53d6\u5f97\u3055\u308c\u307e\u3059<span translate=no>_^_1_^_</span>\u3002\u89b3\u6e2c\u5024\u306f\u72b6\u614b\u3068\u3057\u3066\u6271\u3044\u307e\u3059</p>\n",
|
||||
"<p>Save tracked indicators. </p>\n": "<p>\u8ffd\u8de1\u6307\u6a19\u3092\u4fdd\u5b58\u3057\u307e\u3059\u3002</p>\n",
|
||||
"<p>Scale observations from <span translate=no>_^_0_^_</span> to <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u89b3\u6e2c\u5024\u3092\u304b\u3089\u306b\u30b9\u30b1\u30fc\u30ea\u30f3\u30b0 <span translate=no>_^_1_^_</span></p>\n",
|
||||
"<p>Select device </p>\n": "<p>\u30c7\u30d0\u30a4\u30b9\u3092\u9078\u629e</p>\n",
|
||||
"<p>Set learning rate </p>\n": "<p>\u5b66\u7fd2\u7387\u3092\u8a2d\u5b9a</p>\n",
|
||||
"<p>Stop the workers </p>\n": "<p>\u52b4\u50cd\u8005\u3092\u6b62\u3081\u308d</p>\n",
|
||||
"<p>The first convolution layer takes a 84x84 frame and produces a 20x20 frame </p>\n": "<p>\u6700\u521d\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f 84 x 84 \u30d5\u30ec\u30fc\u30e0\u3067\u300120 x 20 \u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002</p>\n",
|
||||
"<p>The second convolution layer takes a 20x20 frame and produces a 9x9 frame </p>\n": "<p>2 \u756a\u76ee\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f 20x20 \u30d5\u30ec\u30fc\u30e0\u3067\u30019x9 \u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002</p>\n",
|
||||
"<p>The third convolution layer takes a 9x9 frame and produces a 7x7 frame </p>\n": "<p>3 \u756a\u76ee\u306e\u7573\u307f\u8fbc\u307f\u5c64\u306f 9x9 \u30d5\u30ec\u30fc\u30e0\u3067 7x7 \u30d5\u30ec\u30fc\u30e0\u3092\u751f\u6210\u3057\u307e\u3059\u3002</p>\n",
|
||||
"<p>Update parameters based on gradients </p>\n": "<p>\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u306b\u57fa\u3065\u3044\u3066\u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u66f4\u65b0</p>\n",
|
||||
"<p>Value Loss </p>\n": "<p>\u4fa1\u5024\u640d\u5931</p>\n",
|
||||
"<p>Value loss coefficient </p>\n": "<p>\u4fa1\u5024\u640d\u5931\u4fc2\u6570</p>\n",
|
||||
"<p>You can change this while the experiment is running. \u2699\ufe0f Learning rate. </p>\n": "<p>\u30c6\u30b9\u30c8\u306e\u5b9f\u884c\u4e2d\u306b\u3053\u308c\u3092\u5909\u66f4\u3067\u304d\u307e\u3059\u3002\u2699\ufe0f \u5b66\u7fd2\u7387\u3002</p>\n",
|
||||
"<p>Zero out the previously calculated gradients </p>\n": "<p>\u4ee5\u524d\u306b\u8a08\u7b97\u3057\u305f\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u3092\u30bc\u30ed\u306b\u3057\u307e\u3059</p>\n",
|
||||
"<p>calculate advantages </p>\n": "<p>\u5229\u70b9\u3092\u8a08\u7b97</p>\n",
|
||||
"<p>collect episode info, which is available if an episode finished; this includes total reward and length of the episode - look at <span translate=no>_^_0_^_</span> to see how it works. </p>\n": "<p>\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u60c5\u5831\u3092\u96c6\u3081\u307e\u3057\u3087\u3046\u3002<span translate=no>_^_0_^_</span>\u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u7d42\u4e86\u3057\u305f\u3068\u304d\u306b\u5165\u624b\u3067\u304d\u307e\u3059\u3002\u3053\u308c\u306b\u306f\u5831\u916c\u7dcf\u984d\u3084\u30a8\u30d4\u30bd\u30fc\u30c9\u306e\u9577\u3055\u304c\u542b\u307e\u308c\u307e\u3059\u3002\u4ed5\u7d44\u307f\u3092\u78ba\u8a8d\u3057\u3066\u307f\u307e\u3057\u3087\u3046\u3002</p>\n",
|
||||
"<p>create workers </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u3092\u4f5c\u6210</p>\n",
|
||||
"<p>for each mini batch </p>\n": "<p>\u5404\u30df\u30cb\u30d0\u30c3\u30c1\u7528</p>\n",
|
||||
"<p>for monitoring </p>\n": "<p>\u76e3\u8996\u7528</p>\n",
|
||||
"<p>get mini batch </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u3092\u5165\u624b</p>\n",
|
||||
"<p>get results after executing the actions </p>\n": "<p>\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c\u3057\u305f\u5f8c\u306b\u7d50\u679c\u3092\u53d6\u5f97</p>\n",
|
||||
"<p>initialize tensors for observations </p>\n": "<p>\u89b3\u6e2c\u7528\u306e\u30c6\u30f3\u30bd\u30eb\u3092\u521d\u671f\u5316</p>\n",
|
||||
"<p>last 100 episode information </p>\n": "<p>\u6700\u5f8c\u306e 100 \u8a71\u306e\u60c5\u5831</p>\n",
|
||||
"<p>model </p>\n": "<p>\u30e2\u30c7\u30eb</p>\n",
|
||||
"<p>number of epochs to train the model with sampled data </p>\n": "<p>\u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf\u3092\u4f7f\u7528\u3057\u3066\u30e2\u30c7\u30eb\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u30a8\u30dd\u30c3\u30af\u306e\u6570</p>\n",
|
||||
"<p>number of mini batches </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u6570</p>\n",
|
||||
"<p>number of steps to run on each process for a single update </p>\n": "<p>1 \u56de\u306e\u66f4\u65b0\u3067\u5404\u30d7\u30ed\u30bb\u30b9\u3067\u5b9f\u884c\u3059\u308b\u30b9\u30c6\u30c3\u30d7\u306e\u6570</p>\n",
|
||||
"<p>number of updates </p>\n": "<p>\u66f4\u65b0\u56de\u6570</p>\n",
|
||||
"<p>number of worker processes </p>\n": "<p>\u30ef\u30fc\u30ab\u30fc\u30d7\u30ed\u30bb\u30b9\u306e\u6570</p>\n",
|
||||
"<p>optimizer </p>\n": "<p>\u30aa\u30d7\u30c6\u30a3\u30de\u30a4\u30b6\u30fc</p>\n",
|
||||
"<p>run sampled actions on each worker </p>\n": "<p>\u5404\u30ef\u30fc\u30ab\u30fc\u3067\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3\u3092\u5b9f\u884c</p>\n",
|
||||
"<p>sample <span translate=no>_^_0_^_</span> from each worker </p>\n": "<p><span translate=no>_^_0_^_</span>\u5404\u52b4\u50cd\u8005\u304b\u3089\u306e\u30b5\u30f3\u30d7\u30eb</p>\n",
|
||||
"<p>sample actions from <span translate=no>_^_0_^_</span> for each worker; this returns arrays of size <span translate=no>_^_1_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span>\u5404\u30ef\u30fc\u30ab\u30fc\u306e\u30b5\u30f3\u30d7\u30eb\u30a2\u30af\u30b7\u30e7\u30f3\u3002\u3053\u308c\u306f\u30b5\u30a4\u30ba\u306e\u914d\u5217\u3092\u8fd4\u3057\u307e\u3059 <span translate=no>_^_1_^_</span></p>\n",
|
||||
"<p>sample with current policy </p>\n": "<p>\u73fe\u884c\u30dd\u30ea\u30b7\u30fc\u306e\u30b5\u30f3\u30d7\u30eb</p>\n",
|
||||
"<p>samples are currently in <span translate=no>_^_0_^_</span> table, we should flatten it for training </p>\n": "<p><span translate=no>_^_0_^_</span>\u30b5\u30f3\u30d7\u30eb\u306f\u73fe\u5728\u30c6\u30fc\u30d6\u30eb\u306b\u3042\u308b\u306e\u3067\u3001\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u7528\u306b\u5e73\u3089\u306b\u3059\u308b\u5fc5\u8981\u304c\u3042\u308a\u307e\u3059</p>\n",
|
||||
"<p>shuffle for each epoch </p>\n": "<p>\u5404\u30a8\u30dd\u30c3\u30af\u306e\u30b7\u30e3\u30c3\u30d5\u30eb</p>\n",
|
||||
"<p>size of a mini batch </p>\n": "<p>\u30df\u30cb\u30d0\u30c3\u30c1\u306e\u30b5\u30a4\u30ba</p>\n",
|
||||
"<p>total number of samples for a single update </p>\n": "<p>1 \u56de\u306e\u66f4\u65b0\u3067\u306e\u30b5\u30f3\u30d7\u30eb\u306e\u7dcf\u6570</p>\n",
|
||||
"<p>train </p>\n": "<p>\u5217\u8eca</p>\n",
|
||||
"<p>train the model </p>\n": "<p>\u30e2\u30c7\u30eb\u306e\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0</p>\n",
|
||||
"<p>\u2699\ufe0f Clip range. </p>\n": "<p>\u2699\ufe0f \u30af\u30ea\u30c3\u30d7\u30ec\u30f3\u30b8\u3002</p>\n",
|
||||
"<p>\u2699\ufe0f Entropy bonus coefficient. You can change this while the experiment is running. </p>\n": "<p>\u2699\ufe0f \u30a8\u30f3\u30c8\u30ed\u30d4\u30fc\u30dc\u30fc\u30ca\u30b9\u4fc2\u6570\u3002\u3053\u308c\u306f\u5b9f\u9a13\u306e\u5b9f\u884c\u4e2d\u306b\u5909\u66f4\u3067\u304d\u307e\u3059\u3002</p>\n",
|
||||
"<p>\u2699\ufe0f Number of epochs to train the model with sampled data. You can change this while the experiment is running. </p>\n": "<p>\u2699\ufe0f \u30b5\u30f3\u30d7\u30eb\u30c7\u30fc\u30bf\u3092\u4f7f\u7528\u3057\u3066\u30e2\u30c7\u30eb\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u30a8\u30dd\u30c3\u30af\u306e\u6570\u3002\u3053\u308c\u306f\u5b9f\u9a13\u306e\u5b9f\u884c\u4e2d\u306b\u5909\u66f4\u3067\u304d\u307e\u3059\u3002</p>\n",
|
||||
"<p>\u2699\ufe0f Value loss coefficient. You can change this while the experiment is running. </p>\n": "<p>\u2699\ufe0f \u4fa1\u5024\u640d\u5931\u4fc2\u6570\u3002\u3053\u308c\u306f\u5b9f\u9a13\u306e\u5b9f\u884c\u4e2d\u306b\u5909\u66f4\u3067\u304d\u307e\u3059\u3002</p>\n",
|
||||
"Annotated implementation to train a PPO agent on Atari Breakout game.": "Atari Breakout \u30b2\u30fc\u30e0\u3067 PPO \u30a8\u30fc\u30b8\u30a7\u30f3\u30c8\u3092\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3059\u308b\u305f\u3081\u306e\u6ce8\u91c8\u4ed8\u304d\u5b9f\u88c5\u3002",
|
||||
"PPO Experiment with Atari Breakout": "\u30a2\u30bf\u30ea\u30fb\u30d6\u30ec\u30a4\u30af\u30a2\u30a6\u30c8\u306b\u3088\u308bPPO\u5b9f\u9a13"
|
||||
}
|
||||
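Several entries above pin down the model architecture: three convolutions take an 84x84 frame to 20x20, 9x9 and then 7x7 feature maps, a fully connected layer produces 512 features, and two heads output the action logits and the value. A rough PyTorch sketch consistent with those sizes; the kernel sizes, strides, channel counts and the 4-action output are the usual Atari choices and are assumptions here, not values read from this diff:

import torch
from torch import nn
from torch.distributions import Categorical

class Model(nn.Module):
    """Actor-critic sketch for stacked 84x84 Atari frames scaled to [0, 1]."""

    def __init__(self, n_actions: int = 4):
        super().__init__()
        # 84x84 frame -> 20x20 frame
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        # 20x20 frame -> 9x9 frame
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        # 9x9 frame -> 7x7 frame
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # Flattened 7x7x64 frame -> 512 features
        self.lin = nn.Linear(7 * 7 * 64, 512)
        self.activation = nn.ReLU()
        # Heads for action logits and the value function
        self.pi_logits = nn.Linear(512, n_actions)
        self.value = nn.Linear(512, 1)

    def forward(self, obs: torch.Tensor):
        h = self.activation(self.conv1(obs))
        h = self.activation(self.conv2(h))
        h = self.activation(self.conv3(h))
        h = self.activation(self.lin(h.reshape(h.shape[0], -1)))
        pi = Categorical(logits=self.pi_logits(h))
        value = self.value(h).reshape(-1)
        return pi, value

Returning a Categorical distribution lets the trainer call pi.log_prob(actions) for the policy loss and pi.entropy() for the entropy bonus mentioned in the entries above.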
10
translate_cache/rl/ppo/gae.ja.json
Normal file
@ -0,0 +1,10 @@
|
||||
{
|
||||
"<h1>Generalized Advantage Estimation (GAE)</h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of paper <a href=\"https://papers.labml.ai/paper/1506.02438\">Generalized Advantage Estimation</a>.</p>\n<p>You can find an experiment that uses it <a href=\"experiment.html\">here</a>.</p>\n": "<h1>\u4e00\u822c\u5316\u512a\u4f4d\u6027\u63a8\u5b9a (GAE)</h1>\n<p><a href=\"https://pytorch.org\"><a href=\"https://papers.labml.ai/paper/1506.02438\">\u3053\u308c\u306f\u7d19\u306e\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092PyTorch\u3067\u5b9f\u88c5\u3057\u305f\u3082\u306e\u3067\u3059</a></a>\u3002</p>\n<p><a href=\"experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002</p>\n",
|
||||
"<h3>Calculate advantages</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span> is high bias, low variance, whilst <span translate=no>_^_2_^_</span> is unbiased, high variance.</p>\n<p>We take a weighted average of <span translate=no>_^_3_^_</span> to balance bias and variance. This is called Generalized Advantage Estimation. <span translate=no>_^_4_^_</span> We set <span translate=no>_^_5_^_</span>, this gives clean calculation for <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>": "<h3>\u5229\u70b9\u3092\u8a08\u7b97</h3>\n<span translate=no>_^_0_^_</span><p><span translate=no>_^_1_^_</span>\u30d0\u30a4\u30a2\u30b9\u304c\u9ad8\u304f\u5206\u6563\u304c\u5c0f\u3055\u304f\u3001\u504f\u308a\u304c\u306a\u304f\u3001<span translate=no>_^_2_^_</span>\u5206\u6563\u304c\u5927\u304d\u3044\u3002</p>\n<p><span translate=no>_^_3_^_</span>\u30d0\u30a4\u30a2\u30b9\u3068\u5206\u6563\u306e\u30d0\u30e9\u30f3\u30b9\u3092\u53d6\u308b\u305f\u3081\u306b\u3001\u52a0\u91cd\u5e73\u5747\u3092\u53d6\u308a\u307e\u3059\u3002\u3053\u308c\u306f\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3068\u547c\u3070\u308c\u307e\u3059\u3002<span translate=no>_^_4_^_</span>\u8a2d\u5b9a\u3057\u307e\u3057\u305f\u3002\u3053\u308c\u306b\u3088\u308a<span translate=no>_^_5_^_</span>\u3001\u8a08\u7b97\u304c\u304d\u308c\u3044\u306b\u306a\u308a\u307e\u3059 <span translate=no>_^_6_^_</span></p>\n<span translate=no>_^_7_^_</span>",
|
||||
"<p><span translate=no>_^_0_^_</span> </p>\n": "<p><span translate=no>_^_0_^_</span></p>\n",
|
||||
"<p>advantages table </p>\n": "<p>\u5229\u70b9\u8868</p>\n",
|
||||
"<p>mask if episode completed after step <span translate=no>_^_0_^_</span> </p>\n": "<p>\u30b9\u30c6\u30c3\u30d7\u306e\u5f8c\u306b\u30a8\u30d4\u30bd\u30fc\u30c9\u304c\u5b8c\u4e86\u3057\u305f\u5834\u5408\u306f\u30de\u30b9\u30af <span translate=no>_^_0_^_</span></p>\n",
|
||||
"<p>note that we are collecting in reverse order. <em>My initial code was appending to a list and I forgot to reverse it later. It took me around 4 to 5 hours to find the bug. The performance of the model was improving slightly during initial runs, probably because the samples are similar.</em> </p>\n": "<p>\u9006\u306e\u9806\u5e8f\u3067\u53ce\u96c6\u3057\u3066\u3044\u308b\u3053\u3068\u306b\u6ce8\u610f\u3057\u3066\u304f\u3060\u3055\u3044\u3002<em>\u6700\u521d\u306e\u30b3\u30fc\u30c9\u306f\u30ea\u30b9\u30c8\u306b\u8ffd\u52a0\u3055\u308c\u3066\u3044\u3066\u3001\u5f8c\u3067\u5143\u306b\u623b\u3059\u306e\u3092\u5fd8\u308c\u307e\u3057\u305f\u3002\u30d0\u30b0\u3092\u898b\u3064\u3051\u308b\u306e\u306b\u7d044\u301c5\u6642\u9593\u304b\u304b\u308a\u307e\u3057\u305f\u3002\u30e2\u30c7\u30eb\u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u306f\u3001\u304a\u305d\u3089\u304f\u30b5\u30f3\u30d7\u30eb\u304c\u4f3c\u3066\u3044\u308b\u305f\u3081\u304b\u3001\u6700\u521d\u306e\u5b9f\u884c\u6642\u306b\u308f\u305a\u304b\u306b\u5411\u4e0a\u3057\u3066\u3044\u307e\u3057\u305f\u3002</em></p>\n",
|
||||
"A PyTorch implementation/tutorial of Generalized Advantage Estimation (GAE).": "\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a (GAE) \u306e PyTorch \u5b9f\u88c5/\u30c1\u30e5\u30fc\u30c8\u30ea\u30a2\u30eb\u3002",
|
||||
"Generalized Advantage Estimation (GAE)": "\u4e00\u822c\u5316\u512a\u4f4d\u6027\u63a8\u5b9a (GAE)"
|
||||
}
|
||||
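The advantage entry above describes a weighted average of n-step advantage estimates, collected in reverse order with a mask for finished episodes. A minimal NumPy sketch of that backward pass, assuming `rewards`, `values` and `done` are `[workers, steps]` arrays and `last_value` holds the bootstrap value after the final step (gamma and lambda_ are the discount and GAE parameters):

import numpy as np

def gae(done: np.ndarray, rewards: np.ndarray, values: np.ndarray,
        last_value: np.ndarray, gamma: float = 0.99,
        lambda_: float = 0.95) -> np.ndarray:
    """Generalized Advantage Estimation over [workers, steps] arrays."""
    n_workers, n_steps = rewards.shape
    advantages = np.zeros((n_workers, n_steps), dtype=np.float32)
    last_advantage = np.zeros(n_workers, dtype=np.float32)

    # Collect in reverse order: A_t = delta_t + gamma * lambda * A_{t+1}
    for t in reversed(range(n_steps)):
        # Mask out the bootstrap terms if the episode ended after step t
        mask = 1.0 - done[:, t]
        last_value = last_value * mask
        last_advantage = last_advantage * mask
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[:, t] + gamma * last_value - values[:, t]
        last_advantage = delta + gamma * lambda_ * last_advantage
        advantages[:, t] = last_advantage
        last_value = values[:, t]

    return advantages

The reverse iteration is exactly what the note above warns about: appending in forward order and forgetting to reverse attaches each advantage to the wrong time step, which can go unnoticed because neighbouring samples are similar.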
4
translate_cache/rl/ppo/readme.ja.json
Normal file
@ -0,0 +1,4 @@
|
||||
{
|
||||
"<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">Proximal Policy Optimization - PPO</a></h1>\n<p>This is a <a href=\"https://pytorch.org\">PyTorch</a> implementation of <a href=\"https://papers.labml.ai/paper/1707.06347\">Proximal Policy Optimization - PPO</a>.</p>\n<p>PPO is a policy gradient method for reinforcement learning. Simple policy gradient methods one do a single gradient update per sample (or a set of samples). Doing multiple gradient steps for a singe sample causes problems because the policy deviates too much producing a bad policy. PPO lets us do multiple gradient updates per sample by trying to keep the policy close to the policy that was used to sample data. It does so by clipping gradient flow if the updated policy is not close to the policy used to sample the data.</p>\n<p>You can find an experiment that uses it <a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">here</a>. The experiment uses <a href=\"https://nn.labml.ai/rl/ppo/gae.html\">Generalized Advantage Estimation</a>.</p>\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a> </p>\n": "<h1><a href=\"https://nn.labml.ai/rl/ppo/index.html\">\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO</a></h1>\n<p><a href=\"https://papers.labml.ai/paper/1707.06347\">\u3053\u308c\u306f\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316</a>\uff08PPO\uff09<a href=\"https://pytorch.org\">\u306ePyTorch\u5b9f\u88c5\u3067\u3059</a>\u3002</p>\n<p>PPO\u306f\u5f37\u5316\u5b66\u7fd2\u306e\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u6cd5\u3067\u3059\u3002\u5358\u7d14\u306a\u30dd\u30ea\u30b7\u30fc\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30e1\u30bd\u30c3\u30c9\u3067\u306f\u3001\u30b5\u30f3\u30d7\u30eb\uff08\u307e\u305f\u306f\u30b5\u30f3\u30d7\u30eb\u306e\u30bb\u30c3\u30c8\uff09\u3054\u3068\u306b1\u3064\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3044\u307e\u3059\u30021\u3064\u306e\u30b5\u30f3\u30d7\u30eb\u306b\u5bfe\u3057\u3066\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30b9\u30c6\u30c3\u30d7\u3092\u5b9f\u884c\u3059\u308b\u3068\u3001\u30dd\u30ea\u30b7\u30fc\u306e\u504f\u5dee\u304c\u5927\u304d\u3059\u304e\u3066\u4e0d\u9069\u5207\u306a\u30dd\u30ea\u30b7\u30fc\u304c\u751f\u6210\u3055\u308c\u308b\u305f\u3081\u3001\u554f\u984c\u304c\u767a\u751f\u3057\u307e\u3059\u3002PPO \u3067\u306f\u3001\u30dd\u30ea\u30b7\u30fc\u3092\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3057\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u8fd1\u3044\u72b6\u614b\u306b\u4fdd\u3064\u3053\u3068\u3067\u3001\u30b5\u30f3\u30d7\u30eb\u3054\u3068\u306b\u8907\u6570\u306e\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u66f4\u65b0\u3092\u884c\u3046\u3053\u3068\u304c\u3067\u304d\u307e\u3059\u3002\u66f4\u65b0\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u304c\u30c7\u30fc\u30bf\u306e\u30b5\u30f3\u30d7\u30ea\u30f3\u30b0\u306b\u4f7f\u7528\u3055\u308c\u305f\u30dd\u30ea\u30b7\u30fc\u306b\u5408\u308f\u306a\u3044\u5834\u5408\u306f\u3001\u30b0\u30e9\u30c7\u30fc\u30b7\u30e7\u30f3\u30d5\u30ed\u30fc\u3092\u30af\u30ea\u30c3\u30d4\u30f3\u30b0\u3057\u3066\u66f4\u65b0\u3057\u307e\u3059</p>\u3002\n<p><a href=\"https://nn.labml.ai/rl/ppo/experiment.html\">\u3053\u308c\u3092\u4f7f\u3063\u305f\u5b9f\u9a13\u306f\u3053\u3061\u3089\u304b\u3089\u3054\u89a7\u3044\u305f\u3060\u3051\u307e\u3059</a>\u3002\u3053\u306e\u5b9f\u9a13\u3067\u306f\u3001<a 
href=\"https://nn.labml.ai/rl/ppo/gae.html\">\u4e00\u822c\u5316\u30a2\u30c9\u30d0\u30f3\u30c6\u30fc\u30b8\u63a8\u5b9a\u3092\u4f7f\u7528\u3057\u3066\u3044\u307e\u3059</a></p>\u3002\n<p><a href=\"https://colab.research.google.com/github/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/rl/ppo/experiment.ipynb\"><span translate=no>_^_0_^_</span></a></p>\n",
|
||||
"Proximal Policy Optimization - PPO": "\u8fd1\u63a5\u30dd\u30ea\u30b7\u30fc\u6700\u9069\u5316-PPO"
|
||||
}
|
||||
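The README entry above states the property PPO relies on: one batch of sampled data supports several epochs of mini-batch updates because the clipped ratio keeps the updated policy close to the sampling policy. A hedged sketch of that outer loop, reusing the hypothetical `clipped_ppo_loss` from earlier; `samples` is assumed to be a dict of flattened tensors, and the value loss and entropy bonus are omitted for brevity:

import torch

def train_on_samples(model, optimizer, samples: dict,
                     epochs: int = 4, batch_size: int = 1024) -> None:
    """Several epochs of mini-batch updates on a single set of samples."""
    n = samples['obs'].shape[0]
    for _ in range(epochs):
        # Shuffle for each epoch
        indexes = torch.randperm(n)
        for start in range(0, n, batch_size):
            # Get mini batch
            idx = indexes[start:start + batch_size]
            mini = {k: v[idx] for k, v in samples.items()}
            # The value head would feed the omitted value-function loss
            pi, _value = model(mini['obs'])
            log_pi = pi.log_prob(mini['actions'])
            loss = clipped_ppo_loss(log_pi, mini['log_pis'],
                                    mini['advantages'])
            # Zero old gradients, back-propagate, then step the optimizer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The epochs argument here is the knob the experiment entries mention: more epochs learn faster but can become unstable, and shrinking the clipping range is the suggested remedy.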