Udacity DRL DQN

I. DQN Key Points

A. Key details from the DQN paper

  1. Pre-processing: convert the raw frames into square images so that training can be GPU-accelerated.
  2. Frame Stacking: stack four consecutive frames into a single 84 × 84 × 4 input, so DQN can also capture temporal correlation.
  3. Frame-Skipping Technique: during training, an action is selected only once every 4 frames.
  4. Experience Replay:
    • data efficient
    • breaks the correlation between consecutive states, reducing overfitting of the Q function.
    • under a given policy (or a narrow range of policy parameters), the transition tuples collected are concentrated and unevenly distributed; randomly sampling tuples from the buffer evens out the distribution and reduces overfitting.
  5. Fixed Q Target: keeps the target fixed (no gradient flows through it); otherwise training does not converge.
  6. Off-policy method: the transition tuples used to update the policy were not necessarily sampled by the current policy.
  7. Reward Clipping
  8. Error Clipping
  9. Target Networks Soft Update (see the sketch after this list)
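For point 9, a minimal sketch of what a soft update could look like, assuming PyTorch-style networks; the function name and the `tau` value are illustrative, not from the original notes.

```python
import torch.nn as nn

def soft_update(target_net: nn.Module, online_net: nn.Module, tau: float = 1e-3):
    """Blend online weights into the target: w_target <- tau * w_online + (1 - tau) * w_target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(tau * o_param.data + (1.0 - tau) * t_param.data)
```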

B. Advanced DQN:

  1. Double Q Learning
  2. Dueling DQN
  3. Multi-step Return
  4. Rainbow

II. Explanation of New, Important, and Difficult Points

3. Frame-Skipping Technique

4. Experience Replay

Advantages (see the replay-buffer sketch after this list):

  1. data efficient
  2. breaks the correlation between consecutive states, reducing overfitting of the Q function.
  3. under a given policy (or a narrow range of policy parameters), the transition tuples collected are concentrated and unevenly distributed; randomly sampling tuples from the buffer evens out the distribution and reduces overfitting.
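A minimal replay-buffer sketch of this idea, written in Python; the class and method names are assumptions for illustration, not code from the original notes.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Stores transition tuples and returns random (uncorrelated) mini-batches."""
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)  # oldest tuples are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        # Uniform random sampling breaks the temporal correlation of consecutive tuples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```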

5. Fixed Q Target

Experience replay helps us address one type of correlation: the correlation between consecutive experience tuples. There is another kind of correlation that Q-learning is susceptible to.

Q-learning is a form of Temporal Difference or TD learning, right? Now, the TD target here is supposed to be a replacement for the true Q function (Q pi), which is unknown. We originally used Q pi to define a squared error loss and differentiated that with respect to w to get our gradient descent update rule. Now, Q pi does not depend on our function approximator or its parameters, which results in a simple derivative and update rule. But our TD target does depend on these parameters, which means simply replacing Q pi with a target like this is mathematically incorrect.
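In symbols, a sketch using the standard function-approximation notation $\hat{q}(S,A,w)$ (the exact symbols are assumed here, not taken from the original notes):

$$
J(w) = \mathbb{E}\big[\big(q_\pi(S,A) - \hat{q}(S,A,w)\big)^2\big],
\qquad
\Delta w = \alpha\,\big(q_\pi(S,A) - \hat{q}(S,A,w)\big)\,\nabla_w \hat{q}(S,A,w).
$$

The TD target replaces the unknown $q_\pi(S,A)$ with $R + \gamma \max_a \hat{q}(S',a,w)$, which itself depends on $w$; that is the circular dependence described above.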

You may be thinking, "Doesn't experience replay take care of this problem?" Well, it addresses a similar but slightly different issue. There we broke the correlation between consecutive experience tuples by sampling them randomly, out of order. Here, the correlation is between the target and the parameters we are changing. (The target is moving!)

Solution: maintain a separate target network whose weights are updated only periodically, decoupling the target from the parameters being learned.
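A minimal sketch of how the fixed target enters the loss, assuming PyTorch and a batch already sampled from the buffer; the network and variable names are illustrative assumptions, not from the original notes.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) from the online network for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target computed with the *fixed* target network; no gradient flows through it.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * next_q * (1 - dones)

    return F.mse_loss(q_values, td_target)
```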

6. Off-Policy Method

SAMPLE: collect transition tuples and store them in the buffer.
LEARN: randomly sample tuples from the buffer and use them to update the Q function.

Because these two steps do not directly depend on each other, the method is off-policy.

In the LEARN step, we select a small batch of tuples from this memory, randomly, and learn from that batch using a gradient descent update step. These two processes are not directly dependent on each other. So, you could perform multiple sampling steps and then one learning step, or even multiple learning steps with different random batches.
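A sketch of this decoupled loop, reusing the hypothetical `ReplayBuffer`, `dqn_loss`, and `soft_update` from the earlier sketches; the environment API (classic Gym-style `reset`/`step`), `select_action`, and `batch_to_tensors` helpers are assumed for illustration.

```python
def train(env, online_net, target_net, optimizer, select_action, batch_to_tensors,
          total_steps=50_000, update_every=4, batch_size=64):
    """Decoupled SAMPLE / LEARN loop: storing tuples and learning from them are separate phases."""
    buffer = ReplayBuffer()
    state = env.reset()
    for step in range(1, total_steps + 1):
        # SAMPLE: interact with the environment and store the transition tuple.
        action = select_action(online_net, state)            # e.g. epsilon-greedy
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # LEARN: every `update_every` steps, learn from one random batch.
        # Several sampling steps per learning step, or several learning steps on
        # different batches, are equally valid because the phases are decoupled.
        if step % update_every == 0 and len(buffer) >= batch_size:
            loss = dqn_loss(online_net, target_net, *batch_to_tensors(buffer.sample(batch_size)))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            soft_update(target_net, online_net)               # or a periodic hard copy
```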

7. Reward Clipping

8. Error Clipping

III. Advanced DQN

3. Dueling DQN
