Udacity DRL DQN
I. DQN Key Points
A. Key Details from the DQN Paper
- Pre-processing: convert the raw input frames into square images, so that training can be accelerated on the GPU.
- Frame Stacking: stack four consecutive frames into a single input, i.e. 84 × 84 × 4, so DQN can also capture temporal correlation.
- Frame-Skipping Technique: during training, an action is selected only once every 4 frames.
- Experience Replay:
    - data efficient
    - breaks the correlation between consecutive states, reducing overfitting of the Q function
    - under a given policy (or a narrow range of policy parameters), the transition tuples collected are concentrated and unevenly distributed; sampling tuples uniformly at random from the buffer evens out that distribution and reduces overfitting
- Fixed Q Target: keeps the target from moving along with the parameters being updated (the gradient must not flow through the target); otherwise training may fail to converge.
- Off-policy method: the transition tuples used for the update are not sampled by the current policy.
- Reward Clipping
- Error Clipping
- Target Networks Soft Update (a short sketch follows this list)
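A minimal sketch of the soft update mentioned in the last bullet, assuming PyTorch; the interpolation parameter `tau` and its default value are placeholders of mine, not from the paper:

```python
import torch.nn as nn

def soft_update(local_model: nn.Module, target_model: nn.Module, tau: float = 1e-3):
    """Blend the target network toward the local network:
    theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```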
B. Advanced DQN:
- Double Q Learning (a target-computation sketch follows this list)
- Dueling DQN
- Multi-step Return
- Rainbow
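As a hedged illustration of the Double Q-Learning idea (the online network selects the greedy action, the target network evaluates it), a sketch assuming PyTorch; the network and tensor names `q_online`, `q_target`, `rewards`, `next_states`, `dones` are placeholders:

```python
import torch

def double_dqn_target(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    # rewards and dones are expected as column vectors of shape (batch, 1).
    with torch.no_grad():
        # The online network picks the greedy action for each next state...
        best_actions = q_online(next_states).argmax(dim=1, keepdim=True)
        # ...but the target network evaluates the value of that action.
        next_q = q_target(next_states).gather(1, best_actions)
        return rewards + gamma * next_q * (1 - dones)
```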
II. Explanations of New, Important, and Difficult Points
3. Frame-Skipping Technique
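This section is empty in the source notes. As a hedged illustration, here is a minimal Gym-style wrapper that repeats the chosen action for `skip` frames and accumulates the reward, so a new action is chosen only every 4 frames; the class and parameter names are mine, and the classic 4-tuple `step` API is assumed:

```python
import gym

class FrameSkip(gym.Wrapper):
    """Repeat the agent's action for `skip` consecutive frames, summing the rewards."""
    def __init__(self, env, skip=4):
        super().__init__(env)
        self.skip = skip

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```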

4. Experience Replay
Advantages:
- data efficient
- breaks the correlation between consecutive states, reducing overfitting of the Q function
- under a given policy (or a narrow range of policy parameters), the transition tuples collected are concentrated and unevenly distributed; sampling tuples uniformly at random from the buffer evens out that distribution and reduces overfitting (a replay-buffer sketch follows this list)
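A minimal replay-buffer sketch matching the description above (fixed-size buffer, uniform random sampling); the `capacity` and `batch_size` values are illustrative choices, not values from the paper:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest tuples are discarded automatically

    def add(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the temporal correlation of consecutive tuples.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```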
5. Fixed Q Target
Experience replay helps us address one type of correlation: the one between consecutive experience tuples. There is another kind of correlation that Q-learning is susceptible to.
Q-learning is a form of Temporal Difference or TD learning, right? Now, the TD target here is supposed to be a replacement for the true Q function (Q pi), which is unknown. We originally used Q pi to define a squared error loss, and differentiated that with respect to w to get our gradient descent update rule. Now, Q pi is not dependent on our function approximation or its parameters, thus resulting in a simple derivative and update rule. But our TD target is dependent on these parameters, which means simply replacing Q pi with a target like this is mathematically incorrect.
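Written out (with $\hat q(S, A, w)$ for the function approximator; the notation is assumed here rather than quoted from the lesson), the TD-based update is

$$
\Delta w = \alpha \Big( \underbrace{R + \gamma \max_{a} \hat q(S', a, w)}_{\text{TD target}} - \hat q(S, A, w) \Big)\, \nabla_w \hat q(S, A, w),
$$

and the problem is that the TD target itself depends on the same parameters $w$ being updated. The fix described below is to replace $w$ inside the target with a frozen copy $w^-$.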

- Moving Target: it can affect learning significantly when we use function approximation, where all the Q-values are intrinsically tied together through the function parameters. (It does no harm to a Q-table representation, since every Q-value is stored separately.)
You may be thinking, "Doesn't experience replay take care of this problem?" Well, it addresses a similar but slightly different issue. There we broke the correlation effects between consecutive experience tuples by sampling them randomly, out of order. Here, the correlation is between the target and the parameters we are changing. (The target is moving!)
Solution: maintain a separate target network and update it only periodically, decoupling the target from the parameters being trained.
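A minimal sketch of that solution, assuming PyTorch; the placeholder network and the `update_every` step count are illustrative, not values from the paper:

```python
import copy
import torch.nn as nn

q_online = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder Q-network
q_target = copy.deepcopy(q_online)  # frozen copy used only for computing TD targets

def maybe_sync_target(step: int, update_every: int = 10_000):
    # Hard update: copy the online weights into the target network every `update_every`
    # steps; in between, the target stays fixed and does not move with each gradient step.
    if step % update_every == 0:
        q_target.load_state_dict(q_online.state_dict())
```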


6. Off-Policy Method
SAMPLE: collect transition tuples and store them in the buffer.
LEARN: randomly sample tuples from the buffer to update the Q function.
Because these two steps do not directly depend on each other, the method is off-policy.
In the LEARN step, we select a small batch of tuples from this memory, randomly, and learn from that batch using a gradient descent update step. These two processes are not directly dependent on each other. So, you could perform multiple sampling steps and then one learning step, or even multiple learning steps with different random batches.
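A schematic sketch of how the two processes can be interleaved. It assumes an `env` with the classic 4-tuple Gym API, an `act()` policy function, a `learn_step()` update function, a `total_steps` budget, and the `ReplayBuffer` from the sketch in Section II.4; all of these names are placeholders of mine, not from the lesson:

```python
buffer = ReplayBuffer()
state = env.reset()

for step in range(total_steps):
    # SAMPLE: interact with the environment and store the transition tuple.
    action = act(state)                      # e.g. epsilon-greedy w.r.t. the online Q-network
    next_state, reward, done, _ = env.step(action)
    buffer.add(state, action, reward, next_state, done)
    state = env.reset() if done else next_state

    # LEARN: decoupled from sampling; here we learn once every 4 sampling steps,
    # but several learning steps on different random batches would be equally valid.
    if step % 4 == 0 and len(buffer) >= 64:
        learn_step(buffer.sample(batch_size=64))
```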

7. Reward Clipping
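This section is empty in the source notes. For reference, the Nature DQN paper clips all positive rewards to +1 and all negative rewards to −1 (zero stays 0), so that a single learning rate works across games with very different score scales; the cost is that the agent can no longer distinguish reward magnitudes. A one-line sketch (the raw value is illustrative):

```python
import numpy as np

raw_reward = 200.0                           # e.g. an Atari score increment (illustrative value)
clipped_reward = float(np.sign(raw_reward))  # +1.0; negative rewards become -1.0, zero stays 0.0
```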

8. Error Clipping
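Also empty in the source notes. In the Nature DQN paper, the TD-error term is clipped to [−1, 1], which is equivalent to using a squared loss inside that interval and an absolute loss outside it, i.e. the Huber / smooth-L1 loss. A minimal PyTorch sketch with illustrative tensors:

```python
import torch
import torch.nn as nn

# Clipping the TD error to [-1, 1] corresponds to the Huber / smooth-L1 loss:
# quadratic for small errors, linear for large ones.
huber = nn.SmoothL1Loss()
td_targets = torch.tensor([1.0, -3.0, 0.5])   # illustrative values
q_estimates = torch.tensor([0.2, 2.0, 0.4])
loss = huber(q_estimates, td_targets)
```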

III. Advanced DQN
3. Dueling DQN
- Change: compared with the original DQN, Dueling DQN only adds one extra layer of structure: a state-value stream V and an advantage stream A. V corresponds to the mean of the Q-values, and A = Q − V.
- Benefit: better generalization and more efficient use of samples.
- Example: suppose that after an update the first two Q-values in the second column both increase by 1. We then only need to increase V by 1 and leave A unchanged; as a result, the third Q-value in that column also increases by 1, so two samples have implicitly updated three values. (A network sketch follows this list.)
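A minimal dueling-head sketch consistent with the description above; the layer sizes are placeholders. Subtracting the mean advantage makes the decomposition identifiable, so V plays the role of the average Q-value and A of the residual Q − V:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim=4, action_dim=2, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s)
        self.advantage = nn.Linear(hidden, action_dim)  # A(s, a)

    def forward(self, x):
        h = self.feature(x)
        v = self.value(h)
        a = self.advantage(h)
        # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
        return v + a - a.mean(dim=1, keepdim=True)
```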