[RL 13] VDN (201706)

2021-01-17 13:00:09 Author: Internet

paper: Value-Decomposition Networks For Cooperative Multi-Agent Learning

Background

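cooperative setting (all agents receive the same reward)
the centralized MARL approach has shortcomings
- lazy agents may emerge: a lazy agent's exploration may lower the team reward, so it is discouraged from learning
independent learning has shortcomings
- non-stationary environment (the other agents keep changing their policies)
- spurious reward signals (part of the team reward is caused by teammates' actions)
Dec-POMDP framework
1. each agent acts with its local Qi(hi, ai)
Question: how can Qi be learned? The paper proposes a method.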

3 A Deep-RL Architecture for Coop-MARL

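Assume the following relationship between Qi and Qtot (from the source paper):
$$Q_{tot}(h, a) \approx \sum_{i=1}^{n} Q_i(h_i, a_i)$$
Then Qi can be trained with the usual Q-learning TD loss applied to Qtot:
$$L = \left( r + \gamma \max_{a'} Q_{tot}(h', a') - Q_{tot}(h, a) \right)^2$$
where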

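Qtot is the sum of the n values Qi(hi, ai); hi is built from the sequence stored in the buffer, and ai also comes from the buffer
max Qtot is the sum of the n per-agent maxima max Qi(hi', ai'); h' also comes from the buffer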
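
The two quantities above can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation: the function names, the list-of-lists layout for the per-agent Q-values, and the `gamma` and `done` parameters are assumptions made for the example, and in a DQN-style setup the next-step values would normally come from a target network.

```python
from typing import Sequence


def q_tot(per_agent_q: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Sum each agent's Q-value for the action it actually took (from the buffer)."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))


def max_q_tot(per_agent_q_next: Sequence[Sequence[float]]) -> float:
    """Sum of per-agent maxima; equals the max of Qtot over joint actions."""
    return sum(max(q) for q in per_agent_q_next)


def vdn_td_loss(per_agent_q, actions, reward, per_agent_q_next,
                gamma: float = 0.99, done: bool = False) -> float:
    """Squared TD error on Qtot for one transition (hypothetical helper)."""
    bootstrap = 0.0 if done else gamma * max_q_tot(per_agent_q_next)
    td_error = reward + bootstrap - q_tot(per_agent_q, actions)
    return td_error ** 2


# Two agents with two actions each; all values are made up for illustration.
q_now = [[1.0, 2.0], [0.5, 0.0]]    # Qi(hi, .) for each agent
q_next = [[0.0, 1.0], [2.0, 3.0]]   # Qi(hi', .) for each agent
loss = vdn_td_loss(q_now, actions=[1, 0], reward=1.0,
                   per_agent_q_next=q_next, gamma=0.5)
```

Because the loss is defined on the sum, its gradient flows back into every Qi, which is how the individual networks get trained from the shared team reward.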

PS:

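when interacting with the environment, each agent only needs its own Qi
as the number of agents n grows, the cost of the max operation grows only linearly (the max iterates over the agents instead of over the joint action space |A|^n)
weight sharing among the Qi: the network takes the agent's one-hot ID as an input
the LSTM takes only the observation as input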
communication layer
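
The claim about the linear cost of the max can be checked on a toy example. The sketch below is illustrative only: the function names and the Q-values are assumptions, and the brute-force search is included just to confirm that per-agent greedy selection finds the same joint action when Qtot is a sum of the Qi.

```python
from itertools import product
from typing import Sequence, Tuple


def greedy_joint_action(per_agent_q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Each agent picks its own argmax; cost is linear in the number of agents."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def brute_force_joint_action(per_agent_q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Enumerate every joint action and maximize the summed Q-values."""
    joint_actions = product(*(range(len(q)) for q in per_agent_q))
    return max(
        joint_actions,
        key=lambda joint: sum(q[a] for q, a in zip(per_agent_q, joint)),
    )


# Three agents with three actions each; values are made up for illustration.
per_agent_q = [
    [0.1, 0.7, 0.2],
    [1.0, -1.0, 0.5],
    [0.0, 0.3, 0.9],
]
decentralized = greedy_joint_action(per_agent_q)
centralized = brute_force_joint_action(per_agent_q)
```

With n agents and |A| actions each, the greedy version inspects n * |A| values, while the brute-force version inspects |A|^n joint actions.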

Limitations

the assumed relationship between Qtot and Qi is too restrictive
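
To make this limitation concrete, here is a small sketch for the two-agent case. It is not taken from the paper: the helper `is_additive` and both payoff tables are hypothetical. The check relies on the fact that a table of the form Q(a1, a2) = Q1(a1) + Q2(a2) must satisfy Q(i, j) - Q(i, 0) - Q(0, j) + Q(0, 0) = 0 for every entry.

```python
from typing import Sequence


def is_additive(table: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """True if table[i][j] can be written exactly as q1[i] + q2[j]."""
    base = table[0][0]
    return all(
        abs(table[i][j] - table[i][0] - table[0][j] + base) <= tol
        for i in range(len(table))
        for j in range(len(table[0]))
    )


# Decomposable: q1 = [0, 2], q2 = [1, 2] reproduces every entry.
additive_game = [[1.0, 2.0],
                 [3.0, 4.0]]

# Coordination game: reward only when both agents pick the same action.
coordination_game = [[1.0, 0.0],
                     [0.0, 1.0]]

additive_ok = is_additive(additive_game)          # True
coordination_ok = is_additive(coordination_game)  # False
```

In the coordination game the value of one agent's action depends entirely on what the other agent does, so no pair of per-agent value functions can sum to the joint table exactly.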

Source: https://blog.csdn.net/xyp99/article/details/112735729