
Paper notes: Deep Reinforcement Learning for Traffic Light Control in Vehicular Networks


Reinforcement learning paper notes

Paper title: Deep Reinforcement Learning for Traffic Light Control in Vehicular Networks --- Year: March 2018

Main content:

The literature review is given in Section II. The model and problem statement are presented in Section III. Section IV introduces the background of reinforcement learning, and Section V describes the details of building the reinforcement learning model for the traffic light control system in vehicular networks. Section VI extends the reinforcement learning model to a deep learning model to handle the complex states in our system. The evaluation of the model is given in Section VII. Finally, the paper concludes in Section VIII.

Introduction:

Literature review:

Model and problem statement:

Traffic light model

Background on reinforcement learning:

Reinforcement learning model:

States:

Fig. 2(a)

Actions:

\[T_{yellow}=\frac{v_{max}}{a_{dec}} \tag {3} \]
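As a quick illustration of eq. (3), here is a minimal Python sketch of the yellow-light duration, i.e., the time a vehicle at maximum speed needs to stop at the given deceleration rate. The values of `v_max` and `a_dec` are assumed for illustration, not taken from the paper:

```python
# Yellow-light duration from eq. (3): time for a vehicle travelling at v_max
# to come to a full stop when braking at a_dec.
v_max = 15.0   # maximum vehicle speed in m/s (assumed value)
a_dec = 3.0    # deceleration rate in m/s^2 (assumed value)

t_yellow = v_max / a_dec
print(f"yellow phase duration: {t_yellow:.1f} s")  # -> 5.0 s
```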

Rewards:

\[r_t=W_t-W_{t+1} \tag {4} \]

\[W_t= \sum_{i_t=1}^{N_t}w_{i_t,t} \tag {5} \]
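A minimal sketch of the reward in eqs. (4)-(5): sum each vehicle's waiting time at the two observation instants and take the difference, so the reward is positive when the total waiting time decreases. The vehicle IDs and waiting times below are made-up example values:

```python
def total_waiting_time(waiting_times):
    """W_t in eq. (5): sum of the waiting times of all N_t vehicles at time t."""
    return sum(waiting_times.values())

# Hypothetical per-vehicle waiting times (seconds) before and after an action.
w_t  = {"veh0": 12.0, "veh1": 4.0, "veh2": 0.0}
w_t1 = {"veh0": 8.0,  "veh1": 0.0, "veh2": 3.0, "veh3": 1.0}

reward = total_waiting_time(w_t) - total_waiting_time(w_t1)  # eq. (4)
print(reward)  # 4.0 -> total waiting time dropped, so the reward is positive
```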

Double dueling deep Q network:

Convolutional Neural Network:

\[f(x)= \begin{cases} x, & \text{if } x>0 \\ \beta x, & \text{if } x \leq 0 \end{cases} \tag {6} \]
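Eq. (6) is the leaky ReLU activation applied after the convolutional layers. A small NumPy sketch, with the slope `beta` chosen as an assumed example value:

```python
import numpy as np

def leaky_relu(x, beta=0.01):
    """Eq. (6): identity for positive inputs, small slope beta otherwise."""
    return np.where(x > 0, x, beta * x)

print(leaky_relu(np.array([-2.0, 0.5, 3.0])))  # negative inputs are scaled by beta
```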

Dueling DQN

\[Q(s,a;\theta)=V(s;\theta)+(A(s,a;\theta)-\frac{1}{|A|}\sum_{a'}A(s,a';\theta)) \tag{7} \]
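A sketch of the dueling aggregation in eq. (7): the value stream V(s) and the advantage stream A(s, a) are combined after subtracting the mean advantage. The stream outputs below are arbitrary numbers standing in for the network's outputs:

```python
import numpy as np

def dueling_q(value, advantages):
    """Eq. (7): Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a'))."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Hypothetical stream outputs for one state and two actions
# (keep the current phase vs. switch the phase).
print(dueling_q(value=1.5, advantages=[0.2, -0.4]))  # [1.8 1.2]
```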

Target Network

\[J=\sum_s P(s)[Q_{target}(s,a)-Q(s,a;\theta)]^2 \tag{8} \]

\[\theta^-=\alpha\theta^-+(1-\alpha)\theta \tag{9} \]
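A minimal sketch of the soft target-network update in eq. (9): each target parameter slowly tracks the corresponding primary parameter, with \(\alpha\) close to 1 (the value below is an assumption, not the paper's setting):

```python
import numpy as np

def soft_update(theta_target, theta_primary, alpha=0.999):
    """Eq. (9): theta^- <- alpha * theta^- + (1 - alpha) * theta."""
    return alpha * theta_target + (1.0 - alpha) * theta_primary

theta  = np.array([0.8, -0.3])   # primary-network parameters (toy values)
theta_ = np.array([0.0,  0.0])   # target-network parameters
theta_ = soft_update(theta_, theta)
print(theta_)  # [ 0.0008 -0.0003] -> the target drifts slowly toward the primary
```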

Double DQN

\[Q_{target}(s,a)=r+\gamma Q(s',argmax_{a'}(Q(s',a';\theta)),\theta^-) \tag{10} \]
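A sketch of the double-DQN target in eq. (10): the primary network (\(\theta\)) picks the greedy next action, while the target network (\(\theta^-\)) evaluates it, which reduces over-estimation. The Q vectors below are placeholder numbers:

```python
import numpy as np

def double_dqn_target(r, q_next_primary, q_next_target, gamma=0.99):
    """Eq. (10): r + gamma * Q(s', argmax_a' Q(s',a';theta), theta^-)."""
    a_star = int(np.argmax(q_next_primary))   # action chosen by the primary net
    return r + gamma * q_next_target[a_star]  # value taken from the target net

q_next_primary = np.array([1.2, 0.7])  # hypothetical Q(s', .; theta)
q_next_target  = np.array([0.9, 1.1])  # hypothetical Q(s', .; theta^-)
print(double_dqn_target(r=4.0, q_next_primary=q_next_primary,
                        q_next_target=q_next_target))  # 4.0 + 0.99*0.9 = 4.891
```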

Fig. 5, some of my own understanding:

The primary CNN is mainly used to update the \(Q(s,a;\theta)\) values.

The target CNN is mainly used in updating the network parameters (it provides the target values \(Q_{target}\)).

1》Input the current state and the candidate actions; through the primary CNN, obtain an updated \(Q(s,a;\theta)\) value for each candidate action.

2》Select the action with the largest \(Q(s,a;\theta)\) value.

3》Observe the next state \(s'\) and the reward \(r\).

4》Store the four-tuple \(<s,a,r,s'>\) in the experience replay memory.

5》Select a mini-batch of data according to the sampling priorities.

6》One primary CNN takes the \(\theta\) updated by the mean-squared-error (MSE) loss as input to update \(\theta^-\), and uses the current s, a, \(\theta\) to obtain the current \(Q(s,a;\theta)\) value.

7》Another primary CNN takes the next state s' as input to obtain the next action a'.

8》The target CNN takes the above s', a', \(\theta^-\) as input to obtain the \(Q(s',a';\theta^-)\) value.

9》Combine the r from the mini-batch with the above \(Q(s',a';\theta^-)\) to obtain \(Q_{target}(s,a)\).

10》Use the MSE loss MSE(\(Q(s,a;\theta)\), \(Q_{target}(s,a)\)) to update the network parameters \(\theta\) (a simplified sketch of one such update follows this list).
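Pulling the steps above together, here is a heavily simplified sketch of one training update. The two "CNNs" are stand-in linear maps (`q_primary` and `q_target` are placeholder names I made up), and the replay memory, prioritized sampling, and the traffic simulator are all omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))          # placeholder "primary CNN": Q(s, .) = W @ s
W_target = W.copy()                  # placeholder "target CNN"
gamma, alpha, lr = 0.99, 0.999, 1e-2

def q_primary(s):  return W @ s
def q_target(s):   return W_target @ s

# One transition <s, a, r, s'> as it would come out of the replay memory.
s, a, r, s_next = rng.normal(size=4), 1, 4.0, rng.normal(size=4)

# Steps 6-9: build the double-DQN target with the target CNN.
a_star   = int(np.argmax(q_primary(s_next)))
y_target = r + gamma * q_target(s_next)[a_star]

# Step 10: MSE loss between Q(s,a;theta) and the target, gradient step on theta.
td_error = q_primary(s)[a] - y_target
W[a] -= lr * 2.0 * td_error * s          # d(MSE)/dW for the chosen action's row

# Soft update of the target parameters, as in eq. (9).
W_target = alpha * W_target + (1.0 - alpha) * W
print(abs(td_error))
```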

Prioritized Experience Replay

\[\delta_i=|Q(s,a;\theta)_i-Q_{target}(s,a)_i| \tag{11} \]

\[P_i=\frac{p_i^\tau}{\sum_k p_k^\tau} \tag{12} \]
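A sketch of the prioritized sampling in eqs. (11)-(12): the absolute TD error of each stored transition becomes its priority \(p_i\), and its sampling probability is the priority raised to \(\tau\), normalized over the memory. The TD errors, \(\tau\), and the small offset added to avoid zero priorities are assumed example values:

```python
import numpy as np

def sampling_probabilities(deltas, tau=0.6):
    """Eqs. (11)-(12): P_i = p_i^tau / sum_k p_k^tau, with p_i = |TD error_i|."""
    p = np.abs(np.asarray(deltas, dtype=float)) + 1e-6  # avoid zero priority
    p = p ** tau
    return p / p.sum()

deltas = [0.1, 2.0, 0.5, 0.0]          # hypothetical TD errors in the memory
probs = sampling_probabilities(deltas)
batch = np.random.default_rng(0).choice(len(deltas), size=2,
                                        replace=False, p=probs)
print(probs.round(3), batch)           # transitions with large errors are favored
```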

Optimization

\[g=\nabla_\theta J(\theta) \tag{13} \]

\[s=\rho_s s+(1-\rho_s)g\\ r=\rho_r r+(1-\rho_r)g\odot g \tag{14} \]

\[\hat s=\frac{s}{1-\rho_s^t}\\ \hat r=\frac{r}{1-\rho_r^t} \tag{15} \]

\[\theta=\theta+\Delta\theta\\ =\theta+(-\epsilon_r\frac{\hat s}{\sqrt{\hat r+\delta}}) \tag{16} \]
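A compact sketch of one Adam update following eqs. (13)-(16): the first and second moments of the gradient are accumulated, bias-corrected, and used to scale the step. The hyperparameter values below are typical defaults, not the paper's settings:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, rho_s=0.9, rho_r=0.999, delta=1e-8):
    """One Adam update per eqs. (14)-(16); `state` carries (s, r, step count t)."""
    s, r, t = state
    t += 1
    s = rho_s * s + (1 - rho_s) * grad            # eq. (14), first moment
    r = rho_r * r + (1 - rho_r) * grad * grad     # eq. (14), second moment
    s_hat = s / (1 - rho_s ** t)                  # eq. (15), bias correction
    r_hat = r / (1 - rho_r ** t)
    theta = theta - lr * s_hat / np.sqrt(r_hat + delta)   # eq. (16)
    return theta, (s, r, t)

theta = np.array([1.0, -2.0])
state = (np.zeros(2), np.zeros(2), 0)
theta, state = adam_step(theta, grad=np.array([0.5, -1.0]), state=state)
print(theta)  # each parameter moves by roughly the learning rate on the first step
```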

Overall Architecture

Algorithm 1 Dueling Double Deep Q Network with Prioritized Experience Replay Algorithm on a Traffic Light:

Input: replay memory size \(M\), minibatch size \(B\), greedy \(\epsilon\), pre-train steps \(t_p\), target network update rate \(\alpha\), discount factor \(\gamma\).
Notations:
\(θ\): the parameters in the primary neural network.
\(θ^−\): the parameters in the target neural network.
\(m\): the replay memory.
\(i\): step number.

Initialize parameters \(θ\), \(θ^−\) with random values.
Initialize \(m\) to be empty and \(i\) to be zero.
Initialize \(s\) with the starting scenario at the intersection.
while there exists a state \(s\) do
Choose an action \(a\) according to the \(\epsilon\) greedy.
Take action \(a\) and observe reward \(r\) and new state \(s′\).
if the size of the memory \(m\) > \(M\) then
Remove the oldest experiences in the memory.
end if
Add the four-tuple \(<s, a, r, s′>\) into \(m\).
Assign \(s′\) to \(s\): \(s ← s′\).
\(i ← i + 1\).
if \(|m| > B\) and \(i > t_p\) then
Select \(B\) samples from \(m\) based on the sampling priorities.
Calculate the loss \(J\):
\(J =\sum_s \frac{1}{B}[r + γQ(s′,argmax_{a′}(Q(s′, a′;θ)), θ^−)−Q(s, a;θ)]^2.\)
Update \(θ\) with \(∇J\) using Adam back propagation.
Update \(θ^−\)with \(θ\):
\(θ^−= αθ^−+ (1 − α)θ\).
Update every experience’s sampling priority based on \(δ\).
Update the value of \(\epsilon\).
end if
end while

Evaluation:

Evaluation Methodology and Parameters

The main purpose of our simulations is:

Experimental Results

Fig. 7

Conclusion:

Source: https://www.cnblogs.com/52dxer/p/14211671.html