首页 > 其他分享> > 强化学习基础：蒙特卡罗和时序差分

强化学习基础：蒙特卡罗和时序差分

2019-06-27 20:52:03 作者：互联网

这篇文章承接文章强化学习基础：基本概念和动态规划，介绍另外两种解决强化学习问题的方法

求解方法：Monte Carlo

问题一（左图）：estimate the state-value function $v_{\pi}$ corresponding to a policy $\pi$
- First-visit MC estimates $v_{\pi}(s)$ as the average of the returns following only first visits to $s$ (ignores returns that are associated to later visits)
- Every-visit MC estimates $v_{\pi}(s)$ as the average of the returns following all visits to $s$
问题二（右图）：estimate the action-value function $q_{\pi}$ corresponding to a policy $\pi$
- First-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following only first visits to $s,a$
- Every-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following all visits to $s,a$

问题三（左图）：get the optimal policy $\pi_*$
- relationship between the mean and individual return: $\bar{Q}_k=\frac{\sum_{i=1}^kG_i}{k}=\bar{Q}_{k-1}+\frac{1}{k}(G_k-\bar{Q}_{k-1})$
- $\epsilon$-greedy: Exploration vs Exploitation
  - with probability $1-\epsilon$, select the greedy action ${\pi}(s)=\arg \max _{a \in \mathcal{A}(s)} Q(s, a)$ (Exploitation)
  - with probability $\epsilon$, select an action (uniformly) at random ${\pi}(a|s)=\frac{1}{|\mathcal{A}(s)|}$ (Exploration)　　
问题四（右图）：modify the algorithm to put more weights to the most recent returns

求解方法：Temporal Difference

Monte Carlo (MC) prediction methods must wait until the end of an episode to update the value function estimate, temporal-difference (TD) methods update the value function after every time step.

问题一（左图）：estimate the state-value function $v_{\pi}$ (the estimation of $q_{\pi}$ is similar)
问题二（右图）：get the optimal action value function $q_*$
- On policy: the agent interact with the environment by following the same policy $\pi$ that it seeks to evaluate (or improve)
- Sarsa(0) is an on-policy method

问题三：modified algorithm to get the optimal action value function $q_*$
- Off poliy: the agent interact with the environment by following a policy $b$ that is different from the policy $\pi$ that it seeks to evaluate (or improve)
- Sarsamax(i.e., Q-learning) is an off-policy method

问题四：another modified algorithm to get the optimal action value function $q_*$
- Expected Sarsa is an on-policy method
- $\pi(a|S_{t+1})$ is derived from $Q$ (e.g., $\epsilon$-greedy)

标签：function,差分,时序,action,returns,value,蒙特卡罗,policy,pi
来源： https://www.cnblogs.com/sunwq06/p/11084512.html