
RL(Chapter 7): n-step Bootstrapping (n步自举法)


This post is a set of reinforcement learning notes, based mainly on Sutton & Barto, Reinforcement Learning: An Introduction (2nd edition), Chapter 7.


$n$-step Bootstrapping

$n$-step TD Prediction

[Figure 7.1: the backup diagrams of $n$-step methods, forming a spectrum from one-step TD methods to Monte Carlo methods.]
Consider the update of the estimated value of state $S_t$ as a result of the state–reward sequence $S_t, R_{t+1}, S_{t+1}, R_{t+2}, \ldots, R_T, S_T$ (omitting the actions).

The subscripts on $G_{t:t+1}$ indicate that it is a truncated return for time $t$ using rewards up until time $t+1$, with the discounted estimate $\gamma V_t(S_{t+1})$ taking the place of the other terms $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$ of the full return $G_t$.

All $n$-step returns can be considered approximations to the full return, truncated after $n$ steps and then corrected for the remaining missing terms by $V_{t+n-1}(S_{t+n})$. If $t+n \ge T$ (if the $n$-step return extends to or beyond termination), then the $n$-step return is defined to be equal to the ordinary full return ($G_{t:t+n} = G_t$ if $t+n \ge T$).
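For reference, the general $n$-step return and the corresponding state-value learning rule from the book (equations 7.1 and 7.2) are

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n}), \qquad n \ge 1,\ 0 \le t < T-n, \tag{7.1}$$

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \bigl[G_{t:t+n} - V_{t+n-1}(S_t)\bigr], \qquad 0 \le t < T, \tag{7.2}$$

while the values of all other states remain unchanged.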

Note that no changes at all are made during the first $n-1$ steps of each episode. To make up for that, an equal number of additional updates are made at the end of the episode, after termination and before starting the next episode.

[Pseudocode: $n$-step TD for estimating $V \approx v_\pi$.]
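As a concrete illustration of the algorithm above, here is a minimal Python sketch of tabular $n$-step TD prediction built around (7.1) and (7.2). The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`) and the `policy(state)` callable are assumptions for illustration, not part of the book.

```python
from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, alpha=0.1, gamma=1.0, num_episodes=1000):
    """Tabular n-step TD for estimating V ~ v_pi (a sketch under the assumed interface)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # index 0 of rewards is a dummy
        T, t = float('inf'), 0
        while True:
            if t < T:
                next_state, reward, done = env.step(policy(states[t]))
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated
            if tau >= 0:
                # n-step return G_{tau:tau+n} (7.1), truncated at termination
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])   # update (7.2)
            if tau == T - 1:
                break
            t += 1
    return V
```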
Exercise 7.1
In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don't change from step to step. Show that the $n$-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don't change), generalizing the earlier result.
ANSWER
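A sketch of the argument, assuming the value estimates are held fixed at $V$ and writing $\delta_k = R_{k+1} + \gamma V(S_{k+1}) - V(S_k)$: the discounted sum of TD errors telescopes,

$$\begin{aligned} \sum_{k=t}^{t+n-1} \gamma^{k-t} \delta_k &= \sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1} + \sum_{k=t}^{t+n-1} \bigl(\gamma^{k-t+1} V(S_{k+1}) - \gamma^{k-t} V(S_k)\bigr) \\ &= \sum_{k=t}^{t+n-1} \gamma^{k-t} R_{k+1} + \gamma^{n} V(S_{t+n}) - V(S_t) \;=\; G_{t:t+n} - V(S_t), \end{aligned}$$

so the $n$-step error is $G_{t:t+n} - V(S_t) = \sum_{k=t}^{t+n-1} \gamma^{k-t}\delta_k$. When $t+n \ge T$ the sum runs only to $T-1$ and $V(S_T) = 0$, which recovers the Monte Carlo result (6.6).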


The error reduction property of $n$-step returns:

$$\max_s \Bigl| \mathbb{E}_\pi\bigl[G_{t:t+n} \mid S_t = s\bigr] - v_\pi(s) \Bigr| \;\le\; \gamma^n \max_s \bigl| V_{t+n-1}(s) - v_\pi(s) \bigr|, \qquad \text{for all } n \ge 1. \tag{7.3}$$

$n$-step Sarsa

The $n$-step version of Sarsa we call $n$-step Sarsa, and the original version presented in the previous chapter we henceforth call one-step Sarsa, or Sarsa(0).

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}), \qquad n \ge 1,\ 0 \le t < T-n, \tag{7.4}$$

with $G_{t:t+n} = G_t$ if $t+n \ge T$.

The corresponding $n$-step Sarsa update is

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr], \qquad 0 \le t < T, \tag{7.5}$$

while the values of all other state–action pairs remain unchanged.

[Figure 7.3: backup diagrams for the spectrum of $n$-step Sarsa methods, from one-step Sarsa up to Monte Carlo, with $n$-step Expected Sarsa on the far right.]
[Pseudocode: $n$-step Sarsa for estimating $Q \approx q_*$ or $q_\pi$.]
[Figure 7.4: gridworld example illustrating the speedup of policy learning due to $n$-step methods.]
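A minimal Python sketch of tabular $n$-step Sarsa with an $\varepsilon$-greedy behavior, using the return (7.4) and update (7.5). The `env` interface and `env.num_actions` attribute are assumptions for illustration, not part of the book.

```python
import numpy as np
from collections import defaultdict

def n_step_sarsa(env, n=4, alpha=0.1, gamma=1.0, epsilon=0.1, num_episodes=1000):
    """Tabular n-step Sarsa (a sketch under the assumed env interface)."""
    Q = defaultdict(lambda: np.zeros(env.num_actions))

    def eps_greedy(state):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state = env.reset()
        states, actions, rewards = [state], [eps_greedy(state)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                next_state, reward, done = env.step(actions[t])
                states.append(next_state)
                rewards.append(reward)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(next_state))
            tau = t - n + 1
            if tau >= 0:
                # n-step return (7.4), truncated at termination
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
                s, a = states[tau], actions[tau]
                Q[s][a] += alpha * (G - Q[s][a])          # update (7.5)
            if tau == T - 1:
                break
            t += 1
    return Q
```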

Exercise 7.4
Prove that the $n$-step return of Sarsa (7.4) can be written exactly in terms of a novel TD error, as

$$G_{t:t+n} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,\,T)-1} \gamma^{k-t} \bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)\bigr]. \tag{7.6}$$
ANSWER

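A sketch of the telescoping argument (writing $h = \min(t+n, T)$ and taking all action values at the terminal state to be zero): the sum of the novel TD errors is

$$\begin{aligned} \sum_{k=t}^{h-1} \gamma^{k-t}\bigl[R_{k+1} + \gamma Q_k(S_{k+1}, A_{k+1}) - Q_{k-1}(S_k, A_k)\bigr] &= \sum_{k=t}^{h-1} \gamma^{k-t} R_{k+1} + \sum_{k=t}^{h-1} \bigl(\gamma^{k-t+1} Q_k(S_{k+1}, A_{k+1}) - \gamma^{k-t} Q_{k-1}(S_k, A_k)\bigr) \\ &= \sum_{k=t}^{h-1} \gamma^{k-t} R_{k+1} + \gamma^{h-t} Q_{h-1}(S_h, A_h) - Q_{t-1}(S_t, A_t) \\ &= G_{t:t+n} - Q_{t-1}(S_t, A_t), \end{aligned}$$

where the last step uses the definition (7.4) when $h = t+n < T$, and the zero value of the terminal state when $h = T$ (so that the remaining terms equal $G_t$). Rearranging gives (7.6). Note that, unlike Exercise 7.1, no assumption of unchanging estimates is needed here, because the TD error at step $k$ is defined with the time-indexed $Q_k$ and $Q_{k-1}$.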


$n$-step Expected Sarsa

The backup diagram for the $n$-step version of Expected Sarsa is shown on the far right in Figure 7.3. It consists of a linear string of sample actions and states, just as in $n$-step Sarsa, except that its last element is a branch over all action possibilities weighted by their probability under $\pi$. This algorithm can be described by the same equation as $n$-step Sarsa (above) except with the $n$-step return redefined as

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} \bar{V}_{t+n-1}(S_{t+n}), \qquad t+n < T, \tag{7.7}$$
(with $G_{t:t+n} = G_t$ if $t+n \ge T$) where

$$\bar{V}_t(s) = \sum_a \pi(a \mid s)\, Q_t(s, a), \qquad \text{for all } s. \tag{7.8}$$
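A small Python sketch of the two quantities just defined: the expected approximate value (7.8) and the Expected Sarsa $n$-step return (7.7). Here `Q` is assumed to be a mapping from states to numpy arrays of action values and `pi(state)` a callable returning the corresponding action-probability vector; both names are illustrative only.

```python
import numpy as np

def v_bar(Q, pi, state):
    """Expected approximate value under pi: V_bar(s) = sum_a pi(a|s) Q(s, a)   (7.8)."""
    return float(np.dot(pi(state), Q[state]))

def expected_sarsa_return(rewards, bootstrap_state, Q, pi, gamma, terminated):
    """n-step Expected Sarsa return (7.7): rewards = [R_{t+1}, ..., R_{t+n}];
    bootstrap with gamma^n * V_bar(S_{t+n}) unless the episode terminated first."""
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    if not terminated:
        G += gamma ** len(rewards) * v_bar(Q, pi, bootstrap_state)
    return G
```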

$n$-step Off-policy Learning

In $n$-step methods, returns are constructed over $n$ steps, so we are interested in the relative probability of just those $n$ actions. For example, to make a simple off-policy version of $n$-step TD, the update for time $t$ (actually made at time $t+n$) can simply be weighted by $\rho_{t:t+n-1}$:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\, \rho_{t:t+n-1} \bigl[G_{t:t+n} - V_{t+n-1}(S_t)\bigr], \qquad 0 \le t < T, \tag{7.9}$$
where $\rho_{t:t+n-1}$, called the importance sampling ratio, is the relative probability under the two policies of taking the $n$ actions from $A_t$ to $A_{t+n-1}$:

$$\rho_{t:h} = \prod_{k=t}^{\min(h,\,T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}. \tag{7.10}$$
Similarly, our previous $n$-step Sarsa update can be completely replaced by a simple off-policy form:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n} \bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr], \qquad 0 \le t < T. \tag{7.11}$$
Note that the importance sampling ratio here starts and ends one step later than for $n$-step TD (7.9). This is because here we are updating a state–action pair. We do not have to care how likely we were to select the action; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions.

[Pseudocode: Off-policy $n$-step Sarsa for estimating $Q \approx q_*$ or $q_\pi$.]
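A sketch of the importance-sampling ratio (7.10) and the off-policy $n$-step Sarsa update (7.11) in Python. The target policy `pi` and behavior policy `b` are assumed to be callables returning action-probability vectors for a state, and the `states`/`actions` buffers are indexed by time step; all of these names are illustrative.

```python
def importance_ratio(pi, b, states, actions, start, end, T):
    """rho_{start:end} = prod_{k=start}^{min(end, T-1)} pi(A_k|S_k) / b(A_k|S_k)   (7.10)."""
    rho = 1.0
    for k in range(start, min(end, T - 1) + 1):
        rho *= pi(states[k])[actions[k]] / b(states[k])[actions[k]]
    return rho

def off_policy_sarsa_update(Q, states, actions, G, tau, n, T, alpha, pi, b):
    """Off-policy n-step Sarsa update (7.11), weighted by rho_{tau+1 : tau+n}."""
    rho = importance_ratio(pi, b, states, actions, tau + 1, tau + n, T)
    s, a = states[tau], actions[tau]
    Q[s][a] += alpha * rho * (G - Q[s][a])
```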


The off-policy version of $n$-step Expected Sarsa would use the same update as above for $n$-step Sarsa except that the importance sampling ratio would have one less factor in it. That is, the above equation would use $\rho_{t+1:t+n-1}$ instead of $\rho_{t+1:t+n}$, and of course it would use the Expected Sarsa version of the $n$-step return (7.7). This is because in Expected Sarsa all possible actions are taken into account in the last state; the one actually taken has no effect and does not have to be corrected for.

Off-policy Learning Without Importance Sampling: The $n$-step Tree Backup Algorithm (n 步树回溯算法)

The idea of the algorithm is suggested by the 3-step tree-backup backup diagram:

[Backup diagram: the 3-step tree-backup update.]

Because we have no sample data for the unselected actions, we bootstrap and use the estimates of their values in forming the target for the update. This slightly extends the idea of a backup diagram.


We can think of the 3-step tree-backup update as consisting of 6 half-steps, alternating between sample half-steps from an action to a subsequent state, and expected half-steps considering from that state all possible actions with their probabilities of occurring under the policy.

Now let us develop the detailed equations for the $n$-step tree-backup algorithm. The one-step return (target) is the same as that of Expected Sarsa,

$$G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a), \tag{7.15}$$
for $t < T-1$, and the two-step tree-backup return is

$$\begin{aligned} G_{t:t+2} &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \Bigl( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2})\, Q_{t+1}(S_{t+2}, a) \Bigr) \\ &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+2}, \end{aligned}$$
for $t < T-2$. The latter form suggests the general recursive definition of the tree-backup $n$-step return:

$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}, \tag{7.16}$$
for $t < T-1,\ n \ge 2$, with the $n = 1$ case handled by (7.15) except for $G_{T-1:t+n} = R_T$. This target is then used with the usual action-value update rule from $n$-step Sarsa:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \bigl[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)\bigr],$$
for $0 \le t < T$.

[Pseudocode: $n$-step Tree Backup for estimating $Q \approx q_*$ or $q_\pi$.]
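A recursive Python sketch of the tree-backup return (7.15)/(7.16). As before, `Q` is assumed to be a mapping from states to numpy arrays, `pi(state)` returns an action-probability vector, and the buffers are indexed by time (`rewards[k]` holds $R_k$, etc.); these interface choices are assumptions, not the book's.

```python
def tree_backup_return(t, n, T, states, actions, rewards, Q, pi, gamma):
    """Tree-backup n-step return G_{t:t+n} via the recursion (7.16)."""
    if t == T - 1:                                  # terminal case: G_{T-1:t+n} = R_T
        return rewards[T]
    probs = pi(states[t + 1])
    # expectation over the actions NOT taken at time t+1
    leaf = sum(probs[a] * Q[states[t + 1]][a]
               for a in range(len(probs)) if a != actions[t + 1])
    if n == 1:                                      # one-step case (7.15): full expectation
        leaf += probs[actions[t + 1]] * Q[states[t + 1]][actions[t + 1]]
        return rewards[t + 1] + gamma * leaf
    return (rewards[t + 1] + gamma * leaf
            + gamma * probs[actions[t + 1]]
            * tree_backup_return(t + 1, n - 1, T, states, actions, rewards, Q, pi, gamma))
```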

Exercise 7.11
Show that if the approximate action values are unchanging, then the tree-backup return (7.16) can be written as a sum of expectation-based TD errors:

$$G_{t:t+n} = Q(S_t, A_t) + \sum_{k=t}^{\min(t+n-1,\,T-1)} \delta_k \prod_{i=t+1}^{k} \gamma\, \pi(A_i \mid S_i),$$
where $\delta_t = R_{t+1} + \gamma \bar{V}_t(S_{t+1}) - Q(S_t, A_t)$ and $\bar{V}_t$ is given by (7.8).

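A sketch of the derivation, assuming the action values are held fixed at $Q$: substituting $\sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) = \bar{V}(S_{t+1}) - \pi(A_{t+1} \mid S_{t+1})\, Q(S_{t+1}, A_{t+1})$ into (7.16) gives

$$G_{t:t+n} = Q(S_t, A_t) + \delta_t + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \bigl(G_{t+1:t+n} - Q(S_{t+1}, A_{t+1})\bigr).$$

Unrolling this recursion, each step contributes one more $\delta_k$ multiplied by the accumulated factor $\prod_{i=t+1}^{k} \gamma\, \pi(A_i \mid S_i)$, and the base cases ($G_{t:t+1}$ from (7.15), or $G_{T-1:t+n} = R_T$ with $\bar{V}(S_T) = 0$) contribute their own $\delta$ with no leftover term, which yields the stated sum of expectation-based TD errors.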

A Unifying Algorithm: $n$-step $Q(\sigma)$

So far in this chapter we have considered three different kinds of action-value algorithms, corresponding to the first three backup diagrams shown in Figure 7.5.

[Figure 7.5: the backup diagrams of the three kinds of $n$-step action-value updates considered so far (4-step case), plus a fourth kind of update, $n$-step $Q(\sigma)$, that unifies them all by choosing on a state-by-state basis whether to sample ($\sigma_t = 1$) or take the expectation ($\sigma_t = 0$).]


To what extent can these algorithms be unified?


To increase the possibilities even further we can consider a continuous variation between sampling and expectation.


Now let us develop the equations of $n$-step $Q(\sigma)$.

First we write the tree-backup $n$-step return (7.16) in terms of the horizon $h = t+n$ and the expected approximate value $\bar{V}$ (7.8):

$$\begin{aligned} G_{t:h} &= R_{t+1} + \gamma \sum_{a \ne A_{t+1}} \pi(a \mid S_{t+1})\, Q_{h-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h} \\ &= R_{t+1} + \gamma \bar{V}_{h-1}(S_{t+1}) - \gamma\, \pi(A_{t+1} \mid S_{t+1})\, Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:h} \\ &= R_{t+1} + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \bigl(G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1})\bigr) + \gamma \bar{V}_{h-1}(S_{t+1}), \end{aligned}$$
after which it is exactly like the $n$-step return for Sarsa with control variates (7.14) except with the action probability $\pi(A_{t+1} \mid S_{t+1})$ substituted for the importance-sampling ratio $\rho_{t+1}$. For $Q(\sigma)$, we slide linearly between these two cases:

$$G_{t:h} = R_{t+1} + \gamma \bigl(\sigma_{t+1}\, \rho_{t+1} + (1 - \sigma_{t+1})\, \pi(A_{t+1} \mid S_{t+1})\bigr) \bigl(G_{t+1:h} - Q_{h-1}(S_{t+1}, A_{t+1})\bigr) + \gamma \bar{V}_{h-1}(S_{t+1}),$$
for $t < h \le T$. The recursion ends with $G_{h:h} = Q_{h-1}(S_h, A_h)$ if $h < T$, or with $G_{T-1:T} = R_T$ if $h = T$.

Then we use the earlier update for $n$-step Sarsa without importance-sampling ratios (7.5) instead of (7.11), because now the ratios are incorporated in the $n$-step return.

[Pseudocode: Off-policy $n$-step $Q(\sigma)$ for estimating $Q \approx q_*$ or $q_\pi$.]
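A Python sketch of the $Q(\sigma)$ return, computed backward from the horizon in the spirit of the recursion above. Here `sigma` is a sequence of values in $[0, 1]$ indexed by time, `pi` and `b` return action-probability vectors, and `Q` maps states to numpy arrays; the interface is assumed for illustration.

```python
import numpy as np

def q_sigma_return(tau, h, T, states, actions, rewards, Q, pi, b, sigma, gamma):
    """Q(sigma) return G_{tau:h}, built backward from the horizon h = tau + n."""
    if h == T:
        G = rewards[T]                              # recursion ends with G_{T-1:T} = R_T
        start = T - 1
    else:
        G = Q[states[h]][actions[h]]                # recursion ends with G_{h:h} = Q(S_h, A_h)
        start = h
    for k in range(start, tau, -1):                 # k = start, start-1, ..., tau+1
        s, a = states[k], actions[k]
        probs = pi(s)
        v_bar = float(np.dot(probs, Q[s]))          # expected value V_bar(S_k)   (7.8)
        rho = probs[a] / b(s)[a]                    # importance ratio at step k
        weight = sigma[k] * rho + (1.0 - sigma[k]) * probs[a]
        G = rewards[k] + gamma * weight * (G - Q[s][a]) + gamma * v_bar
    return G
```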

In Chapter 12, we will see how multi-step TD methods can be implemented with minimal memory and computational complexity using eligibility traces.

Source: https://blog.csdn.net/weixin_42437114/article/details/109739303