
Chapter 7 Temporal-Difference learning

RyanLee_ljx · about 9 min · RL


In this chapter we first introduce TD learning, which refers to a wide range of algorithms that can solve the Bellman equation of a given policy $\pi$ without a model. In this first section, "TD learning" specifically denotes the classic algorithm for estimating state values; the other algorithms in the TD family are introduced in the next section.

Basic TD learning algorithms

Recall the Robbins-Monro (RM) algorithm we learned in Chapter 6:

$$w_{k+1} = w_k - \alpha_k \tilde g(w_k, \eta_k),$$

where $\tilde g$ is a noisy sample of the quantity that would otherwise require a model to compute (e.g., the expectation in mean estimation, or the expected gradient in gradient descent).

In TD learning, our objective is to estimate the state values. Recall that in model-based methods we derive state values by solving the Bellman equation (BE). Here we have no model (TD learning is model-free), so TD learning is, in essence, a special stochastic approximation algorithm for solving the BE. Recall that the definition of the state value is

$$v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s], \quad s \in \mathcal{S}.$$

Using $\mathbb{E}[G_{t+1} \mid S_{t+1}] = v_\pi(S_{t+1})$, we can write it in another form:

$$v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s], \quad s \in \mathcal{S}.$$

As we can see, the quantity inside the expectation is exactly the definition of the action value, so the state value can also be written as:

$$v_{\pi}(s) = \mathbb{E}[q_\pi(s, A)], \quad s \in \mathcal{S},$$

which we will see again in Chapter 10. From this perspective, the relationship between action values and state values is also clear.

So our objective is to find the root $v_\pi(s)$ of the equation

$$g(v_\pi(s)) \doteq v_{\pi}(s) - \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] = 0, \quad s \in \mathcal{S}.$$

A sample of $g(v_\pi(s))$ is $v_{t}(s_t) - \big(r_{t+1} + \gamma v_{t}(s_{t+1})\big)$. Replacing the expectation in the RM algorithm with this sample $\tilde g$, we obtain the TD learning algorithm:

$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \overbrace{v_t(s_t) - \big( \underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t} \big)}^{\text{TD error } \delta_t} \Big],$$

Here, $\bar v_t$ is the target value toward which the algorithm drives $v(s_t)$, so it is called the TD target. The difference between $v(s_t)$ and the TD target is called the TD error $\delta_t$. It reflects the discrepancy between two time steps $t$ and $t + 1$, which is why this algorithm is called temporal-difference learning. From another perspective, the TD error can be interpreted as the innovation, i.e., the new information obtained from the experience sample $(s_t, r_{t+1}, s_{t+1})$.

Compared with MC learning, TD learning is incremental: it can update the state/action values immediately after receiving an experience sample. MC learning is non-incremental: it must wait until an episode has been completely collected and then compute the discounted returns backward.
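To make the update rule concrete, below is a minimal Python sketch of TD(0) evaluation on a hypothetical two-state loop (the chain, rewards, and step size are illustrative assumptions, not from the book). The true values can be computed directly from the Bellman equation, so we can check that the estimates converge to them.

```python
# Hypothetical 2-state loop under a fixed policy: state 0 -> state 1 with
# reward 0, state 1 -> state 0 with reward 1. With gamma = 0.9 the Bellman
# equation gives v(0) = 0.9 * v(1) and v(1) = 1 + 0.9 * v(0), so
# v(1) = 100/19 ≈ 5.26 and v(0) = 90/19 ≈ 4.74.
gamma = 0.9
alpha = 0.1          # constant step size (a decaying one satisfies the RM conditions)
v = [0.0, 0.0]       # current estimates v_t(s)

s = 0
for t in range(20000):
    s_next = 1 - s
    r = 1.0 if s == 1 else 0.0
    td_target = r + gamma * v[s_next]   # TD target  \bar v_t
    td_error = v[s] - td_target         # TD error   \delta_t
    v[s] -= alpha * td_error            # move v(s_t) toward the target
    s = s_next

print(round(v[0], 2), round(v[1], 2))
```

Note that each update touches only the visited state $s_t$; all other entries of $v$ stay unchanged, exactly as in the update rule above.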

The convergence of TD learning is guaranteed by the RM algorithm; see the book for the proof.

Other TD learning algorithms

The TD algorithm can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. In this section, we introduce the TD algorithms that can directly estimate action values.

Recall the pattern observed above: all TD learning algorithms are based on the RM algorithm, and the RM update depends on the quantity we want to estimate, which here is the action value. Once the expression of that quantity is determined, we can write down the corresponding RM update. For estimating action values with TD learning, we introduce three algorithms: Sarsa, n-step Sarsa, and Q-learning.

Sarsa

Sarsa replaces state-value estimation with action-value estimation. The action value satisfies:

$$q_\pi(s, a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s, a], \quad \text{for all } (s, a).$$

This can be understood as follows: the q-value equals the immediate reward plus the discounted future reward from the next state. Since the next state is random, we can view the expectation in two stages: first fix the next state $s'$ and take the expectation over the actions available in $s'$ to obtain the state value of $s'$; then take the expectation over $s'$ to obtain the expected future reward from $(s, a)$.

So the expression of Sarsa can be written as:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \big], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t). \end{aligned}$$

Sarsa requires the experience sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, i.e., (state, action, reward, state, action), which is where the name Sarsa comes from.
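As a concrete illustration, here is a small Python sketch of Sarsa with an $\epsilon$-greedy behavior policy on a hypothetical 4-cell corridor (states 0..3, actions left/right, reward 1 for reaching state 3, step cost -0.1). The environment and all hyperparameters are made up for this example.

```python
import random

random.seed(0)
N, GOAL = 4, 3
gamma, alpha, eps = 0.9, 0.1, 0.1
q = [[0.0, 0.0] for _ in range(N)]   # q[s][a], a = 0 (left) or 1 (right)

def step(s, a):
    """Deterministic corridor dynamics: reward 1 at the goal, -0.1 otherwise."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL

def choose(s):
    """Epsilon-greedy behavior policy derived from the current q-values."""
    if random.random() < eps:
        return random.randrange(2)
    return 0 if q[s][0] > q[s][1] else 1

for _ in range(500):
    s, a = 0, choose(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = choose(s2)
        # Sarsa bootstraps with the action a2 actually selected at s2
        target = r + (0.0 if done else gamma * q[s2][a2])
        q[s][a] -= alpha * (q[s][a] - target)
        s, a = s2, a2

greedy = [0 if q[s][0] > q[s][1] else 1 for s in range(GOAL)]
print(greedy)   # the learned greedy policy should point right in every cell
```

Note that each update consumes exactly one tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, so learning proceeds step by step within an episode.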

*Figure: the Sarsa algorithm.*

n-step Sarsa

This section introduces n-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of n-step Sarsa. Recall that the definition of the action value is

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a], \tag{7.16}$$

where GtG_t is the discounted return satisfying

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

In fact, GtG_t can also be decomposed into different forms:

$$\begin{aligned} \text{Sarsa} \longleftarrow \quad & G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\ & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\ & \quad \vdots \\ n\text{-step Sarsa} \longleftarrow \quad & G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\ & \quad \vdots \\ \text{MC} \longleftarrow \quad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \end{aligned}$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \dots = G_t^{(n)} = \dots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$. Substituting the different decompositions of $G_t$ into $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ results in different algorithms.

  • When $n = 1$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a].$$

The corresponding stochastic approximation algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \Big],$$

which is the Sarsa algorithm.

  • When $n = \infty$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a].$$

The corresponding algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots,$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.

  • For a general value of $n$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a].$$

The corresponding algorithm for solving the above equation is

$$\begin{aligned} q_{t+1}(s_t, a_t) = {} & q_t(s_t, a_t) \\ & - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - (r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})) \Big]. \end{aligned} \tag{7.17}$$

This algorithm is called n-step Sarsa.

In summary, n-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$). To implement the n-step Sarsa algorithm, we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we have to wait until time $t + n$ to update the q-value of $(s_t, a_t)$.

Accordingly, the q-value update in n-step Sarsa can be rewritten as

$$\begin{aligned} q_{t+n}(s_t, a_t) = {} & q_{t+n-1}(s_t, a_t) \\ & - \alpha_{t+n-1}(s_t, a_t) \Big[ q_{t+n-1}(s_t, a_t) - (r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})) \Big], \end{aligned}$$

where $q_{t+n}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t + n$. Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that its performance lies between the two. In particular, if $n$ is selected as a large number, n-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is selected to be small, n-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the n-step Sarsa algorithm presented here merely performs policy evaluation; it must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here.
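Computationally, the n-step TD target $r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ can be built by a backward recursion over the buffered rewards. A small Python sketch (the helper name and numbers are hypothetical):

```python
def n_step_target(rewards, q_boot, gamma=0.9):
    """rewards = [r_{t+1}, ..., r_{t+n}]; q_boot = q_t(s_{t+n}, a_{t+n}).

    Backward recursion g <- r + gamma * g, starting from the bootstrap
    value, yields r_{t+1} + gamma*r_{t+2} + ... + gamma^n * q_boot.
    """
    g = q_boot
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# n = 2 example: 1 + 0.9*2 + 0.9^2 * 3 = 5.23
target = n_step_target([1.0, 2.0], 3.0)
print(round(target, 2))  # 5.23

# The q-value of (s_t, a_t) is then updated only at time t + n:
# q[s_t][a_t] -= alpha * (q[s_t][a_t] - target)
```

The commented last line is the delayed update described above: the tuple $(s_t, a_t)$ waits $n$ steps for its target to become available.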

Q-learning

Sarsa-based algorithms solve the BE of a given policy and are followed by policy improvement to iteratively approach the optimal policy. Q-learning instead directly solves the Bellman optimality equation (BOE). The BOE seeks the policy that maximizes the state value, and the state value is maximized by choosing the action with the maximum action value. So the BOE in q-value form can be written as

$$q(s, a) = \mathbb{E} \Big[ R_{t+1} + \gamma \max_a q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \Big].$$

So Q-learning can be written as:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \bigg[ q_t(s_t, a_t) - \Big( r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a) \Big) \bigg], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t). \end{aligned}$$
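Here is a minimal Python sketch of Q-learning on a hypothetical 4-cell corridor (states 0..3, actions left/right, reward 1 for reaching state 3, step cost -0.1; the environment and hyperparameters are illustrative assumptions). Note that the behavior policy is uniformly random, yet the greedy policy extracted from the learned q-values is still optimal, which previews the off-policy discussion below.

```python
import random

random.seed(0)
N, GOAL = 4, 3
gamma, alpha = 0.9, 0.1
q = [[0.0, 0.0] for _ in range(N)]   # q[s][a], a = 0 (left) or 1 (right)

def step(s, a):
    """Deterministic corridor dynamics: reward 1 at the goal, -0.1 otherwise."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL

for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(2)          # behavior policy: uniformly random
        s2, r, done = step(s, a)
        # Q-learning bootstraps with max_a q(s', a): the greedy target policy
        target = r + (0.0 if done else gamma * max(q[s2]))
        q[s][a] -= alpha * (q[s][a] - target)
        s = s2

greedy = [q[s].index(max(q[s])) for s in range(GOAL)]
print(greedy)   # the greedy policy should point right in every non-goal cell
```

Unlike Sarsa, the update never asks which action the behavior policy will take next; it only needs the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$.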

Proof of the BOE in q-value form

By the definition of expectation, we have

$$q(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) \max_{a' \in \mathcal{A}(s')} q(s', a').$$

Taking the maximum of both sides of the equation gives

$$\max_{a \in \mathcal{A}(s)} q(s, a) = \max_{a \in \mathcal{A}(s)} \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) \max_{a' \in \mathcal{A}(s')} q(s', a') \right].$$

By denoting $v(s) \doteq \max_{a \in \mathcal{A}(s)} q(s, a)$, we can rewrite the above equation as

$$\begin{aligned} v(s) &= \max_{a \in \mathcal{A}(s)} \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v(s') \right] \\ &= \max_\pi \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v(s') \right], \end{aligned}$$

which is clearly the BOE.

On-policy and off-policy

Two policies exist in any reinforcement learning task: a behavior policy and a target policy. The behavior policy is the one used to generate experience samples. The target policy is the one that is constantly updated to converge to an optimal policy. When the behavior policy is the same as the target policy, such a learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.

The advantage of off-policy learning is that it can learn optimal policies based on the experience samples generated by other policies, which may be, for example, a policy executed by a human operator. As an important case, the behavior policy can be selected to be exploratory. For example, if we would like to estimate the action values of all state-action pairs, we must generate episodes visiting every state-action pair sufficiently many times. Although Sarsa uses $\epsilon$-greedy policies to maintain a certain exploration ability, the value of $\epsilon$ is usually small and hence the exploration ability is limited. By contrast, if we use a policy with a strong exploration ability to generate episodes and then use off-policy learning to learn optimal policies, the learning efficiency is significantly increased.

Clearly, Sarsa and MC learning are both on-policy: they directly update the policy after obtaining new state/action values and then use the new policy to generate the next experience samples.

Q-learning is off-policy in essence, because its target policy is updated via a purely greedy max while its behavior policy can be $\epsilon$-greedy or any other exploratory policy. It becomes on-policy only when the behavior policy is itself the greedy policy, but then the exploration ability, and hence the performance, degrades.
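In code, the on-policy/off-policy distinction between Sarsa and Q-learning reduces to how the TD target bootstraps. A tiny Python sketch with made-up numbers:

```python
gamma = 0.9
r = 1.0
q_next = {0: 0.2, 1: 0.8}   # hypothetical q_t(s_{t+1}, a) for two actions
a_taken = 0                  # suppose the exploratory behavior policy sampled action 0

# Sarsa (on-policy): bootstrap with the action the behavior policy actually took
sarsa_target = r + gamma * q_next[a_taken]

# Q-learning (off-policy): bootstrap with the greedy target policy,
# regardless of which action the behavior policy took
q_learning_target = r + gamma * max(q_next.values())

print(round(sarsa_target, 2), round(q_learning_target, 2))  # 1.18 1.72
```

Because the Q-learning target never depends on the sampled next action, any sufficiently exploratory behavior policy can supply the data.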

*Figure: Q-learning (on-policy version).*

*Figure: Q-learning (off-policy version).*

Another concept that may be confused with on-policy/off-policy is online/offline. Online learning refers to the case where the agent updates the values and policies while interacting with the environment. Offline learning refers to the case where the agent updates the values and policies using pre-collected experience data without interacting with the environment. If an algorithm is on-policy, it can be implemented in an online fashion but cannot use pre-collected data generated by other policies. If an algorithm is off-policy, it can be implemented in either an online or an offline fashion.

Summary

| Algorithm | Expression of the TD target $\bar{q}_t$ |
| --- | --- |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ |

| Algorithm | Equation to be solved |
| --- | --- |
| Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| $n$-step Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Q-learning | BOE: $q(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a]$ |
| Monte Carlo | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a]$ |

*Table: A unified point of view of TD algorithms.*
