
Chapter 7 Temporal-Difference learning

RyanLee_ljx · about 9 min · RL


In this chapter we first introduce TD learning, which refers to a wide range of algorithms that can solve the Bellman equation of a given policy $\pi$ without a model. In this first section, "TD learning" specifically denotes the classic algorithm for estimating state values; the other algorithms in the TD family are introduced in the next section.

Basic TD learning algorithms

Recall the Robbins-Monro (RM) algorithm we learned in Chapter 6:

$$w_{k+1} = w_k - \alpha_k \tilde g(w_k, \eta_k),$$

where $\tilde g$ is a noisy sample of the quantity that would otherwise require a model to compute (e.g., the expectation in mean estimation, or the expected gradient in gradient descent).

In TD learning, our objective is to estimate the state values. Recall that in model-based methods we derive state values by solving the Bellman equation (BE). Here we have no model (TD learning is model-free), so TD learning is, in essence, a special stochastic approximation algorithm for solving the BE. Recall that the definition of the state value is

$$v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s], \quad s \in \mathcal{S}.$$

Using $\mathbb{E}[G_{t+1} \mid S_{t+1}] = v_\pi(S_{t+1})$, we can write it in another form:

$$v_{\pi}(s) = \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s], \quad s \in \mathcal{S}.$$

As we can see, the quantity inside the expectation is exactly the definition of the action value, so the state value can also be written as:

$$v_{\pi}(s) = \mathbb{E}[q_\pi(s, A)], \quad s \in \mathcal{S},$$

which we will see again in Chapter 10. From this perspective, the relationship between action values and state values is also clear.

So our objective is to find the root $v_\pi(s)$ of the equation

$$g(v_\pi(s)) \doteq v_{\pi}(s) - \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s] = 0, \quad s \in \mathcal{S}.$$

A sample of $g(v_\pi(s))$ is $v_{t}(s_t) - \big(r_{t+1} + \gamma v_{t}(s_{t+1})\big)$. Replacing the expectation in the RM algorithm with this sample $\tilde g$, we obtain the TD learning algorithm:

$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t) \Big[ \overbrace{v_t(s_t) - \big( \underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t} \big)}^{\text{TD error } \delta_t} \Big],$$

Here, $\bar v_t$ is the target value toward which the algorithm drives $v(s_t)$, so it is called the TD target. The difference between $v(s_t)$ and the TD target is called the TD error $\delta_t$. It reflects the discrepancy between two time steps $t$ and $t + 1$, which is why this algorithm is called temporal-difference learning. From another perspective, the TD error can be interpreted as the innovation, i.e., the new information obtained from the experience sample $(s_t, r_{t+1}, s_{t+1})$.

Compared with MC learning, TD learning is incremental: it can update the state/action values immediately after receiving an experience sample. MC learning is non-incremental: it must wait until an episode has been completely collected and then compute the discounted returns backward.
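To make the update rule concrete, below is a minimal Python sketch of TD(0) evaluation on a hypothetical two-state loop (the chain, rewards, and step size are illustrative assumptions, not from the book). The true values can be computed directly from the Bellman equation, so we can check that the estimates converge to them.

```python
# Hypothetical 2-state loop under a fixed policy: state 0 -> state 1 with
# reward 0, state 1 -> state 0 with reward 1. With gamma = 0.9 the Bellman
# equation gives v(0) = 0.9 * v(1) and v(1) = 1 + 0.9 * v(0), so
# v(1) = 100/19 ≈ 5.26 and v(0) = 90/19 ≈ 4.74.
gamma = 0.9
alpha = 0.1          # constant step size (a decaying one satisfies the RM conditions)
v = [0.0, 0.0]       # current estimates v_t(s)

s = 0
for t in range(20000):
    s_next = 1 - s
    r = 1.0 if s == 1 else 0.0
    td_target = r + gamma * v[s_next]   # TD target  \bar v_t
    td_error = v[s] - td_target         # TD error   \delta_t
    v[s] -= alpha * td_error            # move v(s_t) toward the target
    s = s_next

print(round(v[0], 2), round(v[1], 2))
```

Note that each update touches only the visited state $s_t$; all other entries of $v$ stay unchanged, exactly as in the update rule above.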

The convergence of TD learning is guaranteed by the RM algorithm; see the book for the proof.

Other TD learning algorithms

The TD algorithm can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. In this section, we introduce the TD algorithms that can directly estimate action values.

Recall the pattern observed above: all TD learning algorithms are based on the RM algorithm, and the RM update depends on the quantity we want to estimate, which here is the action value. Once the expression of that quantity is determined, we can write down the corresponding RM update. For estimating action values with TD learning, we introduce three algorithms: Sarsa, n-step Sarsa, and Q-learning.

Sarsa

Sarsa replaces state-value estimation with action-value estimation. The action value satisfies:

$$q_\pi(s, a) = \mathbb{E}[R + \gamma q_\pi(S', A') \mid s, a], \quad \text{for all } (s, a).$$

This can be understood as follows: the q-value equals the immediate reward plus the discounted future reward from the next state. Since the next state is random, we can view the expectation in two stages: first fix the next state $s'$ and take the expectation over the actions available in $s'$ to obtain the state value of $s'$; then take the expectation over $s'$ to obtain the expected future reward from $(s, a)$.

So the expression of Sarsa can be written as:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \big[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \big], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t). \end{aligned}$$

Sarsa requires the experience sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, i.e., (state, action, reward, state, action), which is where the name Sarsa comes from.
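As a concrete illustration, here is a small Python sketch of Sarsa with an $\epsilon$-greedy behavior policy on a hypothetical 4-cell corridor (states 0..3, actions left/right, reward 1 for reaching state 3, step cost -0.1). The environment and all hyperparameters are made up for this example.

```python
import random

random.seed(0)
N, GOAL = 4, 3
gamma, alpha, eps = 0.9, 0.1, 0.1
q = [[0.0, 0.0] for _ in range(N)]   # q[s][a], a = 0 (left) or 1 (right)

def step(s, a):
    """Deterministic corridor dynamics: reward 1 at the goal, -0.1 otherwise."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL

def choose(s):
    """Epsilon-greedy behavior policy derived from the current q-values."""
    if random.random() < eps:
        return random.randrange(2)
    return 0 if q[s][0] > q[s][1] else 1

for _ in range(500):
    s, a = 0, choose(0)
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = choose(s2)
        # Sarsa bootstraps with the action a2 actually selected at s2
        target = r + (0.0 if done else gamma * q[s2][a2])
        q[s][a] -= alpha * (q[s][a] - target)
        s, a = s2, a2

greedy = [0 if q[s][0] > q[s][1] else 1 for s in range(GOAL)]
print(greedy)   # the learned greedy policy should point right in every cell
```

Note that each update consumes exactly one tuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, so learning proceeds step by step within an episode.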

*Figure: the Sarsa algorithm.*

n-step Sarsa

This section introduces n-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of n-step Sarsa. Recall that the definition of the action value is

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a], \tag{7.16}$$

where GtG_t is the discounted return satisfying

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

In fact, GtG_t can also be decomposed into different forms:

$$\begin{aligned} \text{Sarsa} \longleftarrow \quad & G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\ & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\ & \quad \vdots \\ n\text{-step Sarsa} \longleftarrow \quad & G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\ & \quad \vdots \\ \text{MC} \longleftarrow \quad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots \end{aligned}$$

It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = \dots = G_t^{(n)} = \dots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$. Substituting the different decompositions of $G_t$ into $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ results in different algorithms.

  • When $n = 1$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a].$$

The corresponding stochastic approximation algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \Big],$$

which is the Sarsa algorithm.

  • When $n = \infty$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a].$$

The corresponding algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots,$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.

  • For a general value of $n$, we have

$$q_\pi(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a].$$

The corresponding algorithm for solving the above equation is

$$\begin{aligned} q_{t+1}(s_t, a_t) = {} & q_t(s_t, a_t) \\ & - \alpha_t(s_t, a_t) \Big[ q_t(s_t, a_t) - (r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})) \Big]. \end{aligned} \tag{7.17}$$

This algorithm is called n-step Sarsa.

In summary, n-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$). To implement the n-step Sarsa algorithm, we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we have to wait until time $t + n$ to update the q-value of $(s_t, a_t)$.

Accordingly, the q-value update in n-step Sarsa can be rewritten as

$$\begin{aligned} q_{t+n}(s_t, a_t) = {} & q_{t+n-1}(s_t, a_t) \\ & - \alpha_{t+n-1}(s_t, a_t) \Big[ q_{t+n-1}(s_t, a_t) - (r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})) \Big], \end{aligned}$$

where $q_{t+n}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t + n$. Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that its performance lies between the two. In particular, if $n$ is selected as a large number, n-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is selected to be small, n-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the n-step Sarsa algorithm presented here merely performs policy evaluation; it must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here.
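Computationally, the n-step TD target $r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ can be built by a backward recursion over the buffered rewards. A small Python sketch (the helper name and numbers are hypothetical):

```python
def n_step_target(rewards, q_boot, gamma=0.9):
    """rewards = [r_{t+1}, ..., r_{t+n}]; q_boot = q_t(s_{t+n}, a_{t+n}).

    Backward recursion g <- r + gamma * g, starting from the bootstrap
    value, yields r_{t+1} + gamma*r_{t+2} + ... + gamma^n * q_boot.
    """
    g = q_boot
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# n = 2 example: 1 + 0.9*2 + 0.9^2 * 3 = 5.23
target = n_step_target([1.0, 2.0], 3.0)
print(round(target, 2))  # 5.23

# The q-value of (s_t, a_t) is then updated only at time t + n:
# q[s_t][a_t] -= alpha * (q[s_t][a_t] - target)
```

The commented last line is the delayed update described above: the tuple $(s_t, a_t)$ waits $n$ steps for its target to become available.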

Q-learning

Sarsa-based algorithms solve the BE of a given policy and are followed by policy improvement to iteratively approach the optimal policy. Q-learning instead directly solves the Bellman optimality equation (BOE). The BOE seeks the policy that maximizes the state value, and the state value is maximized by choosing the action with the maximum action value. So the BOE in q-value form can be written as

$$q(s, a) = \mathbb{E} \Big[ R_{t+1} + \gamma \max_a q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \Big].$$

So Q-learning can be written as:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \bigg[ q_t(s_t, a_t) - \Big( r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} q_t(s_{t+1}, a) \Big) \bigg], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t). \end{aligned}$$
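Here is a minimal Python sketch of Q-learning on a hypothetical 4-cell corridor (states 0..3, actions left/right, reward 1 for reaching state 3, step cost -0.1; the environment and hyperparameters are illustrative assumptions). Note that the behavior policy is uniformly random, yet the greedy policy extracted from the learned q-values is still optimal, which previews the off-policy discussion below.

```python
import random

random.seed(0)
N, GOAL = 4, 3
gamma, alpha = 0.9, 0.1
q = [[0.0, 0.0] for _ in range(N)]   # q[s][a], a = 0 (left) or 1 (right)

def step(s, a):
    """Deterministic corridor dynamics: reward 1 at the goal, -0.1 otherwise."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else -0.1), s2 == GOAL

for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(2)          # behavior policy: uniformly random
        s2, r, done = step(s, a)
        # Q-learning bootstraps with max_a q(s', a): the greedy target policy
        target = r + (0.0 if done else gamma * max(q[s2]))
        q[s][a] -= alpha * (q[s][a] - target)
        s = s2

greedy = [q[s].index(max(q[s])) for s in range(GOAL)]
print(greedy)   # the greedy policy should point right in every non-goal cell
```

Unlike Sarsa, the update never asks which action the behavior policy will take next; it only needs the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$.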

Proof of the BOE in q-value form

By the definition of expectation, we have

$$q(s, a) = \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) \max_{a' \in \mathcal{A}(s')} q(s', a').$$

Taking the maximum of both sides of the equation gives

$$\max_{a \in \mathcal{A}(s)} q(s, a) = \max_{a \in \mathcal{A}(s)} \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) \max_{a' \in \mathcal{A}(s')} q(s', a') \right].$$

By denoting $v(s) \doteq \max_{a \in \mathcal{A}(s)} q(s, a)$, we can rewrite the above equation as

$$\begin{aligned} v(s) &= \max_{a \in \mathcal{A}(s)} \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v(s') \right] \\ &= \max_\pi \sum_{a \in \mathcal{A}(s)} \pi(a \mid s) \left[ \sum_r p(r \mid s, a) r + \gamma \sum_{s'} p(s' \mid s, a) v(s') \right], \end{aligned}$$

which is clearly the BOE.

On-policy and off-policy

Two policies exist in any reinforcement learning task: a behavior policy and a target policy. The behavior policy is the one used to generate experience samples. The target policy is the one that is constantly updated to converge to an optimal policy. When the behavior policy is the same as the target policy, such a learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.

The advantage of off-policy learning is that it can learn optimal policies based on the experience samples generated by other policies, which may be, for example, a policy executed by a human operator. As an important case, the behavior policy can be selected to be exploratory. For example, if we would like to estimate the action values of all state-action pairs, we must generate episodes visiting every state-action pair sufficiently many times. Although Sarsa uses $\epsilon$-greedy policies to maintain a certain exploration ability, the value of $\epsilon$ is usually small and hence the exploration ability is limited. By contrast, if we use a policy with a strong exploration ability to generate episodes and then use off-policy learning to learn optimal policies, the learning efficiency is significantly increased.

Clearly, Sarsa and MC learning are both on-policy: they directly update the policy after obtaining new state/action values and then use the new policy to generate the next experience samples.

Q-learning is off-policy in essence, because its target policy is updated via a purely greedy max while its behavior policy can be $\epsilon$-greedy or any other exploratory policy. It becomes on-policy only when the behavior policy is itself the greedy policy, but then the exploration ability, and hence the performance, degrades.
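In code, the on-policy/off-policy distinction between Sarsa and Q-learning reduces to how the TD target bootstraps. A tiny Python sketch with made-up numbers:

```python
gamma = 0.9
r = 1.0
q_next = {0: 0.2, 1: 0.8}   # hypothetical q_t(s_{t+1}, a) for two actions
a_taken = 0                  # suppose the exploratory behavior policy sampled action 0

# Sarsa (on-policy): bootstrap with the action the behavior policy actually took
sarsa_target = r + gamma * q_next[a_taken]

# Q-learning (off-policy): bootstrap with the greedy target policy,
# regardless of which action the behavior policy took
q_learning_target = r + gamma * max(q_next.values())

print(round(sarsa_target, 2), round(q_learning_target, 2))  # 1.18 1.72
```

Because the Q-learning target never depends on the sampled next action, any sufficiently exploratory behavior policy can supply the data.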

*Figure: Q-learning (on-policy version).*

*Figure: Q-learning (off-policy version).*

Another concept that may be confused with on-policy/off-policy is online/offline. Online learning refers to the case where the agent updates the values and policies while interacting with the environment. Offline learning refers to the case where the agent updates the values and policies using pre-collected experience data without interacting with the environment. If an algorithm is on-policy, it can be implemented in an online fashion but cannot use pre-collected data generated by other policies. If an algorithm is off-policy, it can be implemented in either an online or an offline fashion.

Summary

| Algorithm | Expression of the TD target $\bar{q}_t$ |
| --- | --- |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ |

| Algorithm | Equation to be solved |
| --- | --- |
| Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| $n$-step Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Q-learning | BOE: $q(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a]$ |
| Monte Carlo | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a]$ |

*Table: A unified point of view of TD algorithms.*
