Chapter 7 Temporal-Difference Learning
In this chapter we first introduce TD learning, which refers to a broad family of algorithms that can solve the Bellman equation of a given policy without a model. The term "TD learning" in the first section refers specifically to a classic algorithm for estimating state values. We then introduce the other algorithms in this family in the following sections.
Basic TD learning algorithms
Recall the Robbins-Monro (RM) algorithm introduced in Chapter 6:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k),$$

where $\tilde{g}(w_k, \eta_k)$ is a noisy sample of the quantity that would otherwise be obtained through the model (e.g., the expectation in mean estimation, or the expected gradient in gradient descent).
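As a concrete instance, mean estimation with the step size $a_k = 1/k$ reduces the RM update to the incremental computation of a sample average. A minimal sketch under that assumed step-size choice:

```python
# RM-style incremental mean estimation: w_{k+1} = w_k - a_k * (w_k - x_k).
# With the step size a_k = 1/k (a standard choice), w_k tracks the running
# mean of the samples seen so far.
def rm_mean(samples):
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x)  # move w toward the new sample x
    return w

# rm_mean([1.0, 2.0, 3.0, 4.0]) converges to the sample mean 2.5
```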
In TD learning, our objective is to estimate the state value. Recall that in model-based methods, we derive state values by solving the Bellman equation (BE) of a given policy. Here we do not have a model (TD learning is model-free), so in essence TD learning is a special stochastic approximation algorithm for solving the BE. Recall that the definition of the state value is

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s].$$
We can write it in another form by definition:

$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a].$$
As we can see, the value inside the expectation is just the definition of the action value, so the state value can also be written as

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = \mathbb{E}_{A \sim \pi(\cdot \mid s)}[q_\pi(s, A)],$$
which we will see again in Chapter 10. From this perspective, the relationship between action values and state values is clear: the state value is the expectation of the action values under the policy.
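As a quick numerical check of this relationship (the policy and action values below are made-up illustration numbers):

```python
# One state with three actions: the state value is the policy-weighted
# average of the action values, v(s) = sum_a pi(a|s) * q(s, a).
pi = [0.5, 0.3, 0.2]   # hypothetical pi(a|s) for actions a0, a1, a2
q = [1.0, 2.0, 4.0]    # hypothetical q_pi(s, a)

v = sum(p * qa for p, qa in zip(pi, q))  # 0.5*1.0 + 0.3*2.0 + 0.2*4.0 = 1.9
```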
So our objective is to find the root of

$$g(v(s)) \doteq v(s) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = 0.$$

A sample of $R_{t+1} + \gamma v_\pi(S_{t+1})$ is $r_{t+1} + \gamma v_\pi(s_{t+1})$. Replacing the expectation with this sample in the RM algorithm finally gives the expression of the TD learning algorithm:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\big],$$
$$v_{t+1}(s) = v_t(s), \qquad \text{for all } s \neq s_t.$$
Here, $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$ is the value that the algorithm attempts to drive $v_t(s_t)$ toward, so it is called the TD target. The difference $\delta_t \doteq v_t(s_t) - \bar{v}_t$ between $v_t(s_t)$ and the TD target is called the TD error. Such a difference reflects the discrepancy between the two time steps $t$ and $t+1$, which is why this algorithm is called temporal-difference learning. From another perspective, the TD error can be interpreted as an innovation: it indicates the new information obtained from the experience sample $(s_t, r_{t+1}, s_{t+1})$.
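The update above can be sketched in a few lines; the state names, step size `alpha`, and discount `gamma` below are hypothetical illustration choices:

```python
# Minimal sketch of the TD(0) state-value update.
def td0_update(v, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move v[s] toward the TD target r + gamma * v[s_next]."""
    td_target = r + gamma * v[s_next]
    td_error = v[s] - td_target     # innovation from the sample (s, r, s_next)
    v[s] = v[s] - alpha * td_error  # only the visited state is updated
    return v

v = {"s0": 0.0, "s1": 1.0}
td0_update(v, "s0", r=1.0, s_next="s1")
# v["s0"] is now 0.0 - 0.1 * (0.0 - (1.0 + 0.9 * 1.0)) = 0.19
```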
Compared with MC learning, TD learning is incremental as it can update the state/action values immediately after receiving an experience sample, while MC learning is nonincremental as it must wait until an episode has been completely collected to calculate the discounted return backward.
The convergence of TD learning is guaranteed by the convergence of the RM algorithm; see the book for the proof.
Other TD learning algorithms
The TD algorithm can only estimate the state values of a given policy. To find optimal policies, we still need to further calculate the action values and then conduct policy improvement. In this section, we introduce the TD algorithms that can directly estimate action values.
Recall the pattern established above: all TD learning algorithms are based on the RM algorithm, and the RM update depends on the quantity we want to estimate, which here is the action value. After determining the equation that the action value satisfies, we can write down the corresponding update based on the RM algorithm. For TD learning of action values, we introduce three algorithms: Sarsa, n-step Sarsa, and Q-learning.
Sarsa
For Sarsa, we can replace state-value estimation with action-value estimation. The equation satisfied by the action value can be written as

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a].$$
This can be understood as follows: the q-value equals the immediate reward plus the discounted future reward obtained after reaching the next state. Since the next state is random, we can view the future-reward term as first fixing the next state $S_{t+1} = s'$ and taking the expectation of the action values over the actions in that state, which yields the state value $v_\pi(s')$. We then take the expectation over $S_{t+1}$ to obtain the expected future reward.
So the expression of Sarsa can be written as:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\big],$$
$$q_{t+1}(s, a) = q_t(s, a), \qquad \text{for all } (s, a) \neq (s_t, a_t).$$
Each Sarsa update uses the experience sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, which is where the name Sarsa (state-action-reward-state-action) comes from.
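A minimal tabular sketch of this update; the state/action names, `alpha`, and `gamma` are made-up illustration values:

```python
# Tabular Sarsa update from one (s, a, r, s', a') experience sample.
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Move q[(s, a)] toward the TD target r + gamma * q[(s', a')]."""
    td_target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
    return q

q = {("s0", "a0"): 0.0, ("s1", "a1"): 0.5}
sarsa_update(q, "s0", "a0", r=1.0, s_next="s1", a_next="a1")
# q[("s0", "a0")] is now 0.0 - 0.1 * (0.0 - (1.0 + 0.9 * 0.5)) = 0.145
```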

n-step Sarsa
This section introduces n-step Sarsa, an extension of Sarsa. We will see that Sarsa and MC learning are two extreme cases of n-step Sarsa. Recall that the definition of the action value is

$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a],$$

where $G_t$ is the discounted return satisfying

$$G_t = R_{t+1} + \gamma G_{t+1}.$$

In fact, $G_t$ can also be decomposed into different forms:

$$\begin{aligned}
G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}),\\
G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}),\\
&\;\;\vdots\\
G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}),\\
&\;\;\vdots\\
G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
\end{aligned}$$

It should be noted that $G_t = G_t^{(1)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$. Substituting different decompositions of $G_t$ into $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ results in different algorithms.
- When $n = 1$, we have

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a].$$

The corresponding stochastic approximation algorithm for solving this equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\big],$$
which is the Sarsa algorithm.
- When $n = \infty$, we have

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a].$$

The corresponding algorithm for solving this equation is

$$q(s_t, a_t) \leftarrow g_t \doteq r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots,$$

where $g_t$ is a sample of $G_t$. In fact, this is the MC learning algorithm, which approximates the action value of $(s_t, a_t)$ using the discounted return of an episode starting from $(s_t, a_t)$.
- For a general value of $n$, we have

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a].$$

The corresponding algorithm for solving the above equation is

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\big].$$
This algorithm is called n-step Sarsa.
In summary, $n$-step Sarsa is a more general algorithm because it becomes the (one-step) Sarsa algorithm when $n = 1$ and the MC learning algorithm when $n = \infty$ (by setting $\alpha_t = 1$). To implement the $n$-step Sarsa algorithm, we need the experience samples $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$. Since $(r_{t+n}, s_{t+n}, a_{t+n})$ has not been collected at time $t$, we have to wait until time $t+n$ to update the q-value of $(s_t, a_t)$.
To that end, the update in $n$-step Sarsa can be rewritten as

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t)\big[q_{t+n-1}(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\big],$$

where $q_{t+n-1}(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$ at time $t+n-1$. Since $n$-step Sarsa includes Sarsa and MC learning as two extreme cases, it is not surprising that the performance of $n$-step Sarsa is between that of Sarsa and MC learning. In particular, if $n$ is selected as a large number, $n$-step Sarsa is close to MC learning: the estimate has a relatively high variance but a small bias. If $n$ is selected to be small, $n$-step Sarsa is close to Sarsa: the estimate has a relatively large bias but a low variance. Finally, the $n$-step Sarsa algorithm presented here is merely used for policy evaluation. It must be combined with a policy improvement step to learn optimal policies. The implementation is similar to that of Sarsa and is omitted here.
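The $n$-step TD target above can be sketched as follows; `gamma` and the sample numbers are arbitrary illustration choices:

```python
# Compute the n-step Sarsa target
#   r_{t+1} + gamma*r_{t+2} + ... + gamma^n * q(s_{t+n}, a_{t+n})
# from the n collected rewards and the bootstrap q-value.
def n_step_target(rewards, q_boot, gamma=0.9):
    """rewards: [r_{t+1}, ..., r_{t+n}]; q_boot: current q(s_{t+n}, a_{t+n})."""
    g = 0.0
    for r in reversed(rewards):       # accumulate discounted rewards backward
        g = r + gamma * g
    return g + gamma ** len(rewards) * q_boot

# With n = 2, gamma = 0.9: 1.0 + 0.9*1.0 + 0.81*2.0 = 3.52
target = n_step_target([1.0, 1.0], q_boot=2.0)
```

With `len(rewards) == 1` this reduces to the one-step Sarsa target, and ignoring the bootstrap term over a whole episode gives the MC return.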
Q-learning
Sarsa-based algorithms solve the BE of a given policy and rely on a subsequent policy improvement step to iteratively derive the optimal policy. Q-learning instead directly solves the Bellman optimality equation (BOE). The BOE characterizes the policy that maximizes the state value, and the state value is maximized by choosing the action with the maximum action value. So the BOE in q-value form can be written as

$$q(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big].$$
So Q-learning can be written as:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \Big(r_{t+1} + \gamma \max_{a'} q_t(s_{t+1}, a')\Big)\Big].$$
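A minimal tabular sketch of this update; the action set, state names, `alpha`, and `gamma` are hypothetical:

```python
# Tabular Q-learning update: the target bootstraps with the greedy
# (maximum) q-value at the next state, regardless of the action actually
# taken there.
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    td_target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
    return q

actions = ("left", "right")
q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 1.0, ("s1", "right"): 3.0}
q_learning_update(q, "s0", "left", r=1.0, s_next="s1", actions=actions)
# q[("s0", "left")] is now 0.0 - 0.1 * (0.0 - (1.0 + 0.9 * 3.0)) = 0.37
```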
Proof of the BOE in q-value
By the definition of expectation, we have

$$q(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}[v(S_{t+1}) \mid S_t = s, A_t = a].$$

Taking the maximum of both sides of the equation over $a$ gives

$$\max_a q(s, a) = \max_a \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s, A_t = a] = v(s),$$

where the last equality is the BOE in state-value form. By denoting $v(S_{t+1}) = \max_{a'} q(S_{t+1}, a')$, we can rewrite the first equation as

$$q(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \,\Big|\, S_t = s, A_t = a\Big],$$

which is clearly the BOE in action-value form.
On-policy and off-policy
Two policies exist in any reinforcement learning task: a behavior policy and a target policy. The behavior policy is the one used to generate experience samples. The target policy is the one that is constantly updated to converge to an optimal policy. When the behavior policy is the same as the target policy, such a learning process is called on-policy. Otherwise, when they are different, the learning process is called off-policy.
The advantage of off-policy learning is that it can learn optimal policies based on the experience samples generated by other policies, which may be, for example, a policy executed by a human operator. As an important case, the behavior policy can be selected to be exploratory. For example, if we would like to estimate the action values of all state-action pairs, we must generate episodes visiting every state-action pair sufficiently many times. Although Sarsa uses $\epsilon$-greedy policies to maintain a certain exploration ability, the value of $\epsilon$ is usually small and hence the exploration ability is limited. By contrast, if we use a policy with a strong exploration ability to generate episodes and then use off-policy learning to learn optimal policies, the learning efficiency can be significantly increased.
Clearly, Sarsa and MC learning are both on-policy, as they directly update the policy after obtaining new state/action value estimates and then use the new policy to generate the next experience samples.
Q-learning is off-policy in essence, because its target policy is updated via a purely greedy maximization while its behavior policy can be $\epsilon$-greedy or another exploratory policy. Only when the behavior policy is also the purely greedy policy is Q-learning on-policy, but then the exploration ability, and hence the performance, degrades.


Another concept that may be confused with on-policy/off-policy is online/offline. Online learning refers to the case where the agent updates the values and policies while interacting with the environment. Offline learning refers to the case where the agent updates the values and policies using pre-collected experience data without interacting with the environment. If an algorithm is on-policy, then it can be implemented in an online fashion, but it cannot use pre-collected data generated by other policies. If an algorithm is off-policy, then it can be implemented in either an online or offline fashion.
Summary
| Algorithm | Expression of the TD target |
|---|---|
| Sarsa | $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Q-learning | $r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ |
| Algorithm | Equation to be solved |
|---|---|
| Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| $n$-step Sarsa | BE: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Q-learning | BOE: $q(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a]$ |
| Monte Carlo | BE: $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ |
Table: A unified point of view of TD algorithms.
