In Chapter 8, we introduced value function approximation, which replaces tabular representations of state/action values with functions. Similarly, in Chapter 9 we used functions to represent policies instead of tables and turned to policy-based methods. In this chapter, we combine the two: both values and policies are represented by functions, and the resulting methods incorporate both policy-based and value-based ideas.
Here, an “actor” refers to a policy update step. It is called an actor because the actions are taken by following the policy. A “critic” refers to a value update step. It is called a critic because it criticizes the actor by evaluating the values of the actor's policy. From another point of view, actor-critic methods are still policy gradient algorithms: they can be obtained by extending the policy gradient algorithm introduced in Chapter 9.
Since actor-critic is still a policy gradient method, it still needs a metric to optimize. Recall the idea of policy gradient from the last chapter: we select a scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$. The gradient-ascent algorithm maximizing $J(\theta)$ is

$$\theta_{t+1} = \theta_t + \alpha\,\mathbb{E}\!\left[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\right],\tag{1}$$

and its stochastic version is

$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t).\tag{2}$$
If $q_t(s_t,a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which was already introduced in Chapter 9 (but without value function approximation).
If qt(st,at) is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods.
QAC
The critic corresponds to the value update step via the Sarsa algorithm. The action values are represented by a parameterized function q(s,a,w). The actor corresponds to the policy update step in (2).
Introducing a baseline $b(s)$ does not change the expectation:

$$\mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big] = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big(q_\pi(S,A) - b(S)\big)\big],$$

because $\mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\big] = 0$. Hence the new expression has the same expectation as our previous gradient. Our next step is to choose a proper $b(s)$ to reduce the variance when we use samples to approximate the true gradient. Let
$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S,\theta_t)\big[q_\pi(S,A) - b(S)\big].\tag{4}$$
Then, the true gradient is $\mathbb{E}[X(S,A)]$. Since we use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it is favorable for the variance $\mathrm{var}(X)$ to be small. For example, if $\mathrm{var}(X)$ is close to zero, any sample $x$ accurately approximates $\mathbb{E}[X]$. On the contrary, if $\mathrm{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$. Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\mathrm{var}(X)$ is not. Our goal is to design a good baseline that minimizes $\mathrm{var}(X)$. In REINFORCE and QAC, we set $b = 0$, which is not guaranteed to be a good baseline. It can be shown that the optimal baseline minimizing $\mathrm{var}(X)$ is

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\big]},\qquad s\in\mathcal{S}.\tag{5}$$
Although the baseline in (5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ is removed from (5), we obtain a suboptimal baseline with a concise expression:
$$b^\dagger(s) = \mathbb{E}_{A\sim\pi}\big[q_\pi(s,A)\big] = v_\pi(s),\qquad s\in\mathcal{S}.$$
This suboptimal baseline is indeed the state value!
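A quick numerical check of this claim (the single-state bandit below, with its action probabilities and action values, is made up for illustration): subtracting $v_\pi(s)$ leaves the expectation of $X$ unchanged but can shrink its variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.3, 0.7])        # pi(a|s) for two actions (invented)
q = np.array([10.0, 11.0])             # q_pi(s, a) (invented)
v = pi_probs @ q                       # v_pi(s) = E_{A~pi}[q_pi(s, A)]

def x_samples(b, n=100_000):
    a = rng.choice(2, size=n, p=pi_probs)
    # one component of grad_theta ln pi for a softmax policy:
    # d/d(theta_0) ln pi(a) = 1{a = 0} - pi(0)
    g = (a == 0).astype(float) - pi_probs[0]
    return g * (q[a] - b)

x_no_base, x_vpi = x_samples(0.0), x_samples(v)
print(x_no_base.mean(), x_vpi.mean())  # nearly equal: the baseline does not bias E[X]
print(x_no_base.var(), x_vpi.var())    # variance drops sharply with b = v_pi
```

The two sample means agree, while the variance with $b = v_\pi(s)$ is orders of magnitude smaller, which is exactly why a good baseline makes stochastic gradient samples more reliable.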
When b(s)=vπ(s), the gradient-ascent algorithm in (1) becomes
$$\theta_{t+1} = \theta_t + \alpha\,\mathbb{E}\!\left[\nabla_\theta \ln \pi(A|S,\theta_t)\big(q_\pi(S,A) - v_\pi(S)\big)\right],\tag{7}$$

where

$$\delta_\pi(s,a) \doteq q_\pi(s,a) - v_\pi(s)$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s)\, q_\pi(s,a)$ is the mean of the action values, so $\delta_\pi(s,a) > 0$ means that the corresponding action has a greater value than the mean. The stochastic version of (7) is
$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\big(q_t(s_t,a_t) - v_t(s_t)\big),\tag{8}$$

where $s_t, a_t$ are samples of $S, A$ at time $t$. Here, $q_t(s_t,a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t,a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively.
The algorithm in (8) updates the policy based on the relative value of qt with respect to vt rather than the absolute value of qt. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others. This can be further interpreted by:
The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$ used in Chapter 9, which is more reasonable. It can still balance exploration and exploitation well.
If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (8) is called REINFORCE with a baseline.
If qt(st,at) and vt(st) are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).
It should be noted that the advantage function in this implementation is approximated by the TD error:

$$q_t(s_t,a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t.$$
Using the TD error in this way has two merits. First, we only need a single neural network to represent $v_\pi(s)$; otherwise, with $\delta_t = q_t(s_t,a_t) - v_t(s_t)$, we would need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$, respectively. Second, the same TD error can be reused in both the actor and the critic updates.
When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy π(θt) is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as ϵ-greedy.
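A minimal sketch of TD actor-critic along these lines (the two-state MDP, tabular critic, and step sizes below are invented for illustration): the critic stores only state values, and the single TD error $\delta_t$ drives both the critic and the actor updates.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros((2, 2))   # actor parameters (softmax policy)
v = np.zeros(2)            # critic: state values only, no q-table needed
alpha_theta, alpha_v, gamma = 0.05, 0.1, 0.9

def pi(s):
    z = np.exp(theta[s] - theta[s].max())   # softmax policy pi(.|s, theta)
    return z / z.sum()

s = 0
for t in range(2000):
    a = rng.choice(2, p=pi(s))
    s_next = s if a == 0 else 1 - s    # toy dynamics: action 0 stays, action 1 flips
    r = float(s_next == 1)             # reward 1 for reaching state 1
    delta = r + gamma * v[s_next] - v[s]   # TD error = advantage estimate
    v[s] += alpha_v * delta                # critic update, reusing delta
    grad_log = -pi(s)
    grad_log[a] += 1.0                     # grad_theta ln pi(a|s, theta)
    theta[s] = theta[s] + alpha_theta * grad_log * delta   # actor update, reusing delta
    s = s_next

print(pi(0), pi(1))  # learned action preferences at each state
```

Because the policy stays stochastic throughout, the same policy generates the experience, so no $\epsilon$-greedy exploration is needed.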
Consider a random variable $X \in \mathcal{X}$. Suppose that $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Suppose that we have some i.i.d. samples $\{x_i\}_{i=1}^n$.
First, if the samples $\{x_i\}_{i=1}^n$ are generated by following $p_0$, then the average $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ can be used to approximate $\mathbb{E}_{X\sim p_0}[X]$, because $\bar{x}$ is an unbiased estimate of $\mathbb{E}_{X\sim p_0}[X]$ and the estimation variance converges to zero as $n \to \infty$ (see the law of large numbers in Box 5.1 for more information).
Second, consider a new scenario where the samples $\{x_i\}_{i=1}^n$ are generated not by $p_0$ but by another distribution $p_1$. Can we still use these samples to approximate $\mathbb{E}_{X\sim p_0}[X]$? The answer is yes. However, we can no longer use $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, since $\bar{x} \approx \mathbb{E}_{X\sim p_1}[X]$ rather than $\mathbb{E}_{X\sim p_0}[X]$.
In the second scenario, $\mathbb{E}_{X\sim p_0}[X]$ can be approximated based on the importance sampling technique. In particular, $\mathbb{E}_{X\sim p_0}[X]$ satisfies

$$\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}} p_0(x)\,x = \sum_{x\in\mathcal{X}} p_1(x)\,\frac{p_0(x)}{p_1(x)}\,x = \mathbb{E}_{X\sim p_1}\!\left[\frac{p_0(X)}{p_1(X)}\, X\right].\tag{10}$$
Equation (10) suggests that $\mathbb{E}_{X\sim p_0}[X]$ can be approximated by a weighted average of the $x_i$. Here, $\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight. Why is this called importance sampling? If a sample $x_i$ has high probability under $p_0$ but low probability under $p_1$, it represents behavior that matters a lot under $p_0$ yet rarely appears among our samples. Such a hard-won sample is important, so we give it a large weight.
The reason we do not use $p_0$ directly is that its expression may be too complex to compute the expectation analytically, for example when it is represented by a neural network. By contrast, (10) merely requires the values of $p_0(x_i)$ at some samples rather than the full expression, and is thus much easier to implement in practice.
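A small numerical illustration of (10) (the two distributions below are made up): all samples are drawn under $p_1$, yet weighting each sample by $p_0(x_i)/p_1(x_i)$ recovers the expectation under $p_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.0, 1.0])
p0 = np.array([0.8, 0.2])              # target distribution (invented)
p1 = np.array([0.2, 0.8])              # behavior/sampling distribution (invented)

x = rng.choice(values, size=100_000, p=p1)   # samples follow p1, not p0
w = np.where(x == values[0], p0[0] / p1[0], p0[1] / p1[1])   # importance weights

plain_avg = (1 / len(x)) * x.sum()           # approximates E_{p1}[X] = 0.8 (wrong target)
weighted_avg = (1 / len(x)) * (w * x).sum()  # approximates E_{p0}[X] = 0.2
print(plain_avg, weighted_avg)
```

The plain average lands near $0.8 = \mathbb{E}_{p_1}[X]$, while the weighted average lands near $0.2 = \mathbb{E}_{p_0}[X]$, as (10) predicts.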
You can see the illustrative example provided in the tutorial to better understand it.
Like the previous on-policy case, we need to derive the policy gradient in the off-policy case.
Suppose β is the behavior policy that generates experience samples. Our aim is to use these samples to update a target policy π that can maximize the metric:
$$J(\theta) = \sum_{s\in\mathcal{S}} d_\beta(s)\, v_\pi(s) = \mathbb{E}_{S\sim d_\beta}\big[v_\pi(S)\big],$$
where $d_\beta$ is the stationary distribution under the behavior policy $\beta$. We then have the following theorem:
Theorem 10.1 (Off-policy policy gradient theorem)
In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is

$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\,A\sim\beta}\!\left[\frac{\pi(A|S,\theta)}{\beta(A|S)}\,\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\right],$$

where the state distribution is $\rho(s) \doteq \sum_{s'\in\mathcal{S}} d_\beta(s')\Pr{}_\pi(s|s')$, and $\Pr_\pi(s|s') = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{s's} = [(I-\gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
Compared with the on-policy case, the off-policy version adds the importance weight, since the behavior policy is no longer the same as the target policy. For the proof, see the book.
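A quick numerical check of this point (single state, two actions; all numbers invented): sampling actions from a behavior policy $\beta$ and weighting each sample by $\pi/\beta$ yields the same gradient estimate as sampling from $\pi$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.7, 0.3])        # target policy pi(a|s, theta) (invented)
beta = np.array([0.5, 0.5])            # behavior policy (invented)
q = np.array([2.0, 5.0])               # q_pi(s, a) (invented)

def grad_log_pi(a):
    # one component of the gradient for a softmax policy:
    # d/d(theta_0) ln pi(a) = 1{a = 0} - pi(0)
    return (a == 0).astype(float) - pi_probs[0]

n = 200_000
a_pi = rng.choice(2, size=n, p=pi_probs)            # on-policy samples
on_policy = (grad_log_pi(a_pi) * q[a_pi]).mean()

a_beta = rng.choice(2, size=n, p=beta)              # off-policy samples
rho = pi_probs[a_beta] / beta[a_beta]               # importance weights pi/beta
off_policy = (rho * grad_log_pi(a_beta) * q[a_beta]).mean()

print(on_policy, off_policy)  # the two estimates agree
```

Without the weight `rho`, the second estimator would target the wrong expectation; with it, both estimates converge to the same true gradient component.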
Up to now, the policies used in policy gradient methods have all been stochastic, since it is required that $\pi(a|s,\theta) > 0$ for every $(s,a)$ (a requirement of the logarithm). We can add a softmax function at the last layer of the neural network to achieve this. However, when the action space is continuous, we may instead let the network output a deterministic action.
The deterministic policy is specifically denoted as:
$$a = \mu(s,\theta) \doteq \mu(s),$$
where $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$. $\mu$ can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$. We may write $\mu(s,\theta)$ as $\mu(s)$ for short.
The policy gradient theorems introduced before are merely valid for stochastic policies. If the policy must be deterministic, we must derive a new policy gradient theorem.
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta}\!\left[\nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\right],\tag{11}$$

where $\eta$ is a distribution of the states. This theorem summarizes the deterministic policy gradient results.
Unlike the stochastic case, the gradient in the deterministic case shown in (11) does not involve the action random variable A. As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy.
Based on the gradient given in the theorem above, we can apply the gradient-ascent algorithm to maximize $J(\theta)$:
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\mathbb{E}_{S\sim\eta}\!\left[\nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\right].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\nabla_\theta \mu(s_t)\,\big(\nabla_a q_\mu(s_t,a)\big)\big|_{a=\mu(s_t)}.$$
It should be noted that this algorithm is off-policy, since the behavior policy $\beta$ may be different from $\mu$. First, the actor is off-policy; we already explained the reason when presenting the theorem. Second, the critic is also off-policy. Special attention must be paid to why the critic is off-policy yet does not require the importance sampling technique. In particular, the experience sample required by the critic is $(s_t, a_t, r_{t+1}, s_{t+1}, \tilde{a}_{t+1})$, where $\tilde{a}_{t+1} = \mu(s_{t+1})$. The generation of this sample involves two policies: the policy generating $a_t$ at $s_t$, and the policy generating $\tilde{a}_{t+1}$ at $s_{t+1}$. The first is the behavior policy, since $a_t$ is used to interact with the environment. The second must be $\mu$, because $\mu$ is the policy the critic aims to evaluate; hence, $\mu$ is the target policy. Note that $\tilde{a}_{t+1}$ is not used to interact with the environment at the next time step, so $\mu$ is not the behavior policy. Therefore, the critic is off-policy.

How should the function $q(s,a,w)$ be selected? The original work [74] that proposed the deterministic policy gradient method adopted linear functions: $q(s,a,w) = \phi^T(s,a)\,w$, where $\phi(s,a)$ is a feature vector. It is currently popular to represent $q(s,a,w)$ using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method.
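The chain-rule structure of the stochastic update can be sketched on a toy one-dimensional problem. Everything below is invented for illustration: the policy form $\mu(s,\theta) = \theta s$, the closed-form $q_\mu(s,a) = -(a-2s)^2$ assumed known so that no critic is needed, and the behavior state distribution. A real implementation such as DDPG would instead learn $q(s,a,w)$ with a critic network.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 0.0, 0.05

for t in range(500):
    s = rng.uniform(0.5, 1.5)          # states drawn from a behavior distribution
    a = theta * s                      # deterministic action a = mu(s, theta)
    dq_da = -2.0 * (a - 2.0 * s)       # (grad_a q_mu(s, a)) evaluated at a = mu(s)
    dmu_dtheta = s                     # grad_theta mu(s, theta) for mu = theta * s
    theta += alpha * dmu_dtheta * dq_da   # the stochastic gradient-ascent step

print(theta)  # approaches 2, the optimal policy parameter (a = 2s maximizes q)
```

Notice that no action is ever sampled from a distribution: the gradient involves only $\nabla_\theta\mu$ and $\nabla_a q_\mu$, which is why the method is naturally off-policy.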