In Chapter 8, we introduced value function approximation, which replaces tabular representations of state/action values with functions. Similarly, in Chapter 9 we used functions to represent policies instead of tables and turned to policy-based methods. In this chapter, we combine the two: both values and policies are represented by functions, and the resulting methods incorporate both policy-based and value-based ideas.
Here, an “actor” refers to a policy update step. It is called an actor because the actions are taken by following the policy. A “critic” refers to a value update step. It is called a critic because it criticizes the actor by evaluating the values of the actor's policy. From another point of view, actor-critic methods are still policy gradient algorithms: they can be obtained by extending the policy gradient algorithm introduced in Chapter 9.
Since actor-critic is still a policy gradient method, it still needs a metric to optimize. Recall the idea of policy gradient from the last chapter: we select a scalar metric $J(\theta)$, which can be $\bar{v}_\pi$ or $\bar{r}_\pi$. The gradient-ascent algorithm maximizing $J(\theta)$ is

$$\theta_{t+1} = \theta_t + \alpha\,\mathbb{E}\!\left[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\right],\tag{1}$$

and its stochastic version is

$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t).\tag{2}$$
If $q_t(s_t,a_t)$ is estimated by Monte Carlo learning, the corresponding algorithm is called REINFORCE or Monte Carlo policy gradient, which was already introduced in Chapter 9 (but without value function approximation).
If qt(st,at) is estimated by TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be obtained by incorporating TD-based value estimation into policy gradient methods.
QAC
The critic corresponds to the value update step via the Sarsa algorithm. The action values are represented by a parameterized function q(s,a,w). The actor corresponds to the policy update step in (2).
Introducing a baseline $b(s)$ does not change the expectation:

$$\mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big] = \mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\,\big(q_\pi(S,A) - b(S)\big)\big],$$

because $\mathbb{E}\big[\nabla_\theta \ln \pi(A|S,\theta_t)\, b(S)\big] = 0$. Hence the new expression has the same expectation as our previous gradient. Our next step is to choose a proper $b(s)$ to reduce the variance when we use samples to approximate the true gradient. Let
$$X(S,A) \doteq \nabla_\theta \ln \pi(A|S,\theta_t)\big[q_\pi(S,A) - b(S)\big].\tag{4}$$
Then, the true gradient is $\mathbb{E}[X(S,A)]$. Since we use a stochastic sample $x$ to approximate $\mathbb{E}[X]$, it is favorable for the variance $\mathrm{var}(X)$ to be small. For example, if $\mathrm{var}(X)$ is close to zero, any sample $x$ accurately approximates $\mathbb{E}[X]$. On the contrary, if $\mathrm{var}(X)$ is large, the value of a sample may be far from $\mathbb{E}[X]$. Although $\mathbb{E}[X]$ is invariant to the baseline, the variance $\mathrm{var}(X)$ is not. Our goal is to design a good baseline that minimizes $\mathrm{var}(X)$. In REINFORCE and QAC, we set $b = 0$, which is not guaranteed to be a good baseline. It can be shown that the optimal baseline minimizing $\mathrm{var}(X)$ is

$$b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2\big]},\qquad s\in\mathcal{S}.\tag{5}$$
Although the baseline in (5) is optimal, it is too complex to be useful in practice. If the weight $\|\nabla_\theta \ln \pi(A|s,\theta_t)\|^2$ is removed from (5), we obtain a suboptimal baseline with a concise expression:
$$b^\dagger(s) = \mathbb{E}_{A\sim\pi}\big[q_\pi(s,A)\big] = v_\pi(s),\qquad s\in\mathcal{S}.$$
This suboptimal baseline is indeed the state value!
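A quick numerical check of this claim (the single-state bandit below, with its action probabilities and action values, is made up for illustration): subtracting $v_\pi(s)$ leaves the expectation of $X$ unchanged but can shrink its variance dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.3, 0.7])        # pi(a|s) for two actions (invented)
q = np.array([10.0, 11.0])             # q_pi(s, a) (invented)
v = pi_probs @ q                       # v_pi(s) = E_{A~pi}[q_pi(s, A)]

def x_samples(b, n=100_000):
    a = rng.choice(2, size=n, p=pi_probs)
    # one component of grad_theta ln pi for a softmax policy:
    # d/d(theta_0) ln pi(a) = 1{a = 0} - pi(0)
    g = (a == 0).astype(float) - pi_probs[0]
    return g * (q[a] - b)

x_no_base, x_vpi = x_samples(0.0), x_samples(v)
print(x_no_base.mean(), x_vpi.mean())  # nearly equal: the baseline does not bias E[X]
print(x_no_base.var(), x_vpi.var())    # variance drops sharply with b = v_pi
```

The two sample means agree, while the variance with $b = v_\pi(s)$ is orders of magnitude smaller, which is exactly why a good baseline makes stochastic gradient samples more reliable.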
When b(s)=vπ(s), the gradient-ascent algorithm in (1) becomes
$$\theta_{t+1} = \theta_t + \alpha\,\mathbb{E}\!\left[\nabla_\theta \ln \pi(A|S,\theta_t)\big(q_\pi(S,A) - v_\pi(S)\big)\right],\tag{7}$$

where

$$\delta_\pi(s,a) \doteq q_\pi(s,a) - v_\pi(s)$$

is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s) = \sum_{a\in\mathcal{A}} \pi(a|s)\, q_\pi(s,a)$ is the mean of the action values, so $\delta_\pi(s,a) > 0$ means that the corresponding action has a greater value than the mean. The stochastic version of (7) is
$$\theta_{t+1} = \theta_t + \alpha\,\nabla_\theta \ln \pi(a_t|s_t,\theta_t)\big(q_t(s_t,a_t) - v_t(s_t)\big),\tag{8}$$

where $s_t, a_t$ are samples of $S, A$ at time $t$. Here, $q_t(s_t,a_t)$ and $v_t(s_t)$ are approximations of $q_{\pi(\theta_t)}(s_t,a_t)$ and $v_{\pi(\theta_t)}(s_t)$, respectively.
The algorithm in (8) updates the policy based on the relative value of qt with respect to vt rather than the absolute value of qt. This is intuitively reasonable because, when we attempt to select an action at a state, we only care about which action has the greatest value relative to the others. This can be further interpreted by:
The step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$ used in Chapter 9, which is more reasonable. It can still balance exploration and exploitation well.
If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte Carlo learning, the algorithm in (8) is called REINFORCE with a baseline.
If qt(st,at) and vt(st) are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C).
It should be noted that the advantage function in this implementation is approximated by the TD error:

$$q_t(s_t,a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t.$$
Using the TD error in this way has two merits. First, we only need a single neural network to represent $v_\pi(s)$; otherwise, with $\delta_t = q_t(s_t,a_t) - v_t(s_t)$, we would need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$, respectively. Second, the same TD error can be reused in both the actor and the critic updates.
When we use the TD error, the algorithm may also be called TD actor-critic. In addition, it is notable that the policy π(θt) is stochastic and hence exploratory. Therefore, it can be directly used to generate experience samples without relying on techniques such as ϵ-greedy.
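A minimal sketch of TD actor-critic along these lines (the two-state MDP, tabular critic, and step sizes below are invented for illustration): the critic stores only state values, and the single TD error $\delta_t$ drives both the critic and the actor updates.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros((2, 2))   # actor parameters (softmax policy)
v = np.zeros(2)            # critic: state values only, no q-table needed
alpha_theta, alpha_v, gamma = 0.05, 0.1, 0.9

def pi(s):
    z = np.exp(theta[s] - theta[s].max())   # softmax policy pi(.|s, theta)
    return z / z.sum()

s = 0
for t in range(2000):
    a = rng.choice(2, p=pi(s))
    s_next = s if a == 0 else 1 - s    # toy dynamics: action 0 stays, action 1 flips
    r = float(s_next == 1)             # reward 1 for reaching state 1
    delta = r + gamma * v[s_next] - v[s]   # TD error = advantage estimate
    v[s] += alpha_v * delta                # critic update, reusing delta
    grad_log = -pi(s)
    grad_log[a] += 1.0                     # grad_theta ln pi(a|s, theta)
    theta[s] = theta[s] + alpha_theta * grad_log * delta   # actor update, reusing delta
    s = s_next

print(pi(0), pi(1))  # learned action preferences at each state
```

Because the policy stays stochastic throughout, the same policy generates the experience, so no $\epsilon$-greedy exploration is needed.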
Consider a random variable $X \in \mathcal{X}$. Suppose that $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Suppose that we have some i.i.d. samples $\{x_i\}_{i=1}^n$.
First, if the samples $\{x_i\}_{i=1}^n$ are generated by following $p_0$, then the average $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ can be used to approximate $\mathbb{E}_{X\sim p_0}[X]$, because $\bar{x}$ is an unbiased estimate of $\mathbb{E}_{X\sim p_0}[X]$ and the estimation variance converges to zero as $n \to \infty$ (see the law of large numbers in Box 5.1 for more information).
Second, consider a new scenario where the samples $\{x_i\}_{i=1}^n$ are generated not by $p_0$ but by another distribution $p_1$. Can we still use these samples to approximate $\mathbb{E}_{X\sim p_0}[X]$? The answer is yes. However, we can no longer use $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, since $\bar{x} \approx \mathbb{E}_{X\sim p_1}[X]$ rather than $\mathbb{E}_{X\sim p_0}[X]$.
In the second scenario, $\mathbb{E}_{X\sim p_0}[X]$ can be approximated based on the importance sampling technique. In particular, $\mathbb{E}_{X\sim p_0}[X]$ satisfies

$$\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}} p_0(x)\,x = \sum_{x\in\mathcal{X}} p_1(x)\,\frac{p_0(x)}{p_1(x)}\,x = \mathbb{E}_{X\sim p_1}\!\left[\frac{p_0(X)}{p_1(X)}\, X\right].\tag{10}$$
Equation (10) suggests that $\mathbb{E}_{X\sim p_0}[X]$ can be approximated by a weighted average of the $x_i$. Here, $\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight. Why is this called importance sampling? If a sample $x_i$ has high probability under $p_0$ but low probability under $p_1$, it represents behavior that matters a lot under $p_0$ yet rarely appears among our samples. Such a hard-won sample is important, so we give it a large weight.
The reason we do not use $p_0$ directly is that its expression may be too complex to compute the expectation analytically, for example when it is represented by a neural network. By contrast, (10) merely requires the values of $p_0(x_i)$ at some samples rather than the full expression, and is thus much easier to implement in practice.
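A small numerical illustration of (10) (the two distributions below are made up): all samples are drawn under $p_1$, yet weighting each sample by $p_0(x_i)/p_1(x_i)$ recovers the expectation under $p_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.0, 1.0])
p0 = np.array([0.8, 0.2])              # target distribution (invented)
p1 = np.array([0.2, 0.8])              # behavior/sampling distribution (invented)

x = rng.choice(values, size=100_000, p=p1)   # samples follow p1, not p0
w = np.where(x == values[0], p0[0] / p1[0], p0[1] / p1[1])   # importance weights

plain_avg = (1 / len(x)) * x.sum()           # approximates E_{p1}[X] = 0.8 (wrong target)
weighted_avg = (1 / len(x)) * (w * x).sum()  # approximates E_{p0}[X] = 0.2
print(plain_avg, weighted_avg)
```

The plain average lands near $0.8 = \mathbb{E}_{p_1}[X]$, while the weighted average lands near $0.2 = \mathbb{E}_{p_0}[X]$, as (10) predicts.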
You can see the illustrative example provided in the tutorial to better understand it.
Like the previous on-policy case, we need to derive the policy gradient in the off-policy case.
Suppose β is the behavior policy that generates experience samples. Our aim is to use these samples to update a target policy π that can maximize the metric:
$$J(\theta) = \sum_{s\in\mathcal{S}} d_\beta(s)\, v_\pi(s) = \mathbb{E}_{S\sim d_\beta}\big[v_\pi(S)\big],$$
where $d_\beta$ is the stationary distribution under the behavior policy $\beta$. We then have the following theorem:
Theorem 10.1 (Off-policy policy gradient theorem)
In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is

$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\,A\sim\beta}\!\left[\frac{\pi(A|S,\theta)}{\beta(A|S)}\,\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)\right],$$

where the state distribution is $\rho(s) \doteq \sum_{s'\in\mathcal{S}} d_\beta(s')\Pr{}_\pi(s|s')$, and $\Pr_\pi(s|s') = \sum_{k=0}^{\infty} \gamma^k [P_\pi^k]_{s's} = [(I-\gamma P_\pi)^{-1}]_{s's}$ is the discounted total probability of transitioning from $s'$ to $s$ under policy $\pi$.
Compared with the on-policy case, the off-policy version adds the importance weight, since the behavior policy is no longer the same as the target policy. For the proof, see the book.
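A quick numerical check of this point (single state, two actions; all numbers invented): sampling actions from a behavior policy $\beta$ and weighting each sample by $\pi/\beta$ yields the same gradient estimate as sampling from $\pi$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_probs = np.array([0.7, 0.3])        # target policy pi(a|s, theta) (invented)
beta = np.array([0.5, 0.5])            # behavior policy (invented)
q = np.array([2.0, 5.0])               # q_pi(s, a) (invented)

def grad_log_pi(a):
    # one component of the gradient for a softmax policy:
    # d/d(theta_0) ln pi(a) = 1{a = 0} - pi(0)
    return (a == 0).astype(float) - pi_probs[0]

n = 200_000
a_pi = rng.choice(2, size=n, p=pi_probs)            # on-policy samples
on_policy = (grad_log_pi(a_pi) * q[a_pi]).mean()

a_beta = rng.choice(2, size=n, p=beta)              # off-policy samples
rho = pi_probs[a_beta] / beta[a_beta]               # importance weights pi/beta
off_policy = (rho * grad_log_pi(a_beta) * q[a_beta]).mean()

print(on_policy, off_policy)  # the two estimates agree
```

Without the weight `rho`, the second estimator would target the wrong expectation; with it, both estimates converge to the same true gradient component.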
Up to now, the policies used in policy gradient methods have all been stochastic, since it is required that $\pi(a|s,\theta) > 0$ for every $(s,a)$ (a requirement of the logarithm). We can add a softmax function at the last layer of the neural network to achieve this. However, when the action space is continuous, we may instead let the network output a deterministic action.
The deterministic policy is specifically denoted as:
$$a = \mu(s,\theta) \doteq \mu(s),$$
where $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$. $\mu$ can be represented by, for example, a neural network with input $s$, output $a$, and parameter $\theta$. We may write $\mu(s,\theta)$ as $\mu(s)$ for short.
The policy gradient theorems introduced before are merely valid for stochastic policies. If the policy must be deterministic, we must derive a new policy gradient theorem.
$$\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\eta}\!\left[\nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\right],\tag{11}$$

where $\eta$ is a distribution of the states. This theorem summarizes the deterministic policy gradient results.
Unlike the stochastic case, the gradient in the deterministic case shown in (11) does not involve the action random variable A. As a result, when we use samples to approximate the true gradient, it is not required to sample actions. Therefore, the deterministic policy gradient method is off-policy.
Based on the gradient given in the theorem above, we can apply the gradient-ascent algorithm to maximize $J(\theta)$:
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\mathbb{E}_{S\sim\eta}\!\left[\nabla_\theta \mu(S)\,\big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)}\right].$$
The corresponding stochastic gradient-ascent algorithm is
$$\theta_{t+1} = \theta_t + \alpha_\theta\,\nabla_\theta \mu(s_t)\,\big(\nabla_a q_\mu(s_t,a)\big)\big|_{a=\mu(s_t)}.$$
It should be noted that this algorithm is off-policy, since the behavior policy $\beta$ may be different from $\mu$. First, the actor is off-policy; we already explained the reason when presenting the theorem. Second, the critic is also off-policy. Special attention must be paid to why the critic is off-policy yet does not require the importance sampling technique. In particular, the experience sample required by the critic is $(s_t, a_t, r_{t+1}, s_{t+1}, \tilde{a}_{t+1})$, where $\tilde{a}_{t+1} = \mu(s_{t+1})$. The generation of this sample involves two policies: the policy generating $a_t$ at $s_t$, and the policy generating $\tilde{a}_{t+1}$ at $s_{t+1}$. The first is the behavior policy, since $a_t$ is used to interact with the environment. The second must be $\mu$, because $\mu$ is the policy the critic aims to evaluate; hence, $\mu$ is the target policy. Note that $\tilde{a}_{t+1}$ is not used to interact with the environment at the next time step, so $\mu$ is not the behavior policy. Therefore, the critic is off-policy.

How should the function $q(s,a,w)$ be selected? The original work [74] that proposed the deterministic policy gradient method adopted linear functions: $q(s,a,w) = \phi^T(s,a)\,w$, where $\phi(s,a)$ is a feature vector. It is currently popular to represent $q(s,a,w)$ using neural networks, as suggested in the deep deterministic policy gradient (DDPG) method.
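The chain-rule structure of the stochastic update can be sketched on a toy one-dimensional problem. Everything below is invented for illustration: the policy form $\mu(s,\theta) = \theta s$, the closed-form $q_\mu(s,a) = -(a-2s)^2$ assumed known so that no critic is needed, and the behavior state distribution. A real implementation such as DDPG would instead learn $q(s,a,w)$ with a critic network.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, alpha = 0.0, 0.05

for t in range(500):
    s = rng.uniform(0.5, 1.5)          # states drawn from a behavior distribution
    a = theta * s                      # deterministic action a = mu(s, theta)
    dq_da = -2.0 * (a - 2.0 * s)       # (grad_a q_mu(s, a)) evaluated at a = mu(s)
    dmu_dtheta = s                     # grad_theta mu(s, theta) for mu = theta * s
    theta += alpha * dmu_dtheta * dq_da   # the stochastic gradient-ascent step

print(theta)  # approaches 2, the optimal policy parameter (a = 2s maximizes q)
```

Notice that no action is ever sampled from a distribution: the gradient involves only $\nabla_\theta\mu$ and $\nabla_a q_\mu$, which is why the method is naturally off-policy.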