
Chapter 8 Value Function Methods

RyanLee_ljx · about 9 min · RL

In this chapter we move from the previous tabular representation of state/action values to a function representation. That is to say, we use a function to fit the true state/action value function. Such a function can be predefined, e.g., a linear function, or a neural network.

The reasons why we move from a tabular-based representation to a function-based one are:

  1. Storage: compared with a table, we only need to store a few parameters describing the function.
  2. Generalization: states and actions may be infinite (or continuous), so we cannot store every state/action value; we need a unified representation.
An illustration of the function approximation method. The x-axis and y-axis correspond to $s$ and $\hat{v}(s)$, respectively.

An illustration of the process for retrieving the value of $s$ when using the function approximation method.

As shown in the figure, we update the parameter vector $w$ to update the value estimate.

Function-based Value representation

A simple linear function approximation:

$$\hat{v}(s, w) = as + b = \underbrace{[s, 1]}_{\phi^T(s)} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{w} = \phi^T(s) w$$

here

  • $w$ is the parameter vector,
  • $\phi(s)$ is the feature vector of $s$,
  • $\hat{v}(s, w)$ is linear in $w$.

We can also fit the points using a second-order curve:

$$\hat{v}(s, w) = as^2 + bs + c = \underbrace{[s^2, s, 1]}_{\phi^T(s)} \underbrace{\begin{bmatrix} a \\ b \\ c \end{bmatrix}}_{w} = \phi^T(s) w.$$

In this case, the dimensions of $w$ and $\phi(s)$ increase, but the values may be fitted more accurately. Although $\hat{v}(s, w)$ is nonlinear in $s$, it is linear in $w$; the nonlinearity is contained in $\phi(s)$.

We can use even higher-order polynomial curves or other complex curves to fit the points, which approximates the true value function better but requires more parameters.
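The linear-in-$w$ structure above can be sketched in a few lines; the polynomial feature map and the weights below are invented purely for illustration:

```python
import numpy as np

# Hypothetical example: approximate v(s) on a one-dimensional state space
# with a polynomial feature vector phi(s) = [s^k, ..., s, 1].
def phi(s, order):
    """Polynomial feature vector of a scalar state s."""
    return np.array([s ** p for p in range(order, -1, -1)])

def v_hat(s, w, order):
    """Linear-in-w value estimate: v_hat(s, w) = phi(s)^T w."""
    return phi(s, order) @ w

w = np.array([2.0, -1.0, 0.5])   # second-order curve 2s^2 - s + 0.5
print(v_hat(3.0, w, order=2))    # 2*9 - 3 + 0.5 = 15.5
```

Raising `order` adds entries to both $\phi(s)$ and $w$ while keeping $\hat{v}$ linear in $w$, which is exactly why the gradient computations later stay simple.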

TD learning of state values based on function approximation

Objective

As we are trying to fit the value function, our optimization objective is:

$$J(w) = \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w))^2],$$

here we involve the random variable $S$. If the states are uniformly distributed, we have:

$$J(w) = \frac{1}{n} \sum_{s \in \mathcal{S}} (v_{\pi}(s) - \hat{v}(s, w))^2,$$

however, we can write it in a more general form:

$$J(w) = \sum_{s \in \mathcal{S}} d_{\pi}(s) (v_{\pi}(s) - \hat{v}(s, w))^2,$$

here $d_{\pi}(s)$ is called the stationary distribution of the Markov process under policy $\pi$. That is, the probability that the agent visits $s$ after a long period of time is $d_\pi(s)$. By definition, $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.

It is notable that the value of $d_\pi(s)$ is nontrivial to obtain because it requires knowing the state transition probability matrix $P_\pi$. Fortunately, we do not need to calculate the specific value of $d_\pi(s)$ to minimize this objective function, as shown in the next subsection. If the state space is continuous, we can replace the summations with integrals in the above equation.

Stationary distribution

Once a policy $\pi$ is given, the MDP becomes a Markov process, and its dynamics are governed by the probability transition matrix $P_\pi \in \mathbb{R}^{n \times n}$.

Let $d_0 \in \mathbb{R}^n$ be a vector representing the probability distribution of the states at the initial time step. The probability distribution vector after exactly $k$ steps, denoted as $d_k$, can be formulated in matrix-vector form as:

$$d_k^T = d_0^T P_\pi^k,$$

here, the $(i, j)$-th entry of $P_{\pi}^{k}$ is the probability of the agent transitioning from state $s_i$ to $s_j$ in exactly $k$ steps under policy $\pi$.

When analyzing the long-term behavior of the Markov process, it holds under certain conditions that the matrix $P_\pi^k$ converges as $k$ approaches infinity:

$$\lim_{k \to \infty} P_\pi^k = 1_n d_\pi^T$$

Here, $1_n = [1, \dots, 1]^T \in \mathbb{R}^n$, so $1_n d_\pi^T$ is a constant matrix whose rows are all equal to $d_\pi^T$. Substituting this limit back into the state evolution equation yields:

$$\lim_{k \to \infty} d_k^T = d_0^T \lim_{k \to \infty} P_\pi^k = d_0^T 1_n d_\pi^T = d_\pi^T$$

This derivation is valid because the sum of the initial probabilities equals one ($d_0^T 1_n = 1$). This equation demonstrates that the state distribution $d_k$ converges to a constant vector $d_\pi$, defined as the limiting distribution. Notably, this long-term probability distribution is independent of the initial distribution $d_0$.

The analytical value of $d_\pi$ can be calculated by leveraging the recursive nature of the transition process. By taking the limit of both sides of the iterative equation $d_k^T = d_{k-1}^T P_\pi$, we obtain:

$$\lim_{k \to \infty} d_k^T = \lim_{k \to \infty} d_{k-1}^T P_\pi$$

Since both distributions converge as $k \to \infty$, this directly resolves to the stationary equation:

$$d_\pi^T = d_\pi^T P_\pi$$

In linear algebra, a left eigenvector $x$ and its corresponding eigenvalue $\lambda$ of a matrix $A$ satisfy $x^T A = \lambda x^T$.

Comparing this to the steady-state equation $d_{\pi}^T P_{\pi} = 1 \cdot d_{\pi}^T$, it is clear that $d_{\pi}$ is precisely the left eigenvector of the probability transition matrix $P_{\pi}$ corresponding to the eigenvalue $\lambda = 1$. Transforming the formula to $(P_{\pi}^T - I) d_{\pi} = 0$, we find the eigenvector with eigenvalue 1 by solving this homogeneous linear system of equations.
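Both routes to $d_\pi$ (the limit of $d_0^T P_\pi^k$ and the left eigenvector for $\lambda = 1$) can be checked numerically; the two-state transition matrix below is made up for illustration:

```python
import numpy as np

# Hypothetical P_pi for a two-state Markov process under some policy.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Route 1: d_pi is the left eigenvector of P for eigenvalue 1,
# i.e. an (ordinary) eigenvector of P^T, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
d = d / d.sum()

# Route 2: power iteration, d_k^T = d_0^T P^k with any initial d_0.
d0 = np.array([1.0, 0.0])
dk = d0 @ np.linalg.matrix_power(P, 100)

print(d, dk)   # both converge to the same stationary distribution
```

For this matrix the stationary equation $d_\pi^T = d_\pi^T P_\pi$ gives $d_\pi = [5/6,\ 1/6]$, and the power iteration reaches it regardless of $d_0$, matching the claim that the limit is independent of the initial distribution.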

Optimization

We can adopt the gradient descent method to optimize the objective:

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k),$$

where

$$\begin{aligned} \nabla_w J(w_k) &= \nabla_w \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))^2] \\ &= \mathbb{E}[\nabla_w (v_{\pi}(S) - \hat{v}(S, w_k))^2] \\ &= 2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))(-\nabla_w \hat{v}(S, w_k))] \\ &= -2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))\nabla_w \hat{v}(S, w_k)]. \end{aligned}$$

Therefore, the gradient descent algorithm is

$$w_{k+1} = w_k + 2\alpha_k \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))\nabla_w \hat{v}(S, w_k)],$$

where the coefficient $2$ before $\alpha_k$ can be merged into $\alpha_k$ without loss of generality. The algorithm requires calculating an expectation. In the spirit of stochastic gradient descent (SGD), we can replace the true gradient with a stochastic gradient. Then, we have

$$w_{t+1} = w_t + \alpha_t (v_{\pi}(s_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t),$$

where $s_t$ is a sample of $S$ at time $t$. This requires the true state value $v_{\pi}$, which is unknown and must be estimated. We can replace $v_{\pi}(s_t)$ with an approximation to make the algorithm implementable.

The following two methods can be used to do so.

  1. Monte Carlo method: Suppose that we have an episode $(s_0, r_1, s_1, r_2, \dots)$. Let $g_t$ be the discounted return starting from $s_t$. Then, $g_t$ can be used as an approximation of $v_{\pi}(s_t)$. The algorithm becomes:

$$w_{t+1} = w_t + \alpha_t (g_t - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)$$

This is the algorithm of Monte Carlo learning with function approximation.

  2. Temporal-difference method: In the spirit of TD learning, $r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t)$ can be used as an approximation of $v_{\pi}(s_t)$. The algorithm becomes

$$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)]\nabla_w \hat{v}(s_t, w_t).$$

This is the algorithm of TD learning with function approximation.

We can use the functions introduced in Function-based Value representation as the approximator of the true value function. If we use a linear approximation function, the gradient is $\nabla_w \hat{v}(s, w) = \phi(s)$. We can also use a neural network.
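The TD update above can be sketched end-to-end with a linear (here one-hot) feature map. The five-state random-walk chain, reward, and step sizes below are invented for illustration:

```python
import numpy as np

# TD learning of state values with linear function approximation on a toy
# continuing chain: the agent random-walks over states 0..4 (clamped at the
# ends) and receives reward 1 whenever it lands on the rightmost state.
rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.05

def phi(s):
    # one-hot features for simplicity; any feature map phi(s) works
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)
s = 2
for _ in range(5000):
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    # w <- w + alpha * [r + gamma*v_hat(s') - v_hat(s)] * grad v_hat(s)
    # for linear v_hat, grad_w v_hat(s, w) = phi(s)
    td_error = r + gamma * (phi(s_next) @ w) - phi(s) @ w
    w += alpha * td_error * phi(s)
    s = s_next
print(w)   # learned value estimates; states nearer the reward score higher
```

With one-hot features this reduces exactly to tabular TD(0); swapping `phi` for the polynomial features from earlier gives the genuinely approximate case.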

Theoretical analysis

For the convergence analysis, see the book.

The true state-value function $v_{\pi}$ strictly satisfies the Bellman equation: $v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$. When we introduce a function approximator (e.g., a linear function or a neural network with parameters $w$), we hope that the estimated value $\hat{v}(w)$ also satisfies this equation as much as possible. Therefore, the difference between the two sides of the equation is the Bellman error:

$$J_{BE}(w) = \|\hat{v}(w) - (r_\pi + \gamma P_\pi \hat{v}(w))\|^2_D \doteq \|\hat{v}(w) - T_\pi(\hat{v}(w))\|^2_D.$$

The Bellman error often cannot be optimized to 0 because the function approximators we use (such as low-dimensional linear features) have limited approximation ability: the result of applying the Bellman operator, $T_{\pi}(\hat{v}(w)) = r_\pi + \gamma P_\pi \hat{v}(w)$, often falls outside the space that our function approximator can represent.

This means that the objective actually optimized is not the one discussed in the previous section but the so-called projected Bellman error:

$$J_{PBE}(w) = \|\hat{v}(w) - M T_{\pi}(\hat{v}(w))\|_{D}^{2}.$$

here, $M$ is an orthogonal projection matrix. Its function is to geometrically project any vector back into the space that the approximator can represent. Since $\hat{v}(w)$ itself lies in the approximation space, and the target is also projected into this space, there must exist a parameter $w$ that reduces the projected error to 0. In fact, the solution that the classic TD learning algorithm converges to is precisely the one that minimizes the projected Bellman error.
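To make this concrete: for a linear approximator with feature matrix $\Phi$, the $D$-weighted projection is $M = \Phi(\Phi^T D \Phi)^{-1}\Phi^T D$ with $D = \mathrm{diag}(d_\pi)$, and TD's fixed point solves $\Phi^T D (T_\pi(\Phi w) - \Phi w) = 0$. The three-state chain below is an invented example that verifies $J_{PBE}$ vanishes at that fixed point:

```python
import numpy as np

# Toy setup (all numbers hypothetical): 3 states, 2 features per state.
n, gamma = 3, 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])        # P_pi
r = np.array([0.0, 0.0, 1.0])          # r_pi
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])           # feature matrix

# Stationary distribution d_pi: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
D = np.diag(d / d.sum())

# Orthogonal projection onto span(Phi) in the D-weighted norm.
M = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

# TD fixed point: solve Phi^T D (I - gamma*P) Phi w = Phi^T D r.
A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ D @ r
w = np.linalg.solve(A, b)

v_hat = Phi @ w
pbe = v_hat - M @ (r + gamma * P @ v_hat)   # projected Bellman error vector
print(np.round(pbe, 8))                     # essentially zero
```

The unprojected Bellman error $\hat v - T_\pi(\hat v)$ is generally nonzero here; only its projection onto the representable space is driven to zero, which is exactly the distinction the text draws.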

TD learning of action values based on function approximation

Here we use a function to approximate action values.

The Sarsa algorithm with function approximation can be readily obtained by replacing the state values with action values in the TD learning of state values based on function approximation section. In particular, suppose that $q_{\pi}(s, a)$ is approximated by $\hat{q}(s, a, w)$. Replacing $\hat{v}(s, w)$ by $\hat{q}(s, a, w)$ gives:

$$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t)] \nabla_w \hat{q}(s_t, a_t, w_t).$$

Pseudocode: Sarsa with function approximation

Tabular Q-learning can also be extended to the case of function approximation. The update rule is

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t).$$
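A single such Q-learning update with a linear $\hat q$ (so $\nabla_w \hat{q}(s, a, w) = \phi(s, a)$) might be sketched as follows; the feature map, transition, and hyperparameters are invented for illustration:

```python
import numpy as np

def q_learning_step(w, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """One Q-learning update with linear function approximation."""
    q_next = max(phi(s_next, b) @ w for b in actions)  # max_a q_hat(s', a, w)
    td_error = r + gamma * q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)            # grad of linear q_hat is phi

# toy usage: 2 states x 2 actions with one-hot state-action features
def phi(s, a):
    x = np.zeros(4)
    x[2 * s + a] = 1.0
    return x

w = np.zeros(4)
w = q_learning_step(w, phi, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(w)   # only the component for (s=0, a=1) is updated
```

With one-hot features this reproduces a tabular Q-learning step; the Sarsa variant would replace the `max` over actions with the sampled next action $a_{t+1}$.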

Deep Q-learning

Mathematically, deep Q-learning aims to minimize the following objective function:

$$J = \mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2\right],$$

where $(S, A, R, S')$ are random variables that denote a state, an action, the immediate reward, and the next state, respectively. This objective function can be viewed as the squared Bellman optimality error, because

$$q(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \right], \quad \text{for all } s, a$$

is the Bellman optimality equation. Therefore, $R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$ should be zero in the expectation sense when $\hat{q}(S, A, w)$ accurately approximates the optimal action values. This means that DQN aligns with Q-learning mathematically.

To minimize this objective function, we can also use the gradient descent algorithm. To that end, we need to calculate the gradient of $J$ with respect to $w$. Note that the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in $y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$. As a result, it is nontrivial to calculate the gradient. For simplicity, it is assumed that the value of $w$ in $y$ is fixed (for a short period of time) so that the calculation of the gradient becomes much easier. In particular, we introduce two networks:

  1. Main network: represents $\hat{q}(s, a, w)$, the approximation of the action value.
  2. Target network: represents $\hat{q}(s, a, w_T)$, the approximation of the target action value (the optimal action value).

The objective function in this case becomes:

$$J = \mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w) \right)^2\right],$$

where $w_T$ is the target network's parameter. When $w_T$ is fixed, the gradient of $J$ is

$$\nabla_w J = -\mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w) \right) \nabla_w \hat{q}(S, A, w)\right],$$

where some constant coefficients are omitted without loss of generality.

The main network is updated in every iteration. By contrast, the target network is set to be the same as the main network every certain number of iterations to satisfy the assumption that wTw_T is fixed when calculating the gradient.

As we introduce a deep neural network here, the previous incremental updates along the temporal sequence are no longer suitable (the reasons are explained below). Thus, we introduce a technique called experience replay. That is, after collecting some experience samples, we do not use them in the order they were collected. Instead, we store them in a dataset called the replay buffer. In particular, let $(s, a, r, s')$ be an experience sample and $\mathcal{B} \doteq \{(s, a, r, s')\}$ be the replay buffer. Every time we update the main network, we draw a mini-batch of experience samples from the replay buffer.

Why is experience replay necessary in deep Q-learning, and why must the replay follow a uniform distribution? The answer lies in the objective function of DQN. In particular, to well define the objective function, we must specify the probability distributions of $S, A, R, S'$. The distributions of $R$ and $S'$ are determined by the system model once $(S, A)$ is given. The simplest way to describe the distribution of the state-action pair $(S, A)$ is to assume it to be uniformly distributed. However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. It is necessary to break the correlation between the samples in the sequence to satisfy the assumption of uniform distribution. To do this, we can use the experience replay technique by uniformly drawing samples from the replay buffer. This is the mathematical reason why experience replay is necessary and why it must follow a uniform distribution.

Pseudocode: DQN
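The two DQN ingredients described above (uniform replay and a periodically synced target network) can be sketched as follows. The table-shaped stand-in for the q-network, the fake experience stream, and the hyperparameters are all invented; a real DQN would use a deep network and gradient steps:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores (s, a, r, s') tuples and samples them uniformly."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)   # uniform, order-breaking

buffer = ReplayBuffer()
for t in range(100):                 # fake experience stream, 4 states x 2 actions
    buffer.push(t % 4, t % 2, 1.0, (t + 1) % 4)

w_main = np.zeros((4, 2))            # stand-in for q_hat(s, a, w)
w_target = w_main.copy()             # stand-in for q_hat(s, a, w_T)
gamma, alpha, sync_every = 0.9, 0.1, 20

for step in range(100):
    for (s, a, r, s_next) in buffer.sample(8):
        y = r + gamma * w_target[s_next].max()     # target uses fixed w_T
        w_main[s, a] += alpha * (y - w_main[s, a]) # main net updated every step
    if step % sync_every == 0:
        w_target = w_main.copy()                   # periodic target sync
print(w_main)
```

Note the two time scales: `w_main` moves every mini-batch, while `w_target` only changes every `sync_every` steps, keeping the regression target $y$ fixed between syncs as the derivation above assumes.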

The reason why the earlier Sarsa and Q-learning algorithms (before becoming DQN) do not adopt experience replay is that we assume a linear approximation function. It can be proved that the linear case is convergent when the state $S$ follows the stationary distribution $d_{\pi}(s)$ discussed before. In this circumstance, when the algorithm collects data and updates online in the order of time steps $t, t+1, t+2$, the samples naturally follow the stationary distribution of the current behavior policy. Therefore, in the mathematical derivation, the algorithm directly uses the sample sequence naturally generated by the Markov chain to approximate the expectation, without needing to construct an additional distribution.

However, with deep neural networks, both Sarsa and Q-learning should use experience replay like DQN. (Deep Sarsa is hard to implement because it is on-policy, which contradicts experience replay: Sarsa's update relies on $\hat{q}(s_{t+1}, a_{t+1}, w_t)$, where $a_{t+1}$ must be the action actually sampled at $s_{t+1}$ from the current, latest policy. If you store data from hundreds of thousands of steps ago in a global replay buffer and then draw from it for training, the stored $a_{t+1}$ is no longer the action the current policy would take. Using a replay buffer thus directly undermines Sarsa's on-policy theoretical foundation, causing the algorithm to evaluate something other than the current policy.)
