
Chapter 8 Value Function Methods

RyanLee_ljx · about 9 min · RL

In this chapter we move from the previous tabular representation of state/action values to a function representation. That is to say, we use a function to fit the true state/action value function. Such a function can be predefined, e.g., a linear function, or a neural network.

The reasons why we move from a tabular-based representation to a function-based one are:

  1. Storage: compared with a table, we only need to store a few parameters describing the function.
  2. Generalization: states and actions may be infinite (or continuous), so we cannot store every state/action value; we need a unified representation.
An illustration of the function approximation method. The x-axis and y-axis correspond to $s$ and $\hat{v}(s)$, respectively.

An illustration of the process for retrieving the value of $s$ when using the function approximation method.

As shown in the figure, we update the parameter vector $w$ to update the value estimate.

Function-based Value representation

A simple linear function approximation:

$$\hat{v}(s, w) = as + b = \underbrace{[s, 1]}_{\phi^T(s)} \underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{w} = \phi^T(s) w$$

here

  • $w$ is the parameter vector,
  • $\phi(s)$ is the feature vector of $s$,
  • $\hat{v}(s, w)$ is linear in $w$.

We can also fit the points using a second-order curve:

$$\hat{v}(s, w) = as^2 + bs + c = \underbrace{[s^2, s, 1]}_{\phi^T(s)} \underbrace{\begin{bmatrix} a \\ b \\ c \end{bmatrix}}_{w} = \phi^T(s) w.$$

In this case, the dimensions of $w$ and $\phi(s)$ increase, but the values may be fitted more accurately. Although $\hat{v}(s, w)$ is nonlinear in $s$, it is linear in $w$; the nonlinearity is contained in $\phi(s)$.

We can use even higher-order polynomial curves or other complex curves to fit the points, which approximates the true value function better but requires more parameters.
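The linear-in-$w$ structure above can be sketched in a few lines; the polynomial feature map and the weights below are invented purely for illustration:

```python
import numpy as np

# Hypothetical example: approximate v(s) on a one-dimensional state space
# with a polynomial feature vector phi(s) = [s^k, ..., s, 1].
def phi(s, order):
    """Polynomial feature vector of a scalar state s."""
    return np.array([s ** p for p in range(order, -1, -1)])

def v_hat(s, w, order):
    """Linear-in-w value estimate: v_hat(s, w) = phi(s)^T w."""
    return phi(s, order) @ w

w = np.array([2.0, -1.0, 0.5])   # second-order curve 2s^2 - s + 0.5
print(v_hat(3.0, w, order=2))    # 2*9 - 3 + 0.5 = 15.5
```

Raising `order` adds entries to both $\phi(s)$ and $w$ while keeping $\hat{v}$ linear in $w$, which is exactly why the gradient computations later stay simple.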

TD learning of state values based on function approximation

Objective

As we are trying to fit the value function, our optimization objective is:

$$J(w) = \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w))^2],$$

here we involve the random variable $S$. If the states are uniformly distributed, we have:

$$J(w) = \frac{1}{n} \sum_{s \in \mathcal{S}} (v_{\pi}(s) - \hat{v}(s, w))^2,$$

however, we can write it in a more general form:

$$J(w) = \sum_{s \in \mathcal{S}} d_{\pi}(s) (v_{\pi}(s) - \hat{v}(s, w))^2,$$

here $d_{\pi}(s)$ is called the stationary distribution of the Markov process under policy $\pi$. That is, the probability that the agent visits $s$ after a long period of time is $d_\pi(s)$. By definition, $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.

It is notable that the value of $d_\pi(s)$ is nontrivial to obtain because it requires knowing the state transition probability matrix $P_\pi$. Fortunately, we do not need to calculate the specific value of $d_\pi(s)$ to minimize this objective function, as shown in the next subsection. If the state space is continuous, we can replace the summations with integrals in the above equation.

Stationary distribution

Once a policy $\pi$ is given, the MDP becomes a Markov process, and its dynamics are governed by the probability transition matrix $P_\pi \in \mathbb{R}^{n \times n}$.

Let $d_0 \in \mathbb{R}^n$ be a vector representing the probability distribution of the states at the initial time step. The probability distribution vector after exactly $k$ steps, denoted as $d_k$, can be formulated in matrix-vector form as:

$$d_k^T = d_0^T P_\pi^k,$$

here, the $(i, j)$-th entry of $P_{\pi}^{k}$ is the probability of the agent transitioning from state $s_i$ to $s_j$ in exactly $k$ steps under policy $\pi$.

When analyzing the long-term behavior of the Markov process, it holds under certain conditions that the matrix $P_\pi^k$ converges as $k$ approaches infinity:

$$\lim_{k \to \infty} P_\pi^k = 1_n d_\pi^T$$

Here, $1_n = [1, \dots, 1]^T \in \mathbb{R}^n$, so $1_n d_\pi^T$ is a constant matrix whose rows are all equal to $d_\pi^T$. Substituting this limit back into the state evolution equation yields:

$$\lim_{k \to \infty} d_k^T = d_0^T \lim_{k \to \infty} P_\pi^k = d_0^T 1_n d_\pi^T = d_\pi^T$$

This derivation is valid because the sum of the initial probabilities equals one ($d_0^T 1_n = 1$). This equation demonstrates that the state distribution $d_k$ converges to a constant vector $d_\pi$, defined as the limiting distribution. Notably, this long-term probability distribution is independent of the initial distribution $d_0$.

The analytical value of $d_\pi$ can be calculated by leveraging the recursive nature of the transition process. By taking the limit of both sides of the iterative equation $d_k^T = d_{k-1}^T P_\pi$, we obtain:

$$\lim_{k \to \infty} d_k^T = \lim_{k \to \infty} d_{k-1}^T P_\pi$$

Since both distributions converge as $k \to \infty$, this directly resolves to the stationary equation:

$$d_\pi^T = d_\pi^T P_\pi$$

In linear algebra, a left eigenvector $x$ and its corresponding eigenvalue $\lambda$ of a matrix $A$ satisfy $x^T A = \lambda x^T$.

Comparing this to the steady-state equation $d_{\pi}^T P_{\pi} = 1 \cdot d_{\pi}^T$, it is clear that $d_{\pi}$ is precisely the left eigenvector of the probability transition matrix $P_{\pi}$ corresponding to the eigenvalue $\lambda = 1$. Transforming the formula to $(P_{\pi}^T - I) d_{\pi} = 0$, we find the eigenvector with eigenvalue 1 by solving this homogeneous linear system of equations.
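Both routes to $d_\pi$ (the limit of $d_0^T P_\pi^k$ and the left eigenvector for $\lambda = 1$) can be checked numerically; the two-state transition matrix below is made up for illustration:

```python
import numpy as np

# Hypothetical P_pi for a two-state Markov process under some policy.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Route 1: d_pi is the left eigenvector of P for eigenvalue 1,
# i.e. an (ordinary) eigenvector of P^T, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
d = d / d.sum()

# Route 2: power iteration, d_k^T = d_0^T P^k with any initial d_0.
d0 = np.array([1.0, 0.0])
dk = d0 @ np.linalg.matrix_power(P, 100)

print(d, dk)   # both converge to the same stationary distribution
```

For this matrix the stationary equation $d_\pi^T = d_\pi^T P_\pi$ gives $d_\pi = [5/6,\ 1/6]$, and the power iteration reaches it regardless of $d_0$, matching the claim that the limit is independent of the initial distribution.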

Optimization

We can adopt the gradient descent method to optimize the objective:

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k),$$

where

$$\begin{aligned} \nabla_w J(w_k) &= \nabla_w \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))^2] \\ &= \mathbb{E}[\nabla_w (v_{\pi}(S) - \hat{v}(S, w_k))^2] \\ &= 2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))(-\nabla_w \hat{v}(S, w_k))] \\ &= -2\mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))\nabla_w \hat{v}(S, w_k)]. \end{aligned}$$

Therefore, the gradient descent algorithm is

$$w_{k+1} = w_k + 2\alpha_k \mathbb{E}[(v_{\pi}(S) - \hat{v}(S, w_k))\nabla_w \hat{v}(S, w_k)],$$

where the coefficient $2$ before $\alpha_k$ can be merged into $\alpha_k$ without loss of generality. The algorithm requires calculating an expectation. In the spirit of stochastic gradient descent (SGD), we can replace the true gradient with a stochastic gradient. Then, we have

$$w_{t+1} = w_t + \alpha_t (v_{\pi}(s_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t),$$

where $s_t$ is a sample of $S$ at time $t$. This requires the true state value $v_{\pi}$, which is unknown and must be estimated. We can replace $v_{\pi}(s_t)$ with an approximation to make the algorithm implementable.

The following two methods can be used to do so.

  1. Monte Carlo method: Suppose that we have an episode $(s_0, r_1, s_1, r_2, \dots)$. Let $g_t$ be the discounted return starting from $s_t$. Then, $g_t$ can be used as an approximation of $v_{\pi}(s_t)$. The algorithm becomes:

$$w_{t+1} = w_t + \alpha_t (g_t - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)$$

This is the algorithm of Monte Carlo learning with function approximation.

  2. Temporal-difference method: In the spirit of TD learning, $r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t)$ can be used as an approximation of $v_{\pi}(s_t)$. The algorithm becomes

$$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)]\nabla_w \hat{v}(s_t, w_t).$$

This is the algorithm of TD learning with function approximation.

We can use the functions introduced in Function-based Value representation as the approximator of the true value function. If we use a linear approximation function, the gradient is $\nabla_w \hat{v}(s, w) = \phi(s)$. We can also use a neural network.
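The TD update above can be sketched end-to-end with a linear (here one-hot) feature map. The five-state random-walk chain, reward, and step sizes below are invented for illustration:

```python
import numpy as np

# TD learning of state values with linear function approximation on a toy
# continuing chain: the agent random-walks over states 0..4 (clamped at the
# ends) and receives reward 1 whenever it lands on the rightmost state.
rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.05

def phi(s):
    # one-hot features for simplicity; any feature map phi(s) works
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)
s = 2
for _ in range(5000):
    s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    # w <- w + alpha * [r + gamma*v_hat(s') - v_hat(s)] * grad v_hat(s)
    # for linear v_hat, grad_w v_hat(s, w) = phi(s)
    td_error = r + gamma * (phi(s_next) @ w) - phi(s) @ w
    w += alpha * td_error * phi(s)
    s = s_next
print(w)   # learned value estimates; states nearer the reward score higher
```

With one-hot features this reduces exactly to tabular TD(0); swapping `phi` for the polynomial features from earlier gives the genuinely approximate case.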

Theoretical analysis

For the convergence analysis, see the book.

The true state-value function $v_{\pi}$ strictly satisfies the Bellman equation: $v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$. When we introduce a function approximator (e.g., a linear function or a neural network with parameters $w$), we hope that the estimated value $\hat{v}(w)$ also satisfies this equation as much as possible. Therefore, the difference between the two sides of the equation is the Bellman error:

$$J_{BE}(w) = \|\hat{v}(w) - (r_\pi + \gamma P_\pi \hat{v}(w))\|^2_D \doteq \|\hat{v}(w) - T_\pi(\hat{v}(w))\|^2_D.$$

The Bellman error often cannot be optimized to 0 because the function approximators we use (such as low-dimensional linear features) have limited approximation ability: the result of applying the Bellman operator, $T_{\pi}(\hat{v}(w)) = r_\pi + \gamma P_\pi \hat{v}(w)$, often falls outside the space that our function approximator can represent.

This means that the objective actually optimized is not the one discussed in the previous section but the so-called projected Bellman error:

$$J_{PBE}(w) = \|\hat{v}(w) - M T_{\pi}(\hat{v}(w))\|_{D}^{2}.$$

here, $M$ is an orthogonal projection matrix. Its function is to geometrically project any vector back into the space that the approximator can represent. Since $\hat{v}(w)$ itself lies in the approximation space, and the target is also projected into this space, there must exist a parameter $w$ that reduces the projected error to 0. In fact, the solution that the classic TD learning algorithm converges to is precisely the one that minimizes the projected Bellman error.
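To make this concrete: for a linear approximator with feature matrix $\Phi$, the $D$-weighted projection is $M = \Phi(\Phi^T D \Phi)^{-1}\Phi^T D$ with $D = \mathrm{diag}(d_\pi)$, and TD's fixed point solves $\Phi^T D (T_\pi(\Phi w) - \Phi w) = 0$. The three-state chain below is an invented example that verifies $J_{PBE}$ vanishes at that fixed point:

```python
import numpy as np

# Toy setup (all numbers hypothetical): 3 states, 2 features per state.
n, gamma = 3, 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])        # P_pi
r = np.array([0.0, 0.0, 1.0])          # r_pi
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])           # feature matrix

# Stationary distribution d_pi: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
D = np.diag(d / d.sum())

# Orthogonal projection onto span(Phi) in the D-weighted norm.
M = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

# TD fixed point: solve Phi^T D (I - gamma*P) Phi w = Phi^T D r.
A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ D @ r
w = np.linalg.solve(A, b)

v_hat = Phi @ w
pbe = v_hat - M @ (r + gamma * P @ v_hat)   # projected Bellman error vector
print(np.round(pbe, 8))                     # essentially zero
```

The unprojected Bellman error $\hat v - T_\pi(\hat v)$ is generally nonzero here; only its projection onto the representable space is driven to zero, which is exactly the distinction the text draws.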

TD learning of action values based on function approximation

Here we use a function to approximate action values.

The Sarsa algorithm with function approximation can be readily obtained by replacing the state values with action values in the TD learning of state values based on function approximation section. In particular, suppose that $q_{\pi}(s, a)$ is approximated by $\hat{q}(s, a, w)$. Replacing $\hat{v}(s, w)$ by $\hat{q}(s, a, w)$ gives:

$$w_{t+1} = w_t + \alpha_t [r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t)] \nabla_w \hat{q}(s_t, a_t, w_t).$$

Pseudocode: Sarsa with function approximation

Tabular Q-learning can also be extended to the case of function approximation. The update rule is

$$w_{t+1} = w_t + \alpha_t \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t) \right] \nabla_w \hat{q}(s_t, a_t, w_t).$$
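A single such Q-learning update with a linear $\hat q$ (so $\nabla_w \hat{q}(s, a, w) = \phi(s, a)$) might be sketched as follows; the feature map, transition, and hyperparameters are invented for illustration:

```python
import numpy as np

def q_learning_step(w, phi, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """One Q-learning update with linear function approximation."""
    q_next = max(phi(s_next, b) @ w for b in actions)  # max_a q_hat(s', a, w)
    td_error = r + gamma * q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)            # grad of linear q_hat is phi

# toy usage: 2 states x 2 actions with one-hot state-action features
def phi(s, a):
    x = np.zeros(4)
    x[2 * s + a] = 1.0
    return x

w = np.zeros(4)
w = q_learning_step(w, phi, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(w)   # only the component for (s=0, a=1) is updated
```

With one-hot features this reproduces a tabular Q-learning step; the Sarsa variant would replace the `max` over actions with the sampled next action $a_{t+1}$.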

Deep Q-learning

Mathematically, deep Q-learning aims to minimize the following objective function:

$$J = \mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w) \right)^2\right],$$

where $(S, A, R, S')$ are random variables that denote a state, an action, the immediate reward, and the next state, respectively. This objective function can be viewed as the squared Bellman optimality error, because

$$q(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a \in \mathcal{A}(S_{t+1})} q(S_{t+1}, a) \,\Big|\, S_t = s, A_t = a \right], \quad \text{for all } s, a$$

is the Bellman optimality equation. Therefore, $R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)$ should be zero in the expectation sense when $\hat{q}(S, A, w)$ accurately approximates the optimal action values. This means that DQN aligns with Q-learning mathematically.

To minimize this objective function, we can also use the gradient descent algorithm. To that end, we need to calculate the gradient of $J$ with respect to $w$. Note that the parameter $w$ appears not only in $\hat{q}(S, A, w)$ but also in $y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$. As a result, it is nontrivial to calculate the gradient. For simplicity, it is assumed that the value of $w$ in $y$ is fixed (for a short period of time) so that the calculation of the gradient becomes much easier. In particular, we introduce two networks:

  1. Main network: represents $\hat{q}(s, a, w)$, the approximation of the action value.
  2. Target network: represents $\hat{q}(s, a, w_T)$, the approximation of the target action value (the optimal action value).

The objective function in this case becomes:

$$J = \mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w) \right)^2\right],$$

where $w_T$ is the target network's parameter. When $w_T$ is fixed, the gradient of $J$ is

$$\nabla_w J = -\mathbb{E}\left[\left( R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w) \right) \nabla_w \hat{q}(S, A, w)\right],$$

where some constant coefficients are omitted without loss of generality.

The main network is updated in every iteration. By contrast, the target network is set to be the same as the main network every certain number of iterations to satisfy the assumption that wTw_T is fixed when calculating the gradient.

As we introduce a deep neural network here, the previous incremental updates along the temporal sequence are no longer suitable (the reasons are explained below). Thus, we introduce a technique called experience replay. That is, after collecting some experience samples, we do not use them in the order they were collected. Instead, we store them in a dataset called the replay buffer. In particular, let $(s, a, r, s')$ be an experience sample and $\mathcal{B} \doteq \{(s, a, r, s')\}$ be the replay buffer. Every time we update the main network, we draw a mini-batch of experience samples from the replay buffer.

Why is experience replay necessary in deep Q-learning, and why must the replay follow a uniform distribution? The answer lies in the objective function of DQN. In particular, to well define the objective function, we must specify the probability distributions of $S, A, R, S'$. The distributions of $R$ and $S'$ are determined by the system model once $(S, A)$ is given. The simplest way to describe the distribution of the state-action pair $(S, A)$ is to assume it to be uniformly distributed. However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. It is necessary to break the correlation between the samples in the sequence to satisfy the assumption of uniform distribution. To do this, we can use the experience replay technique by uniformly drawing samples from the replay buffer. This is the mathematical reason why experience replay is necessary and why it must follow a uniform distribution.

Pseudocode: DQN
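The two DQN ingredients described above (uniform replay and a periodically synced target network) can be sketched as follows. The table-shaped stand-in for the q-network, the fake experience stream, and the hyperparameters are all invented; a real DQN would use a deep network and gradient steps:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores (s, a, r, s') tuples and samples them uniformly."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)
    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))
    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)   # uniform, order-breaking

buffer = ReplayBuffer()
for t in range(100):                 # fake experience stream, 4 states x 2 actions
    buffer.push(t % 4, t % 2, 1.0, (t + 1) % 4)

w_main = np.zeros((4, 2))            # stand-in for q_hat(s, a, w)
w_target = w_main.copy()             # stand-in for q_hat(s, a, w_T)
gamma, alpha, sync_every = 0.9, 0.1, 20

for step in range(100):
    for (s, a, r, s_next) in buffer.sample(8):
        y = r + gamma * w_target[s_next].max()     # target uses fixed w_T
        w_main[s, a] += alpha * (y - w_main[s, a]) # main net updated every step
    if step % sync_every == 0:
        w_target = w_main.copy()                   # periodic target sync
print(w_main)
```

Note the two time scales: `w_main` moves every mini-batch, while `w_target` only changes every `sync_every` steps, keeping the regression target $y$ fixed between syncs as the derivation above assumes.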

The reason why the earlier Sarsa and Q-learning algorithms (before becoming DQN) do not adopt experience replay is that we assume a linear approximation function. It can be proved that the linear case is convergent when the state $S$ follows the stationary distribution $d_{\pi}(s)$ discussed before. In this circumstance, when the algorithm collects data and updates online in the order of time steps $t, t+1, t+2$, the samples naturally follow the stationary distribution of the current behavior policy. Therefore, in the mathematical derivation, the algorithm directly uses the sample sequence naturally generated by the Markov chain to approximate the expectation, without needing to construct an additional distribution.

However, with deep neural networks, both Sarsa and Q-learning should use experience replay like DQN. (Deep Sarsa is hard to implement because it is on-policy, which contradicts experience replay: Sarsa's update relies on $\hat{q}(s_{t+1}, a_{t+1}, w_t)$, where $a_{t+1}$ must be the action actually sampled at $s_{t+1}$ from the current, latest policy. If you store data from hundreds of thousands of steps ago in a global replay buffer and then draw from it for training, the stored $a_{t+1}$ is no longer the action the current policy would take. Using a replay buffer thus directly undermines Sarsa's on-policy theoretical foundation, causing the algorithm to evaluate something other than the current policy.)
