
Chapter 2 Bellman Equation



In this chapter, we will introduce two key concepts and one important formula.

Revision

I recommend reading the motivating examples in the tutorial. Here I will skip that part and directly introduce the concepts.

Before delving into the new content, we need to revise some previous key concepts.

We have learned four key components in RL, namely the state $S$, action $A$, policy $\pi$, and reward $R$.

These four components are connected by the following process:

$$S_t \overset{A_t}{\rightarrow} R_{t+1}, S_{t+1}$$

This step is governed by the following probability distributions:

  1. Which action to take under the current state, given the policy

$S_t \rightarrow A_t$ is governed by $\pi(A_t|S_t)$.

  2. Which state the agent transitions to after taking the action

$S_t, A_t \rightarrow S_{t+1}$ is governed by $p(S_{t+1}|S_t, A_t)$.

  3. How much reward the agent receives

$S_t, A_t \rightarrow R_{t+1}$ is governed by $p(R_{t+1}|S_t, A_t)$.

Notice that $S_t$, $A_t$, and $R_{t+1}$ are all random variables. Thus, we can calculate their expectations.

Understanding these three distributions is crucial for learning the Bellman equation.
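To make the three distributions concrete, here is a minimal sketch of sampling one interaction step. The tabular arrays `pi`, `P`, and `R_dist` below are hypothetical placeholders, not taken from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
reward_values = np.array([0.0, 1.0])                          # possible reward values

# Hypothetical distributions (each distribution sums to 1):
pi = np.full((n_states, n_actions), 1.0 / n_actions)          # pi(a|s)
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # p(s'|s, a)
R_dist = np.full((n_states, n_actions, 2), 0.5)               # p(r|s, a)

def step(s):
    """Sample S_t -> A_t -> R_{t+1}, S_{t+1} using the three distributions."""
    a = rng.choice(n_actions, p=pi[s])                 # A_t ~ pi(. | S_t)
    s_next = rng.choice(n_states, p=P[s, a])           # S_{t+1} ~ p(. | S_t, A_t)
    r = reward_values[rng.choice(2, p=R_dist[s, a])]   # R_{t+1} ~ p(. | S_t, A_t)
    return a, r, s_next

print(step(0))
```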

State Value

Definition: the expectation of the discounted return obtained starting from a specific state.

Denoted as $v_{\pi}(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$.
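Recall that $G_t$ denotes the discounted return, i.e., the cumulative discounted reward collected along a trajectory:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$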

Remarks:

  • State value is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.

  • It is based on the policy $\pi$. For a different policy, the state value may be different.

  • It represents the “value” of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.

Q: What is the relationship between return and state value?

A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything, i.e. $\pi(a|s)$, $p(r|s,a)$, and $p(s'|s,a)$, is deterministic, then the state value is the same as the return.

Bellman Equation

We have seen that the state value describes the expectation of the discounted return obtained from the current state. Intuitively, the state value is the expected reward over all possible actions the agent may take in the current state. When taking an action, the agent obtains a reward immediately and moves into a new state. In that new state, the agent continues to 'benefit' by taking another action and receiving another reward.

So the state value consists of two parts, namely the immediate reward and the future reward.

The proof is omitted here. I recommend looking it up in the tutorial.

The Bellman Equation is:

$$v_{\pi}(s) = \underbrace{\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r}_{\text{mean of immediate reward}} + \underbrace{\gamma \sum_{a}\pi(a|s)\sum_{s'}p(s'|s,a)\,v_{\pi}(s')}_{\text{mean of future reward}}$$

A brief version:

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]$$

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\left[\mathbb{E}\left[R_{t+1}\mid S_t=s, A_t=a\right] + \gamma\,\mathbb{E}\left[G_{t+1}\mid S_t=s, A_t=a\right]\right]$$

A briefer version:

$$v_{\pi}(s) = r_{\pi}(s) + \gamma \sum_{s'} p_{\pi}(s'|s)\,v_{\pi}(s')$$

We can write it into matrix form:

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

where,

  • $v_{\pi} = \bigl[v_{\pi}(s_1), \dots, v_{\pi}(s_n)\bigr]^{\top} \in \mathbb{R}^n$

  • $r_{\pi} = \bigl[r_{\pi}(s_1), \dots, r_{\pi}(s_n)\bigr]^{\top} \in \mathbb{R}^n$

  • $P_{\pi} \in \mathbb{R}^{n\times n}$, where $[P_{\pi}]_{ij} = p_{\pi}(s_j|s_i)$, is the state-transition matrix.

Assuming there are four states, the matrix form is:

(figure: matrix form)
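Written out elementwise (a sketch that simply follows the definitions of $r_{\pi}$ and $P_{\pi}$ above), the four-state case reads:

$$
\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}
=
\begin{bmatrix} r_{\pi}(s_1)\\ r_{\pi}(s_2)\\ r_{\pi}(s_3)\\ r_{\pi}(s_4) \end{bmatrix}
+
\gamma
\begin{bmatrix}
p_{\pi}(s_1|s_1) & p_{\pi}(s_2|s_1) & p_{\pi}(s_3|s_1) & p_{\pi}(s_4|s_1)\\
p_{\pi}(s_1|s_2) & p_{\pi}(s_2|s_2) & p_{\pi}(s_3|s_2) & p_{\pi}(s_4|s_2)\\
p_{\pi}(s_1|s_3) & p_{\pi}(s_2|s_3) & p_{\pi}(s_3|s_3) & p_{\pi}(s_4|s_3)\\
p_{\pi}(s_1|s_4) & p_{\pi}(s_2|s_4) & p_{\pi}(s_3|s_4) & p_{\pi}(s_4|s_4)
\end{bmatrix}
\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}
$$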

Two examples:

(figure: illustrative example 1)
(figure: illustrative example 2)

The Bellman equation reflects the relationship between state values. It is a set of equations, one for each state, rather than a single equation.

We can obtain the state values by solving it.

Ways to solve for state values

Given a policy, finding the corresponding state values is called policy evaluation. It is a fundamental problem in RL and the foundation for finding better policies. Thus, it is important to understand how to solve the Bellman equation.

  1. Closed-form solution

From the matrix form:

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

we can get:

$$v_{\pi} = (\mathbb{I} - \gamma P_{\pi})^{-1}r_{\pi}$$

The closed-form solution is rarely used in practice because computing the matrix inverse is expensive when the number of states is large.
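Still, for a small example it is easy to compute. Here is a minimal sketch; the reward vector `r_pi` and transition matrix `P_pi` below are hypothetical, not taken from the tutorial, and the linear system is solved directly instead of forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
# Hypothetical 4-state example: r_pi[i] = r_pi(s_i), P_pi[i, j] = p_pi(s_j | s_i)
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Closed form: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)
```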

  2. Iterative solution

An iterative solution is:

$$v_{k+1} = r_{\pi} + \gamma P_{\pi}v_{k}$$

This algorithm produces a sequence $\{v_0, v_1, v_2, \dots\}$. We can show that

$$v_k \to v_{\pi} = (\mathbb{I} - \gamma P_{\pi})^{-1}r_{\pi}, \quad k \to \infty$$

We will introduce it in detail in the following chapter.
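For completeness, here is a minimal sketch of the iteration on the same hypothetical four-state example used in the closed-form sketch above:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])        # hypothetical r_pi(s_i)
P_pi = np.array([                            # hypothetical p_pi(s_j | s_i)
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

v_k = np.zeros(4)                            # arbitrary initial guess v_0
for _ in range(1000):
    v_next = r_pi + gamma * P_pi @ v_k       # v_{k+1} = r_pi + gamma * P_pi v_k
    if np.max(np.abs(v_next - v_k)) < 1e-10: # stop when the update is negligible
        break
    v_k = v_next

print(v_k)  # converges to the same v_pi as the closed-form solution
```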

Here are two examples showing how state values can be leveraged to evaluate policies.

(figure: better policy)
(figure: worse policy)

Action value

From state value to action value:

  • State value: the average return the agent can get starting from a state.

  • Action value: the average return the agent can get starting from a state and taking an action.

Why do we care about action values? Because we want to know which action is better. This point will become clearer in the following lectures. We will frequently use action values.

The action value is determined by both the state and the action, so the definition is:

$$q_{\pi}(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right]$$

$q$ depends on $\pi$.

$$q_{\pi}(s, a) = \sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')$$

The action value is also composed of the immediate reward and the future reward. The difference between the action value and the state value is that the action value evaluates the return after a specific action is taken; it depends on both the state and the action, whereas the state value depends only on the state. The state value must therefore consider all possible actions, the state transitions they cause, and the rewards that follow.

The state value can also be seen as the action values weighted by the probabilities of all possible actions:

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\underbrace{\left[\sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]}_{\text{action value } q_{\pi}(s,a)}$$

State value and action value can be derived from each other according to the above formula.
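As a minimal sketch (with hypothetical tabular arrays `pi`, `P`, and `R_exp`; none of them come from the tutorial), the two directions of this relationship can be checked numerically:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# Hypothetical tabular quantities:
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])    # pi(a|s)
P = np.zeros((n_states, n_actions, n_states))           # p(s'|s, a)
P[0, 0, 1] = P[0, 1, 2] = P[1, 0, 1] = P[1, 1, 1] = P[2, 0, 2] = P[2, 1, 2] = 1.0
R_exp = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])  # sum_r p(r|s,a) r

# Solve the Bellman equation for v_pi, then compute q_pi from it.
P_pi = np.einsum("sa,sat->st", pi, P)                   # p_pi(s'|s)
r_pi = np.sum(pi * R_exp, axis=1)                       # r_pi(s)
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

q = R_exp + gamma * np.einsum("sat,t->sa", P, v)        # q_pi(s,a) from v_pi
v_from_q = np.sum(pi * q, axis=1)                       # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
print(np.allclose(v, v_from_q))                         # True
```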

Example for action value:

(figure: action value)

Note that, for example, $q_{\pi}(s_1, a_3) \ne 0$.
