
Chapter 2 Bellman Equation



In this chapter, we will introduce two key concepts and one important formula.

Revision

I recommend reading the motivating examples in the tutorial. Here I will skip that part and directly introduce the concepts.

Before delving into the new content, we need to revise some previous key concepts.

We have learned four key components in RL, namely the state $S$, action $A$, policy $\pi$, and reward $R$.

These four components are connected by the following process:

$$S_t \overset{A_t}{\rightarrow} R_{t+1}, S_{t+1}$$

This step is governed by the following probability distributions:

  1. Which action to take under the current state, given the policy

$S_t \rightarrow A_t$ is governed by $\pi(A_t|S_t)$.

  2. Which state the agent transitions to after taking the action

$S_t, A_t \rightarrow S_{t+1}$ is governed by $p(S_{t+1}|S_t, A_t)$.

  3. How much reward the agent receives

$S_t, A_t \rightarrow R_{t+1}$ is governed by $p(R_{t+1}|S_t, A_t)$.

Notice that $S_t$, $A_t$, and $R_{t+1}$ are all random variables. Thus, we can calculate their expectations.

Understanding these three distributions is crucial for learning the Bellman equation.
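To make the three distributions concrete, here is a minimal sketch of sampling one interaction step. The tabular arrays `pi`, `P`, and `R_dist` below are hypothetical placeholders, not taken from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
reward_values = np.array([0.0, 1.0])                          # possible reward values

# Hypothetical distributions (each distribution sums to 1):
pi = np.full((n_states, n_actions), 1.0 / n_actions)          # pi(a|s)
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)  # p(s'|s, a)
R_dist = np.full((n_states, n_actions, 2), 0.5)               # p(r|s, a)

def step(s):
    """Sample S_t -> A_t -> R_{t+1}, S_{t+1} using the three distributions."""
    a = rng.choice(n_actions, p=pi[s])                 # A_t ~ pi(. | S_t)
    s_next = rng.choice(n_states, p=P[s, a])           # S_{t+1} ~ p(. | S_t, A_t)
    r = reward_values[rng.choice(2, p=R_dist[s, a])]   # R_{t+1} ~ p(. | S_t, A_t)
    return a, r, s_next

print(step(0))
```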

State Value

Definition: the expectation of the discounted return obtained starting from a specific state.

Denoted as $v_{\pi}(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$.
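Recall that $G_t$ denotes the discounted return, i.e., the cumulative discounted reward collected along a trajectory:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$$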

Remarks:

  • State value is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.

  • It is based on the policy $\pi$. For a different policy, the state value may be different.

  • It represents the “value” of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.

Q: What is the relationship between return and state value?

A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything, i.e. $\pi(a|s)$, $p(r|s,a)$, and $p(s'|s,a)$, is deterministic, then the state value is the same as the return.

Bellman Equation

We have seen that the state value describes the expectation of the discounted return obtained from the current state. Intuitively, the state value is the expected reward over all possible actions the agent may take in the current state. When taking an action, the agent obtains a reward immediately and moves into a new state. In that new state, the agent continues to 'benefit' by taking another action and receiving another reward.

So the state value consists of two parts, namely the immediate reward and the future reward.

The proof is omitted here. I recommend looking it up in the tutorial.

The Bellman Equation is:

$$v_{\pi}(s) = \underbrace{\sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r}_{\text{mean of immediate reward}} + \underbrace{\gamma \sum_{a}\pi(a|s)\sum_{s'}p(s'|s,a)\,v_{\pi}(s')}_{\text{mean of future reward}}$$

A brief version:

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\left[\sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]$$

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\left[\mathbb{E}\left[R_{t+1}\mid S_t=s, A_t=a\right] + \gamma\,\mathbb{E}\left[G_{t+1}\mid S_t=s, A_t=a\right]\right]$$

A briefer version:

$$v_{\pi}(s) = r_{\pi}(s) + \gamma \sum_{s'} p_{\pi}(s'|s)\,v_{\pi}(s')$$

We can write it into matrix form:

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

where,

  • $v_{\pi} = \bigl[v_{\pi}(s_1), \dots, v_{\pi}(s_n)\bigr]^{\top} \in \mathbb{R}^n$

  • $r_{\pi} = \bigl[r_{\pi}(s_1), \dots, r_{\pi}(s_n)\bigr]^{\top} \in \mathbb{R}^n$

  • $P_{\pi} \in \mathbb{R}^{n\times n}$, where $[P_{\pi}]_{ij} = p_{\pi}(s_j|s_i)$, is the state-transition matrix.

Assuming there are four states, the matrix form is:

(figure: matrix form)
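Written out elementwise (a sketch that simply follows the definitions of $r_{\pi}$ and $P_{\pi}$ above), the four-state case reads:

$$
\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}
=
\begin{bmatrix} r_{\pi}(s_1)\\ r_{\pi}(s_2)\\ r_{\pi}(s_3)\\ r_{\pi}(s_4) \end{bmatrix}
+
\gamma
\begin{bmatrix}
p_{\pi}(s_1|s_1) & p_{\pi}(s_2|s_1) & p_{\pi}(s_3|s_1) & p_{\pi}(s_4|s_1)\\
p_{\pi}(s_1|s_2) & p_{\pi}(s_2|s_2) & p_{\pi}(s_3|s_2) & p_{\pi}(s_4|s_2)\\
p_{\pi}(s_1|s_3) & p_{\pi}(s_2|s_3) & p_{\pi}(s_3|s_3) & p_{\pi}(s_4|s_3)\\
p_{\pi}(s_1|s_4) & p_{\pi}(s_2|s_4) & p_{\pi}(s_3|s_4) & p_{\pi}(s_4|s_4)
\end{bmatrix}
\begin{bmatrix} v_{\pi}(s_1)\\ v_{\pi}(s_2)\\ v_{\pi}(s_3)\\ v_{\pi}(s_4) \end{bmatrix}
$$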

Two examples:

(figure: illustrative example 1)
(figure: illustrative example 2)

The Bellman equation reflects the relationship between state values. It is a set of equations, one for each state, rather than a single equation.

We can obtain the state values by solving it.

Ways to solve for state values

Given a policy, finding the corresponding state values is called policy evaluation. It is a fundamental problem in RL and the foundation for finding better policies. Thus, it is important to understand how to solve the Bellman equation.

  1. Closed-form solution

From the matrix form:

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

we can get:

$$v_{\pi} = (\mathbb{I} - \gamma P_{\pi})^{-1}r_{\pi}$$

The closed-form solution is rarely used in practice because computing the matrix inverse is expensive when the number of states is large.
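Still, for a small example it is easy to compute. Here is a minimal sketch; the reward vector `r_pi` and transition matrix `P_pi` below are hypothetical, not taken from the tutorial, and the linear system is solved directly instead of forming the inverse explicitly:

```python
import numpy as np

gamma = 0.9
# Hypothetical 4-state example: r_pi[i] = r_pi(s_i), P_pi[i, j] = p_pi(s_j | s_i)
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Closed form: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi)
```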

  2. Iterative solution

An iterative solution is:

$$v_{k+1} = r_{\pi} + \gamma P_{\pi}v_{k}$$

This algorithm produces a sequence $\{v_0, v_1, v_2, \dots\}$. We can show that

$$v_k \to v_{\pi} = (\mathbb{I} - \gamma P_{\pi})^{-1}r_{\pi}, \quad k \to \infty$$

We will introduce it in detail in the following chapter.
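For completeness, here is a minimal sketch of the iteration on the same hypothetical four-state example used in the closed-form sketch above:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])        # hypothetical r_pi(s_i)
P_pi = np.array([                            # hypothetical p_pi(s_j | s_i)
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

v_k = np.zeros(4)                            # arbitrary initial guess v_0
for _ in range(1000):
    v_next = r_pi + gamma * P_pi @ v_k       # v_{k+1} = r_pi + gamma * P_pi v_k
    if np.max(np.abs(v_next - v_k)) < 1e-10: # stop when the update is negligible
        break
    v_k = v_next

print(v_k)  # converges to the same v_pi as the closed-form solution
```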

Here are two examples showing how state values can be leveraged to evaluate policies.

(figure: better policy)
(figure: worse policy)

Action value

From state value to action value:

  • State value: the average return the agent can get starting from a state.

  • Action value: the average return the agent can get starting from a state and taking an action.

Why do we care about action values? Because we want to know which action is better. This point will become clearer in the following lectures. We will frequently use action values.

The action value is determined by both the state and the action, so the definition is:

$$q_{\pi}(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right]$$

$q$ depends on $\pi$.

$$q_{\pi}(s, a) = \sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')$$

The action value is also composed of the immediate reward and the future reward. The difference between the action value and the state value is that the action value evaluates the return after a specific action is taken; it depends on both the state and the action, whereas the state value depends only on the state. The state value must therefore consider all possible actions, the state transitions they cause, and the rewards that follow.

The state value can also be seen as the action values weighted by the probabilities of all possible actions:

$$v_{\pi}(s) = \sum_{a}\pi(a|s)\underbrace{\left[\sum_{r}p(r|s,a)\,r + \gamma \sum_{s'}p(s'|s,a)\,v_{\pi}(s')\right]}_{\text{action value } q_{\pi}(s,a)}$$

State value and action value can be derived from each other according to the above formula.
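As a minimal sketch (with hypothetical tabular arrays `pi`, `P`, and `R_exp`; none of them come from the tutorial), the two directions of this relationship can be checked numerically:

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2

# Hypothetical tabular quantities:
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])    # pi(a|s)
P = np.zeros((n_states, n_actions, n_states))           # p(s'|s, a)
P[0, 0, 1] = P[0, 1, 2] = P[1, 0, 1] = P[1, 1, 1] = P[2, 0, 2] = P[2, 1, 2] = 1.0
R_exp = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])  # sum_r p(r|s,a) r

# Solve the Bellman equation for v_pi, then compute q_pi from it.
P_pi = np.einsum("sa,sat->st", pi, P)                   # p_pi(s'|s)
r_pi = np.sum(pi * R_exp, axis=1)                       # r_pi(s)
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

q = R_exp + gamma * np.einsum("sat,t->sa", P, v)        # q_pi(s,a) from v_pi
v_from_q = np.sum(pi * q, axis=1)                       # v_pi(s) = sum_a pi(a|s) q_pi(s,a)
print(np.allclose(v, v_from_q))                         # True
```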

Example for action value:

(figure: action value)

Note that, for example, $q_{\pi}(s_1, a_3) \ne 0$.
