
Chapter 3 Optimal Policy and Bellman Optimality Equation



We know that RL's ultimate goal is to find the optimal policy. In this chapter we will show how to obtain the optimal policy through the Bellman Optimality Equation.

Optimal Policy

The state value can be used to evaluate whether a policy is good or not: if

$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal S,$$

we say policy $\pi_1$ is 'better' than $\pi_2$.

If

$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal S, \ \forall \pi,$$

we say policy $\pi^*$ is the optimal policy.

Here come the questions:

  • Does the optimal policy exist?

  • Is the optimal policy unique?

  • Is the optimal policy stochastic or deterministic?

  • How to obtain the optimal policy?

The Bellman Optimality Equation (BOE) will give you the answers.

Bellman optimality equation (BOE)

Bellman optimality equation (elementwise form):

$$
\begin{aligned}
v(s) &= \max_{\pi} \sum_{a} \pi(a\mid s) \Bigl( \sum_{r} p(r\mid s,a)\,r + \gamma \sum_{s'} p(s'\mid s,a)\,v(s') \Bigr), \quad \forall s \in \mathcal S,\\
&= \max_{\pi} \sum_{a} \pi(a\mid s)\,q(s,a), \quad s \in \mathcal S.
\end{aligned}
$$

Notes:

  • $p(r\mid s,a)$ and $p(s'\mid s,a)$ are known.

  • $v(s)$ and $v(s')$ are unknown and to be calculated.

  • The policy $\pi$ is not given: it is the variable of the maximization, i.e., the action distribution $\pi(\cdot\mid s)$ is optimized at each state $s$.

Bellman optimality equation (matrix-vector form):

$$v = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

The expression contains two unknown elements, namely the policy $\pi$ and the state value $v$, so we need to find an approach to solve it. But before introducing the solving algorithm, we need to learn some preliminaries through some interesting examples.

Motivating examples

As mentioned above, the BOE has two unknowns in one equation. How can we solve a problem like this? See the following example:

Tip

Example (How to solve two unknowns from one equation)
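For instance, a standard illustration of this idea (it may differ from the figure in the original post): consider solving $x = \max_{y} (2x - 1 - y^2)$ for two unknowns $x, y \in \mathbb R$. Whatever $x$ is, the inner maximum over $y$ is attained at $y = 0$ (since $-y^2 \le 0$), which reduces the equation to $x = 2x - 1$, giving $x = 1$. Solving the inner maximization first turns one equation in two unknowns into an equation we can solve.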

Okay, we know that the way is to fix one unknown and solve the equation. Suppose we fix $v(s')$ on the right-hand side of the equation:

$$
\begin{aligned}
v(s) &= \max_{\pi} \sum_{a} \pi(a\mid s) \Bigl( \sum_{r} p(r\mid s,a)\,r + \gamma \sum_{s'} p(s'\mid s,a)\,v(s') \Bigr), \quad \forall s \in \mathcal S,\\
&= \max_{\pi} \sum_{a} \pi(a\mid s)\,q(s,a), \quad s \in \mathcal S.
\end{aligned}
$$

We know that $\sum_{a} \pi(a\mid s) = 1$. What we need to solve is the maximum of a weighted sum in which different probabilities are assigned to the action values. A similar example goes:

Tip

Example (How to solve max)

Through that example, we know how to obtain the optimal policy: at every state, adopt the action with the largest action value.
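As an illustration (the numbers are for demonstration only, not from the original example): suppose at state $s$ there are three actions with $q(s,a_1)=1$, $q(s,a_2)=3$, $q(s,a_3)=2$. Since the weights $\pi(a\mid s)$ are nonnegative and sum to one,

$$\max_{\pi}\sum_a \pi(a\mid s)\, q(s,a) = \max_a q(s,a) = 3,$$

attained by putting all probability on $a_2$: $\pi(a_2\mid s)=1$, $\pi(a_1\mid s)=\pi(a_3\mid s)=0$. In general, the optimal policy is the deterministic greedy policy

$$\pi^*(a\mid s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \ne a^*(s), \end{cases} \qquad a^*(s) = \arg\max_a q(s,a).$$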


Solve the Bellman optimality equation

Preliminaries

  • Fixed point: $x \in X$ is a fixed point of $f : X \to X$ if

$$x = f(x)$$

  • Contraction mapping: $f$ is a contraction mapping if

$$\| f(x_1) - f(x_2) \| \le \gamma \| x_1 - x_2 \|, \quad \gamma \in (0, 1)$$

Tip

Example

So here we can introduce the important theorem:

Important

Contraction mapping theorem: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then (1) a fixed point $x^*$ satisfying $x^* = f(x^*)$ exists; (2) the fixed point is unique; (3) the iterative sequence $x_{k+1} = f(x_k)$ converges to $x^*$ exponentially fast for any initial guess $x_0$.

For example, take $x = 0.5x$, i.e., the iteration $x_{k+1} = 0.5 x_k$. Suppose $x_0 = 10$; then $x_1 = 5$, $x_2 = 2.5$, $\ldots$, and $x_k \to 0$, the unique fixed point.
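A minimal numerical sketch of this behaviour (the function $f(x) = 0.5x$, the starting point, and the tolerance are just the example values above):

```python
def fixed_point_iteration(f, x0, tol=1e-8, max_iter=1000):
    """Iterate x_{k+1} = f(x_k) until successive values are within tol."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter

# f(x) = 0.5x is a contraction with gamma = 0.5; its unique fixed point is 0.
x_star, n_iters = fixed_point_iteration(lambda x: 0.5 * x, x0=10.0)
print(x_star, n_iters)  # converges to the fixed point 0 (here after about 30 iterations)
```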

Contraction property of BOE

The Bellman optimality equation

$$v = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

can be regarded as a function of $v$, so it can be written as

$$v = f(v)$$

where $f(v)$ is a contraction mapping.
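In fact (this is the standard contraction property of the BOE, stated here for completeness), one can show that $\| f(v_1) - f(v_2) \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$ for any $v_1, v_2$, so the discount rate $\gamma < 1$ itself serves as the contraction factor.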

So we can utilize the contraction mapping theorem to solve the BOE.

Applying the contraction mapping theorem to solve the BOE: since $f$ is a contraction mapping, the BOE $v = f(v)$ has a unique solution $v^*$, and it can be found by the iteration $v_{k+1} = f(v_k)$, which converges to $v^*$ from any initial guess $v_0$.

Iterative algorithm

Matrix-vector form:

$$v_{k+1} = f(v_k) = \max_{\pi}\bigl(r_{\pi} + \gamma\,P_{\pi}\,v_k\bigr)$$

Elementwise form:

$$
\begin{aligned}
v_{k+1}(s) &= \max_{\pi}\sum_{a}\pi(a\mid s)\Bigl(\sum_{r}p(r\mid s,a)\,r + \gamma\sum_{s'}p(s'\mid s,a)\,v_k(s')\Bigr),\\
&= \max_{\pi}\sum_{a}\pi(a\mid s)\,q_k(s,a),\\
&= \max_{a}q_k(s,a).
\end{aligned}
$$

The procedure goes:

For every state $s$ at iteration $k$: (1) compute $q_k(s,a) = \sum_{r} p(r\mid s,a)\,r + \gamma\sum_{s'} p(s'\mid s,a)\,v_k(s')$ for every action $a$; (2) take the greedy action $a_k^*(s) = \arg\max_a q_k(s,a)$, i.e., set $\pi_{k+1}(a\mid s) = 1$ for $a = a_k^*(s)$ and $0$ otherwise; (3) update the value $v_{k+1}(s) = \max_a q_k(s,a)$. Repeat until $\|v_{k+1} - v_k\|$ is small enough (a code sketch of this procedure follows the example below).

Tip

Example:

Actions: $a_l$, $a_0$, $a_r$ represent go left, stay unchanged, and go right.

Reward: entering the target area: +1; attempting to go out of the boundary: −1.

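Below is a minimal sketch of the iterative algorithm applied to this kind of example. The exact layout of the original example is not shown in the text, so the code assumes a one-dimensional grid of three cells with the middle cell as the target area; the rewards follow the description above (+1 for entering the target area, −1 for attempting to cross the boundary, 0 otherwise), and the discount rate is an assumed value.

```python
import numpy as np

# Assumed setup (illustration only): a 1-D grid with 3 cells; cell 1 is the target.
N_STATES = 3
TARGET = 1
ACTIONS = (-1, 0, +1)   # a_l (go left), a_0 (stay), a_r (go right)
GAMMA = 0.9             # discount rate (assumed value)

def step(s, a):
    """Deterministic model: next state and reward for taking action a in state s."""
    s_next = s + a
    if s_next < 0 or s_next >= N_STATES:   # attempting to cross the boundary
        return s, -1.0
    return s_next, (1.0 if s_next == TARGET else 0.0)

# Value iteration: v_{k+1}(s) = max_a q_k(s, a)
v = np.zeros(N_STATES)
for _ in range(1000):
    q = np.array([[step(s, a)[1] + GAMMA * v[step(s, a)[0]] for a in ACTIONS]
                  for s in range(N_STATES)])
    v_new = q.max(axis=1)
    converged = np.max(np.abs(v_new - v)) < 1e-8
    v = v_new
    if converged:
        break

policy = [ACTIONS[i] for i in q.argmax(axis=1)]   # greedy action per state
print("v* =", v)        # optimal state values
print("pi* =", policy)  # expected: every state moves toward (or stays at) the target
```

Rerunning the sketch with a different `GAMMA` illustrates the role of the discount rate discussed below: a smaller $\gamma$ makes rewards far in the future contribute less to $q_k(s,a)$, i.e., the resulting policy becomes more short-sighted.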

Factors that determine the optimal policy

  • Reward design: $r$.

  • System model: $p(s'\mid s, a)$, $p(r\mid s, a)$.

  • Discount rate: $\gamma$, affecting whether the resulting policy is short-sighted or far-sighted.

  • In contrast, $v(s)$, $v(s')$, and $\pi(a\mid s)$ are unknowns to be calculated, not given quantities.
