
Chapter 3 Optimal Policy and Bellman Optimality Equation



We know that RL's ultimate goal is to find the optimal policy. In this chapter we will show how to obtain the optimal policy through the Bellman Optimality Equation.

Optimal Policy

The state value can be used to evaluate whether a policy is good or not: if

$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal S,$$

we say policy $\pi_1$ is 'better' than $\pi_2$.

If

$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal S, \ \forall \pi,$$

we say policy $\pi^*$ is the optimal policy.

Here come the questions:

  • Does the optimal policy exist?

  • Is the optimal policy unique?

  • Is the optimal policy stochastic or deterministic?

  • How to obtain the optimal policy?

The Bellman Optimality Equation (BOE) will give you the answers.

Bellman optimality equation (BOE)

Bellman optimality equation (elementwise form):

$$
\begin{aligned}
v(s) &= \max_{\pi} \sum_{a} \pi(a\mid s) \Bigl( \sum_{r} p(r\mid s,a)\,r + \gamma \sum_{s'} p(s'\mid s,a)\,v(s') \Bigr), \quad \forall s \in \mathcal S,\\
&= \max_{\pi} \sum_{a} \pi(a\mid s)\,q(s,a), \quad s \in \mathcal S.
\end{aligned}
$$

Notes:

  • $p(r\mid s,a)$ and $p(s'\mid s,a)$ are known.

  • $v(s)$ and $v(s')$ are unknown and to be calculated.

  • The policy $\pi$ is not given: it is the variable of the maximization, i.e., the action distribution $\pi(\cdot\mid s)$ is optimized at each state $s$.

Bellman optimality equation (matrix-vector form):

$$v = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

The expression contains two unknown elements, namely the policy $\pi$ and the state value $v$, so we need to find an approach to solve it. But before introducing the solving algorithm, we need to learn some preliminaries through some interesting examples.

Motivating examples

As mentioned above, the BOE has two unknowns in one equation. How can we solve a problem like this? See the following example:

Tip

Example (How to solve two unknowns from one equation)
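For instance, a standard illustration of this idea (it may differ from the figure in the original post): consider solving $x = \max_{y} (2x - 1 - y^2)$ for two unknowns $x, y \in \mathbb R$. Whatever $x$ is, the inner maximum over $y$ is attained at $y = 0$ (since $-y^2 \le 0$), which reduces the equation to $x = 2x - 1$, giving $x = 1$. Solving the inner maximization first turns one equation in two unknowns into an equation we can solve.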

Okay, we know that the way is to fix one unknown and solve the equation. Suppose we fix $v(s')$ on the right-hand side of the equation:

$$
\begin{aligned}
v(s) &= \max_{\pi} \sum_{a} \pi(a\mid s) \Bigl( \sum_{r} p(r\mid s,a)\,r + \gamma \sum_{s'} p(s'\mid s,a)\,v(s') \Bigr), \quad \forall s \in \mathcal S,\\
&= \max_{\pi} \sum_{a} \pi(a\mid s)\,q(s,a), \quad s \in \mathcal S.
\end{aligned}
$$

We know that $\sum_{a} \pi(a\mid s) = 1$. What we need to solve is the maximum of a weighted sum in which different probabilities are assigned to the action values. A similar example goes:

Tip

Example (How to solve max)

Through that example, we know how to obtain the optimal policy: at every state, adopt the action with the largest action value.
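As an illustration (the numbers are for demonstration only, not from the original example): suppose at state $s$ there are three actions with $q(s,a_1)=1$, $q(s,a_2)=3$, $q(s,a_3)=2$. Since the weights $\pi(a\mid s)$ are nonnegative and sum to one,

$$\max_{\pi}\sum_a \pi(a\mid s)\, q(s,a) = \max_a q(s,a) = 3,$$

attained by putting all probability on $a_2$: $\pi(a_2\mid s)=1$, $\pi(a_1\mid s)=\pi(a_3\mid s)=0$. In general, the optimal policy is the deterministic greedy policy

$$\pi^*(a\mid s) = \begin{cases} 1, & a = a^*(s), \\ 0, & a \ne a^*(s), \end{cases} \qquad a^*(s) = \arg\max_a q(s,a).$$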


Solve the Bellman optimality equation

Preliminaries

  • Fixed point: $x \in X$ is a fixed point of $f : X \to X$ if

$$x = f(x)$$

  • Contraction mapping: $f$ is a contraction mapping if

$$\| f(x_1) - f(x_2) \| \le \gamma \| x_1 - x_2 \|, \quad \gamma \in (0, 1)$$

Tip

Example

So here we can introduce the important theorem:

Important

Contraction mapping theorem: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then (1) a fixed point $x^*$ satisfying $x^* = f(x^*)$ exists; (2) the fixed point is unique; (3) the iterative sequence $x_{k+1} = f(x_k)$ converges to $x^*$ exponentially fast for any initial guess $x_0$.

For example, take $x = 0.5x$, i.e., the iteration $x_{k+1} = 0.5 x_k$. Suppose $x_0 = 10$; then $x_1 = 5$, $x_2 = 2.5$, $\ldots$, and $x_k \to 0$, the unique fixed point.
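A minimal numerical sketch of this behaviour (the function $f(x) = 0.5x$, the starting point, and the tolerance are just the example values above):

```python
def fixed_point_iteration(f, x0, tol=1e-8, max_iter=1000):
    """Iterate x_{k+1} = f(x_k) until successive values are within tol."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    return x, max_iter

# f(x) = 0.5x is a contraction with gamma = 0.5; its unique fixed point is 0.
x_star, n_iters = fixed_point_iteration(lambda x: 0.5 * x, x0=10.0)
print(x_star, n_iters)  # converges to the fixed point 0 (here after about 30 iterations)
```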

Contraction property of BOE

The Bellman optimality equation

$$v = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

can be regarded as a function of $v$, so it can be written as

$$v = f(v)$$

where $f(v)$ is a contraction mapping.
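In fact (this is the standard contraction property of the BOE, stated here for completeness), one can show that $\| f(v_1) - f(v_2) \|_\infty \le \gamma \| v_1 - v_2 \|_\infty$ for any $v_1, v_2$, so the discount rate $\gamma < 1$ itself serves as the contraction factor.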

So we can utilize the contraction mapping theorem to solve the BOE.

Applying the contraction mapping theorem to solve the BOE: since $f$ is a contraction mapping, the BOE $v = f(v)$ has a unique solution $v^*$, and it can be found by the iteration $v_{k+1} = f(v_k)$, which converges to $v^*$ from any initial guess $v_0$.

Iterative algorithm

Matrix-vector form:

$$v_{k+1} = f(v_k) = \max_{\pi}\bigl(r_{\pi} + \gamma\,P_{\pi}\,v_k\bigr)$$

Elementwise form:

$$
\begin{aligned}
v_{k+1}(s) &= \max_{\pi}\sum_{a}\pi(a\mid s)\Bigl(\sum_{r}p(r\mid s,a)\,r + \gamma\sum_{s'}p(s'\mid s,a)\,v_k(s')\Bigr),\\
&= \max_{\pi}\sum_{a}\pi(a\mid s)\,q_k(s,a),\\
&= \max_{a}q_k(s,a).
\end{aligned}
$$

The procedure goes:

For every state $s$ at iteration $k$: (1) compute $q_k(s,a) = \sum_{r} p(r\mid s,a)\,r + \gamma\sum_{s'} p(s'\mid s,a)\,v_k(s')$ for every action $a$; (2) take the greedy action $a_k^*(s) = \arg\max_a q_k(s,a)$, i.e., set $\pi_{k+1}(a\mid s) = 1$ for $a = a_k^*(s)$ and $0$ otherwise; (3) update the value $v_{k+1}(s) = \max_a q_k(s,a)$. Repeat until $\|v_{k+1} - v_k\|$ is small enough (a code sketch of this procedure follows the example below).

Tip

Example:

Actions: $a_l$, $a_0$, $a_r$ represent go left, stay unchanged, and go right.

Reward: entering the target area: +1; attempting to go out of the boundary: −1.

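Below is a minimal sketch of the iterative algorithm applied to this kind of example. The exact layout of the original example is not shown in the text, so the code assumes a one-dimensional grid of three cells with the middle cell as the target area; the rewards follow the description above (+1 for entering the target area, −1 for attempting to cross the boundary, 0 otherwise), and the discount rate is an assumed value.

```python
import numpy as np

# Assumed setup (illustration only): a 1-D grid with 3 cells; cell 1 is the target.
N_STATES = 3
TARGET = 1
ACTIONS = (-1, 0, +1)   # a_l (go left), a_0 (stay), a_r (go right)
GAMMA = 0.9             # discount rate (assumed value)

def step(s, a):
    """Deterministic model: next state and reward for taking action a in state s."""
    s_next = s + a
    if s_next < 0 or s_next >= N_STATES:   # attempting to cross the boundary
        return s, -1.0
    return s_next, (1.0 if s_next == TARGET else 0.0)

# Value iteration: v_{k+1}(s) = max_a q_k(s, a)
v = np.zeros(N_STATES)
for _ in range(1000):
    q = np.array([[step(s, a)[1] + GAMMA * v[step(s, a)[0]] for a in ACTIONS]
                  for s in range(N_STATES)])
    v_new = q.max(axis=1)
    converged = np.max(np.abs(v_new - v)) < 1e-8
    v = v_new
    if converged:
        break

policy = [ACTIONS[i] for i in q.argmax(axis=1)]   # greedy action per state
print("v* =", v)        # optimal state values
print("pi* =", policy)  # expected: every state moves toward (or stays at) the target
```

Rerunning the sketch with a different `GAMMA` illustrates the role of the discount rate discussed below: a smaller $\gamma$ makes rewards far in the future contribute less to $q_k(s,a)$, i.e., the resulting policy becomes more short-sighted.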

Factors that determine the optimal policy

  • Reward design: $r$.

  • System model: $p(s'\mid s, a)$, $p(r\mid s, a)$.

  • Discount rate: $\gamma$, affecting whether the resulting policy is short-sighted or far-sighted.

  • In contrast, $v(s)$, $v(s')$, and $\pi(a\mid s)$ are unknowns to be calculated, not given quantities.
