Chapter 3 Optimal Policy and Bellman Optimality Equation
We know that the ultimate goal of RL is to find the optimal policy. In this chapter we show how to obtain the optimal policy through the Bellman optimality equation.
Optimal Policy
The state value can be used to evaluate whether a policy is good or not: if
$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal{S},$$
we say policy $\pi_1$ is 'better' than $\pi_2$.
If
$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal{S} \text{ and for any other policy } \pi,$$
we say policy $\pi^*$ is the optimal policy.
Here come the questions:
Does the optimal policy exist?
Is the optimal policy unique?
Is the optimal policy stochastic or deterministic?
How to obtain the optimal policy?
Bellman Optimality Equation (BOE) will give you the answers.
Bellman optimality equation (BOE)
Bellman optimality equation (elementwise form):
$$v(s) = \max_{\pi} \sum_{a} \pi(a|s) \left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$
Notes:
$p(r|s,a)$ and $p(s'|s,a)$ are known.
$v(s)$ and $v(s')$ are unknown and to be calculated.
The equation can also be written with the action value: $v(s) = \max_{\pi} \sum_{a} \pi(a|s)\, q(s,a)$, where $q(s,a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s')$.
Bellman optimality equation (matrix-vector form):
$$v = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v \right)$$
The expression contains two unknown elements, namely the policy $\pi$ and the state value $v$, so we need to find an approach to solve it. But before introducing the solving algorithm, we need to learn some preliminaries through a few interesting examples.
Motivating examples
As mentioned above, the BOE has two unknowns in one equation. How can we solve a problem like this? See the following example:
Tip

Consider two unknowns $x, y \in \mathbb{R}$ that satisfy the single equation $x = \max_{y} (2x - 1 - y^2)$. Regardless of the value of $x$, the right-hand side is maximized at $y = 0$, which gives $x = 2x - 1$ and hence $x = 1$, $y = 0$.
Okay, we know that the way is to fix one unknown and solve the equation. Suppose we fix the value $v(s')$ on the right-hand side of the equation.
We know that $v(s) = \max_{\pi} \sum_{a} \pi(a|s)\, q(s,a)$. What we need to solve is the maximum of a weighted sum, with different probabilities assigned to each action value. A similar example goes:
Tip

Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given and we want to find $c_1, c_2, c_3$ that maximize $\sum_{i} c_i q_i$ subject to $c_i \ge 0$ and $\sum_{i} c_i = 1$. The maximum is $\max_i q_i$, attained by putting all the weight on the largest $q_i$ (e.g., if $q_3$ is the largest, then $c_3 = 1$ and $c_1 = c_2 = 0$).
Through that example, we know how to obtain the optimal policy: in every state, put all the probability on the action with the largest action value, i.e., the greedy deterministic policy.
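To make this concrete, here is a minimal Python sketch of extracting the greedy policy from a table of action values; the state count and the q-values are made up purely for illustration:

```python
import numpy as np

# Hypothetical action values q(s, a) for 3 states and 3 actions (made-up numbers).
q = np.array([
    [0.1, 0.5, 0.3],   # q(s0, .)
    [0.7, 0.2, 0.2],   # q(s1, .)
    [0.0, 0.4, 0.9],   # q(s2, .)
])

# Greedy (deterministic) policy: pi(a|s) = 1 for the action with the largest q(s, a).
greedy_actions = q.argmax(axis=1)
pi = np.zeros_like(q)
pi[np.arange(q.shape[0]), greedy_actions] = 1.0

# The maximized value sum_a pi(a|s) q(s, a) equals max_a q(s, a) in every state.
print(greedy_actions)            # [1 0 2] for the numbers above
print((pi * q).sum(axis=1))      # equals q.max(axis=1)
```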

Solve the Bellman optimality equation
Preliminaries
- Fixed point: $x \in X$ is a fixed point of $f: X \to X$ if $f(x) = x$.
- Contraction mapping: $f$ is a contraction mapping if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for some $\gamma \in (0, 1)$ and all $x_1, x_2 \in X$.
Tip

Here $\|\cdot\|$ can be any vector norm, and the contraction factor must satisfy $\gamma \in (0, 1)$.
So here we can introduce the important theorem:
Important

Contraction mapping theorem: for any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then
- Existence: there exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
- Uniqueness: the fixed point $x^*$ is unique.
- Algorithm: the sequence $\{x_k\}$ generated by $x_{k+1} = f(x_k)$ converges to $x^*$ exponentially fast for any initial guess $x_0$.
Examples: $x = 0.5x$, or $x = Ax$ with $\|A\| \le \gamma < 1$. Suppose $f(x) = 0.5x$; then $|f(x_1) - f(x_2)| = 0.5|x_1 - x_2|$, so $f$ is a contraction, and the iteration $x_{k+1} = 0.5 x_k$ converges to the unique fixed point $x^* = 0$.
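As a quick numerical illustration (the function below is my own choice, not from the notes), fixed-point iteration on a contraction mapping converges to its unique fixed point from any starting point:

```python
# Fixed-point iteration x_{k+1} = f(x_k) for the contraction f(x) = 0.5 * x + 1.
# f is a contraction with gamma = 0.5; its unique fixed point is x* = 2.
def f(x):
    return 0.5 * x + 1.0

x = 10.0                      # arbitrary initial guess x_0
for k in range(20):
    x = f(x)
    print(k, x)               # the error |x_k - 2| halves at every step
```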
Contraction property of BOE
The Bellman optimality equation
$$v = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v \right)$$
can be regarded as a function of $v$, so it can be written as
$$v = f(v), \quad \text{where } f(v) = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v \right).$$
And $f(v)$ is a contraction mapping: $\|f(v_1) - f(v_2)\| \le \gamma \|v_1 - v_2\|$, where $\gamma \in (0, 1)$ is the discount rate.
So we can utilize the contraction mapping theorem to solve the BOE.

Iterative algorithm
Matrix-vector form:
$$v_{k+1} = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right), \quad k = 0, 1, 2, \dots$$
Elementwise form:
$$v_{k+1}(s) = \max_{\pi} \sum_{a} \pi(a|s) \left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s') \right) = \max_{a} q_k(s,a)$$
The procedure goes: for every state $s$ and every action $a$, compute
$$q_k(s,a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_k(s');$$
then take the greedy policy $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s) = \arg\max_a q_k(s,a)$ (and $0$ otherwise), and update the value $v_{k+1}(s) = \max_a q_k(s,a)$. Repeat until $v_k$ converges. A code sketch is given after the example below.

Tip

Consider a simple 1-D grid-world example with the following setup.
Actions: go left, stay unchanged, and go right.
Reward: entering the target area: $+1$; trying to go out of the boundary: $-1$.
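Below is a minimal value-iteration sketch for the 1-D example above. The notes do not fully specify the layout, so the details here are assumptions for illustration: 5 cells in a row, the target at cell index 2, a deterministic model, $+1$ also for staying in the target, and a discount rate of 0.9.

```python
import numpy as np

# Assumed layout (not specified in the notes): 5 cells in a row, target at cell 2.
n_states, target, gamma = 5, 2, 0.9
actions = [-1, 0, 1]                       # go left, stay unchanged, go right

def step(s, a):
    """Deterministic model: next state and immediate reward for action a in state s."""
    s_next = s + a
    if s_next < 0 or s_next >= n_states:   # trying to go out of the boundary
        return s, -1.0
    if s_next == target:                   # entering (or staying in) the target area
        return s_next, 1.0
    return s_next, 0.0

v = np.zeros(n_states)                     # initial guess v_0
for k in range(1000):                      # iterate v_{k+1}(s) = max_a q_k(s, a)
    q = np.array([[r + gamma * v[s_next]
                   for (s_next, r) in (step(s, a) for a in actions)]
                  for s in range(n_states)])
    v_new = q.max(axis=1)
    if np.abs(v_new - v).max() < 1e-8:     # convergence guaranteed by the contraction property
        break
    v = v_new

policy = q.argmax(axis=1)                  # greedy action index per state: 0=left, 1=stay, 2=right
print("optimal state values:", v.round(3))
print("greedy actions:", policy)
```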



Factors that determine the optimal policy
Reward design: $r$.
System model: $p(s'|s,a)$, $p(r|s,a)$.
Discount rate: $\gamma$, which affects whether the agent is short-sighted (small $\gamma$) or far-sighted (large $\gamma$).