
Value Iteration and Policy Iteration


In the last chapter, we studied the Bellman Optimality Equation and introduced the iterative algorithm that solves it. In this chapter we introduce three model-based approaches for deriving the optimal policy. I recommend reading the PDF tutorial yourself; in this post I will mainly focus on the differences between value iteration, policy iteration, and truncated policy iteration.

Value Iteration

$$v = f(v) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

In the last chapter, we saw that the contraction mapping theorem suggests an iterative algorithm:

$$v_{k+1} = f(v_k) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k), \quad k = 1, 2, 3, \ldots$$

$v_0$ can be arbitrary. Solving the equation with this iterative scheme is exactly the value iteration algorithm.

The algorithm consists of two steps:

  1. Step 1: Policy Update (PU)

Given the current value $v_k(s)$ (initially $v_0(s)$), calculate the action values $q(v_k(s), a)$. For each state, choose the action with the largest action value; this gives the updated policy $\pi_{k+1}$.

Note that $v_k$ is not a state value. It is just an intermediate value produced by the iteration.

  2. Step 2: Value Update (VU)

Given the new policy, calculate the new iterative value $v_{k+1}(s)$. Since the updated policy is deterministic and greedy, the new value equals the largest action value.

[Figure: Value iteration]

The procedure can be summarized as:

$$v_k(s) \to q(v_k(s), a) \to \pi_{k+1}(a|s) = \arg\max_{a} q(v_k(s), a) \to v_{k+1}(s) = \max_{a} q(v_k(s), a)$$

[Pseudocode: Value iteration algorithm]
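To make the two steps concrete, here is a minimal tabular sketch in Python (my own illustration, not the pseudocode from the tutorial). It assumes the model is given as an array `P[s, a, s']` of transition probabilities and an array `r[s, a]` of expected immediate rewards; these names are illustrative.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-6):
    """Tabular value iteration for a finite MDP with a known model.

    P : (S, A, S) array, P[s, a, s2] = p(s2 | s, a)
    r : (S, A) array,    r[s, a]     = expected immediate reward
    """
    num_states, num_actions = r.shape
    v = np.zeros(num_states)              # v_0 can be arbitrary
    while True:
        q = r + gamma * P @ v             # q(v_k(s), a) for all s, a -> shape (S, A)
        policy = q.argmax(axis=1)         # policy update (PU): greedy w.r.t. q
        v_new = q.max(axis=1)             # value update (VU): v_{k+1}(s) = max_a q
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
```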

Policy Iteration

  1. Step 1: Policy Evaluation (PE)

This is the calculation of the state value of the current policy $\pi_k$ through an iterative algorithm:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

Note that this step is itself an iterative process nested inside the policy iteration: starting from an arbitrary guess, we repeat $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}$ until it converges to $v_{\pi_k}$.

  2. Step 2: Policy Improvement (PI)

Now we have the state value of the current policy. We can obtain the action values through

$$q_{\pi_k}(s, a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s')$$

Then the improved policy is

$$\pi_{k+1}(a|s) = \arg\max_{a} q_{\pi_k}(s, a)$$

Or, written in one equation:

$$\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$$

This is the same as the elementwise formula above, since selecting the maximum action value in every state is exactly how the maximizing policy is obtained.

Note that the transition matrix here is $P_{\pi}$, not $P_{\pi_k}$ as in the value update of value iteration, because $\pi$ is what we are solving for now: it is unknown, and we derive it by maximizing the right-hand side, which amounts to finding the maximum action value in every state.

[Figure: Policy iteration process]
[Pseudocode: Policy iteration algorithm]
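For comparison, here is a minimal policy iteration sketch under the same assumed `(P, r)` model layout; again, the names are illustrative rather than taken from the tutorial pseudocode. Note how the policy-evaluation step is itself a loop.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9, eval_tol=1e-8):
    """Tabular policy iteration; P is (S, A, S), r is (S, A) as before."""
    num_states, num_actions = r.shape
    policy = np.zeros(num_states, dtype=int)       # arbitrary initial deterministic policy
    while True:
        # Policy evaluation (PE): solve v = r_pi + gamma * P_pi v by iteration
        r_pi = r[np.arange(num_states), policy]    # r_pi(s), shape (S,)
        P_pi = P[np.arange(num_states), policy]    # P_pi(s, s'), shape (S, S)
        v = np.zeros(num_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                v = v_new
                break
            v = v_new
        # Policy improvement (PI): greedy policy w.r.t. q_{pi_k}
        q = r + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # policy stable -> done
            return v, policy
        policy = new_policy
```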

Differences between Value Iteration and Policy Iteration

The key differences between value iteration and policy iteration are:

  • Value iteration has a single iterative process, while policy iteration has two: an outer iteration over policies and a nested policy-evaluation iteration.

  • In each iteration, value iteration performs only one value update, immediately updates the policy according to the resulting intermediate value, and then continues this single iterative process, from which the policy is finally obtained. Policy iteration, in contrast, runs a full inner iterative process to obtain the true state value of the current policy, then derives a new policy from that state value, and repeats until the policy converges.

[Figure: Comparison between VI and PI]

Truncated Policy Iteration Algorithm

Truncated policy iteration is a combination of the two. In other words, value iteration and policy iteration are the two extreme cases of truncated policy iteration: value iteration stops policy evaluation after a single sweep, policy iteration runs it until convergence, and truncated policy iteration stops after a fixed number of sweeps in between.

[Figure: Truncated policy iteration]
[Pseudocode: Truncated policy iteration algorithm]
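A sketch of how the truncation shows up in code, under the same assumed `(P, r)` layout: the only change from the policy iteration sketch above is that policy evaluation runs a fixed number of sweeps (`j_truncate`, an illustrative parameter name) instead of running to convergence.

```python
import numpy as np

def truncated_policy_iteration(P, r, gamma=0.9, j_truncate=5, tol=1e-6):
    """Policy iteration whose evaluation step is truncated to j_truncate sweeps."""
    num_states, num_actions = r.shape
    policy = np.zeros(num_states, dtype=int)
    v = np.zeros(num_states)
    while True:
        v_prev = v.copy()
        # Truncated policy evaluation: j_truncate sweeps, warm-started from the previous v
        r_pi = r[np.arange(num_states), policy]
        P_pi = P[np.arange(num_states), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy w.r.t. q
        q = r + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_prev)) < tol:
            return v, policy
```

Roughly speaking, setting `j_truncate` to 1 recovers the behaviour of value iteration, while letting it grow large recovers policy iteration.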

For the convergence:

[Proposition (Value Improvement)]
[Figure: Illustration]

So truncated policy iteration is actually a trade-off: more policy-evaluation sweeps per outer iteration mean fewer outer iterations are needed, at the cost of more computation per iteration.
