
Value Iteration and Policy Iteration


In the last chapter, we studied the Bellman Optimality Equation and introduced the iterative algorithm that solves it. In this chapter we introduce three model-based approaches for deriving the optimal policy. I recommend reading the PDF tutorial yourself; in this post I will mainly focus on the differences between value iteration, policy iteration, and truncated policy iteration.

Value Iteration

$$v = f(v) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v)$$

In the last chapter, we saw that the contraction mapping theorem suggests an iterative algorithm:

$$v_{k+1} = f(v_k) = \max_{\pi} (r_{\pi} + \gamma P_{\pi} v_k), \quad k = 1, 2, 3, \ldots$$

$v_0$ can be arbitrary. Solving the equation with this iterative scheme is exactly the value iteration algorithm.

The algorithm consists of two steps:

  1. Step 1: Policy Update (PU)

Given the current value $v_k(s)$ (initially $v_0(s)$), calculate the action values $q(v_k(s), a)$. For each state, choose the action with the largest action value; this gives the updated policy $\pi_{k+1}$.

Note that $v_k$ is not a state value. It is just an intermediate value produced by the iteration.

  2. Step 2: Value Update (VU)

Given the new policy, calculate the new iterative value $v_{k+1}(s)$. Since the updated policy is deterministic and greedy, the new value equals the largest action value.

[Figure: Value iteration]

The procedure can be summarized as:

$$v_k(s) \to q(v_k(s), a) \to \pi_{k+1}(a|s) = \arg\max_{a} q(v_k(s), a) \to v_{k+1}(s) = \max_{a} q(v_k(s), a)$$

[Pseudocode: Value iteration algorithm]
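To make the two steps concrete, here is a minimal tabular sketch in Python (my own illustration, not the pseudocode from the tutorial). It assumes the model is given as an array `P[s, a, s']` of transition probabilities and an array `r[s, a]` of expected immediate rewards; these names are illustrative.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-6):
    """Tabular value iteration for a finite MDP with a known model.

    P : (S, A, S) array, P[s, a, s2] = p(s2 | s, a)
    r : (S, A) array,    r[s, a]     = expected immediate reward
    """
    num_states, num_actions = r.shape
    v = np.zeros(num_states)              # v_0 can be arbitrary
    while True:
        q = r + gamma * P @ v             # q(v_k(s), a) for all s, a -> shape (S, A)
        policy = q.argmax(axis=1)         # policy update (PU): greedy w.r.t. q
        v_new = q.max(axis=1)             # value update (VU): v_{k+1}(s) = max_a q
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
```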

Policy Iteration

  1. Step 1: Policy Evaluation (PE)

This is the calculation of the state value of the current policy $\pi_k$ through an iterative algorithm:

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

Note that this step is itself an iterative process nested inside the policy iteration: starting from an arbitrary guess, we repeat $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}$ until it converges to $v_{\pi_k}$.

  2. Step 2: Policy Improvement (PI)

Now we have the state value of the current policy. We can obtain the action values through

$$q_{\pi_k}(s, a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_{\pi_k}(s')$$

Then the improved policy is

$$\pi_{k+1}(a|s) = \arg\max_{a} q_{\pi_k}(s, a)$$

Or, written in one equation:

$$\pi_{k+1} = \arg\max_{\pi} (r_{\pi} + \gamma P_{\pi} v_{\pi_k})$$

This is the same as the elementwise formula above, since selecting the maximum action value in every state is exactly how the maximizing policy is obtained.

Note that the transition matrix here is $P_{\pi}$, not $P_{\pi_k}$ as in the value update of value iteration, because $\pi$ is what we are solving for now: it is unknown, and we derive it by maximizing the right-hand side, which amounts to finding the maximum action value in every state.

[Figure: Policy iteration process]
[Pseudocode: Policy iteration algorithm]
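For comparison, here is a minimal policy iteration sketch under the same assumed `(P, r)` model layout; again, the names are illustrative rather than taken from the tutorial pseudocode. Note how the policy-evaluation step is itself a loop.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9, eval_tol=1e-8):
    """Tabular policy iteration; P is (S, A, S), r is (S, A) as before."""
    num_states, num_actions = r.shape
    policy = np.zeros(num_states, dtype=int)       # arbitrary initial deterministic policy
    while True:
        # Policy evaluation (PE): solve v = r_pi + gamma * P_pi v by iteration
        r_pi = r[np.arange(num_states), policy]    # r_pi(s), shape (S,)
        P_pi = P[np.arange(num_states), policy]    # P_pi(s, s'), shape (S, S)
        v = np.zeros(num_states)
        while True:
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < eval_tol:
                v = v_new
                break
            v = v_new
        # Policy improvement (PI): greedy policy w.r.t. q_{pi_k}
        q = r + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # policy stable -> done
            return v, policy
        policy = new_policy
```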

Differences between Value Iteration and Policy Iteration

The key differences between value iteration and policy iteration are:

  • Value iteration has a single iterative process, while policy iteration has two: an outer iteration over policies and a nested policy-evaluation iteration.

  • In each iteration, value iteration performs only one value update, immediately updates the policy according to the resulting intermediate value, and then continues this single iterative process, from which the policy is finally obtained. Policy iteration, in contrast, runs a full inner iterative process to obtain the true state value of the current policy, then derives a new policy from that state value, and repeats until the policy converges.

[Figure: Comparison between VI and PI]

Truncated Policy Iteration Algorithm

Truncated policy iteration is a combination of the two. In other words, value iteration and policy iteration are the two extreme cases of truncated policy iteration: value iteration stops policy evaluation after a single sweep, policy iteration runs it until convergence, and truncated policy iteration stops after a fixed number of sweeps in between.

[Figure: Truncated policy iteration]
[Pseudocode: Truncated policy iteration algorithm]
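A sketch of how the truncation shows up in code, under the same assumed `(P, r)` layout: the only change from the policy iteration sketch above is that policy evaluation runs a fixed number of sweeps (`j_truncate`, an illustrative parameter name) instead of running to convergence.

```python
import numpy as np

def truncated_policy_iteration(P, r, gamma=0.9, j_truncate=5, tol=1e-6):
    """Policy iteration whose evaluation step is truncated to j_truncate sweeps."""
    num_states, num_actions = r.shape
    policy = np.zeros(num_states, dtype=int)
    v = np.zeros(num_states)
    while True:
        v_prev = v.copy()
        # Truncated policy evaluation: j_truncate sweeps, warm-started from the previous v
        r_pi = r[np.arange(num_states), policy]
        P_pi = P[np.arange(num_states), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy w.r.t. q
        q = r + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_prev)) < tol:
            return v, policy
```

Roughly speaking, setting `j_truncate` to 1 recovers the behaviour of value iteration, while letting it grow large recovers policy iteration.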

For the convergence:

[Proposition (Value Improvement)]
[Figure: Illustration]

So truncated policy iteration is actually a trade-off: more policy-evaluation sweeps per outer iteration mean fewer outer iterations are needed, at the cost of more computation per iteration.
