Value Iteration and Policy Iteration
In the last chapter, we studied the Bellman Optimality Equation and introduced the iterative algorithm for solving it. In this chapter we introduce three model-based approaches for deriving the optimal policy. I recommend reading the PDF tutorial yourself; in this post I will mainly focus on the differences between value iteration, policy iteration and truncated policy iteration.
Value Iteration
In the last chapter, we saw that the contraction mapping theorem suggests an iterative algorithm:

$$v_{k+1} = f(v_k) = \max_{\pi}\left(r_\pi + \gamma P_\pi v_k\right), \qquad k = 0, 1, 2, \dots$$

where the initial value $v_0$ can be arbitrary. This is exactly the algorithm we call value iteration.
Each iteration consists of two steps:
- Step 1 Policy Update (PU)
Given the current value $v_k$, calculate the action values

$$q_k(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s').$$

For every state, choose the action with the largest action value, obtaining the updated policy $\pi_{k+1}(s) = \arg\max_a q_k(s,a)$.
Note that this $v_k$ is not a state value; it is just an intermediate value produced by the iteration (in general it is not the state value of any policy).
- Step 2 Value Update (VU)
Given the new policy $\pi_{k+1}$, calculate the new value $v_{k+1}$. Since the policy is deterministic and greedy, the new value equals the largest action value:

$$v_{k+1}(s) = \max_a q_k(s,a).$$


The procedure can be summarized as:

$$v_k(s) \;\rightarrow\; q_k(s,a) \;\rightarrow\; \pi_{k+1}(s) = \arg\max_a q_k(s,a) \;\rightarrow\; v_{k+1}(s) = \max_a q_k(s,a) \;\rightarrow\; q_{k+1}(s,a) \;\rightarrow\; \cdots$$

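To make the two steps concrete, here is a minimal NumPy sketch of value iteration. It assumes the model is given as a transition tensor `P` with `P[s, a, s'] = p(s' | s, a)` and an expected-reward matrix `R` with entries `R[s, a]`; these names and the array layout are my own assumptions, not from the tutorial.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration sketch for a finite MDP.

    P[s, a, s'] = p(s' | s, a), shape (S, A, S); R[s, a] = expected reward, shape (S, A).
    Returns the converged value vector and the greedy deterministic policy.
    """
    S, A = R.shape
    v = np.zeros(S)                    # v_0 can be arbitrary
    while True:
        q = R + gamma * P @ v          # q_k(s, a) for every state-action pair
        v_new = q.max(axis=1)          # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=1)          # policy update: greedy with respect to q
    return v_new, policy
```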
Policy Iteration
- Step 1 Policy Evaluation (PE)
Namely, compute the state value $v_{\pi_k}$ of the current policy $\pi_k$, i.e., solve

$$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$$

through the iterative algorithm

$$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \qquad j = 0, 1, 2, \dots$$

Note that this step is itself an iterative process, namely the nested iteration inside policy iteration.
- Step 2 Policy Improvement (PI)
Now we have the state value $v_{\pi_k}$ of the current policy. We can obtain the action values through

$$q_{\pi_k}(s,a) = \sum_r p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_{\pi_k}(s').$$

The improved policy is then

$$\pi_{k+1}(s) = \arg\max_a q_{\pi_k}(s,a),$$

or, in one equation,

$$\pi_{k+1} = \arg\max_\pi \left(r_\pi + \gamma P_\pi v_{\pi_k}\right),$$

which has the same form as the policy update in value iteration, since taking the maximum action value in each state is exactly how the greedy (currently optimal) policy is obtained.

Note that, unlike in the value update of value iteration where the policy has already been computed, here $\pi_{k+1}$ is exactly what we are calculating: it is unknown, and we derive it by maximizing the right-hand side, i.e., by finding the maximum action value in every state.
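As a contrast, here is a sketch of policy iteration under the same (assumed) `P`, `R` conventions as the value-iteration sketch above; the inner loop is the nested iteration mentioned in Step 1.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=1000, tol=1e-6):
    """Policy iteration sketch: alternate full policy evaluation and improvement."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation (PE): nested iteration v <- r_pi + gamma * P_pi v
        r_pi = R[np.arange(S), policy]          # r_pi(s) = R[s, pi(s)]
        P_pi = P[np.arange(S), policy]          # P_pi[s, s'] = p(s' | s, pi(s))
        v = np.zeros(S)
        for _ in range(eval_iters):
            v_new = r_pi + gamma * P_pi @ v
            if np.max(np.abs(v_new - v)) < tol:
                v = v_new
                break
            v = v_new
        # Policy improvement (PI): greedy policy with respect to q_{pi_k}
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy is stable, so it is optimal
            return v, policy
        policy = new_policy
```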

Differences between Value Iteration and Policy Iteration
The key difference between value iteration and policy iteration is that value iteration has a single iterative process while policy iteration has two: an outer loop that alternates evaluation and improvement, plus the nested policy-evaluation iteration.

Value iteration performs one value-update sweep, immediately updates the policy according to the intermediate value, continues the value iteration, and finally reads off the policy at the end. Policy iteration runs a full iterative process to obtain the true state value of the current policy, then derives a new policy from that state value, and repeats until the policy no longer changes.
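To make the contrast concrete, the two processes can be written side by side (PE = policy evaluation, PI = policy improvement, PU = policy update, VU = value update):

$$
\begin{aligned}
\text{Policy iteration: }\quad & \pi_0 \xrightarrow{\text{PE}} v_{\pi_0} \xrightarrow{\text{PI}} \pi_1 \xrightarrow{\text{PE}} v_{\pi_1} \xrightarrow{\text{PI}} \pi_2 \xrightarrow{\text{PE}} \cdots \\
\text{Value iteration: }\quad & v_0 \xrightarrow{\text{PU}} \pi_1' \xrightarrow{\text{VU}} v_1 \xrightarrow{\text{PU}} \pi_2' \xrightarrow{\text{VU}} v_2 \xrightarrow{\text{PU}} \cdots
\end{aligned}
$$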



Truncated policy iteration algorithm
Truncated policy iteration is a combination of the two. In other words, policy iteration and value iteration are the two extreme cases of truncated policy iteration: if the nested policy-evaluation iteration runs only one sweep, we essentially get value iteration; if it runs to convergence, we get policy iteration; truncated policy iteration stops it after a fixed finite number of sweeps.


For the convergence: truncating the policy evaluation does not break convergence. Each extra evaluation sweep brings the intermediate value closer to $v_{\pi_k}$, so running more sweeps per outer iteration reduces the number of outer iterations needed, but also makes each outer iteration more expensive; with any fixed number of sweeps the value sequence still converges to the optimal state value $v^*$.

So truncated policy iteration is actually a trade-off between the cost of each policy evaluation and the number of outer iterations.
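Here is a sketch of truncated policy iteration in the same (assumed) conventions; the only change from the policy-iteration sketch is that the evaluation loop runs a fixed number of sweeps, `j_truncate`, and is warm-started from the previous value.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5,
                               outer_iters=1000, tol=1e-6):
    """Truncated policy iteration sketch: policy evaluation is cut off after
    j_truncate sweeps instead of being run to convergence. j_truncate = 1 is
    essentially value iteration; a very large j_truncate approaches policy iteration."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    v = np.zeros(S)
    for _ in range(outer_iters):
        v_old = v.copy()
        # Truncated policy evaluation: only j_truncate sweeps, warm-started from v
        r_pi = R[np.arange(S), policy]
        P_pi = P[np.arange(S), policy]
        for _ in range(j_truncate):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to the (approximate) value
        q = R + gamma * P @ v
        policy = q.argmax(axis=1)
        if np.max(np.abs(v - v_old)) < tol:   # stop when the value stabilizes
            break
    return v, policy
```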