Chapter 2 Bellman Equation
In this chapter we will introduce two key concepts (the state value and the action value) and one important formula (the Bellman equation).
Revision
I recommend reading the motivating examples in the tutorial. Here I will skip that part and directly introduce the concepts.
Before delving into the new content, we need to revise some key concepts from the previous chapter.
We learned four key components in RL, namely state, action, policy, and reward.
These four components make up the following process:
$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} \cdots$$
Each step is governed by the following probability distributions:
- Which action to take in the current state, given the policy, is governed by $\pi(a|s)$.
- Which state the agent transitions to after taking the action is governed by $p(s'|s, a)$.
- How much reward is received is governed by $p(r|s, a)$.
Notice that $S_t$, $A_t$, and $R_{t+1}$ are all random variables, so we can calculate their expectations.
Understanding these three distributions is crucial for learning the Bellman equation.
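As a concrete illustration, here is a minimal Python sketch of one step of this process on a made-up toy MDP; the states, actions, and probability tables are illustrative assumptions, not part of the tutorial.

```python
import random

# Toy MDP, purely illustrative: 2 states, 2 actions.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# pi(a|s): the policy, a distribution over actions for each state.
pi = {"s1": {"a1": 0.8, "a2": 0.2},
      "s2": {"a1": 0.5, "a2": 0.5}}

# p(s'|s,a): the state-transition distribution.
p_next = {("s1", "a1"): {"s1": 0.1, "s2": 0.9},
          ("s1", "a2"): {"s1": 0.7, "s2": 0.3},
          ("s2", "a1"): {"s1": 0.4, "s2": 0.6},
          ("s2", "a2"): {"s1": 0.0, "s2": 1.0}}

# p(r|s,a): the reward distribution (here over a few possible reward values).
p_reward = {("s1", "a1"): {0.0: 0.5, 1.0: 0.5},
            ("s1", "a2"): {0.0: 1.0},
            ("s2", "a1"): {1.0: 1.0},
            ("s2", "a2"): {-1.0: 0.2, 1.0: 0.8}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def step(s):
    """One step S_t -> A_t -> R_{t+1}, S_{t+1} using the three distributions."""
    a = sample(pi[s])                 # action sampled from pi(a|s)
    s_next = sample(p_next[(s, a)])   # next state sampled from p(s'|s,a)
    r = sample(p_reward[(s, a)])      # reward sampled from p(r|s,a)
    return a, r, s_next

print(step("s1"))
```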
State Value
Definition: the state value is the expectation of the discounted return obtained when starting from a specific state.
Denoted as
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s], \quad \text{where } G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
Remarks:
State value is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
It is based on the policy $\pi$. For a different policy, the state value may be different.
It represents the “value” of a state. If the state value is greater, then the policy is better because greater cumulative rewards can be obtained.
Q: What is the relationship between return and state value?
A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything (the policy, the state transitions, and the rewards) is deterministic, then the state value equals the return.
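To make the return concrete, here is a small Python sketch that computes the discounted return of an assumed, fully deterministic reward sequence; in that deterministic case the state value coincides with this single return.

```python
# Discounted return for an assumed deterministic reward sequence (illustrative numbers).
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0, 1.0]  # R_{t+1}, R_{t+2}, ...

# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # when everything is deterministic, v_pi(s) equals this return
```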
Bellman Equation
We already know that the state value describes the expectation of the discounted return obtained from the current state. Intuitively, the state value averages over all the actions the agent may take in the current state: when taking an action, the agent obtains an immediate reward and moves to a new state, and from that new state it continues to "benefit" by taking further actions and collecting further rewards.
So the state value consists of two parts, namely the immediate reward and the future reward.
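The key step of the derivation is splitting the return into these two parts and taking expectations:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \cdots\right) = R_{t+1} + \gamma G_{t+1},$$
$$\Longrightarrow \quad v_\pi(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\,\mathbb{E}[G_{t+1} \mid S_t = s].$$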
The full derivation is omitted here. I recommend looking it up in the tutorial.
The Bellman equation is:
$$v_\pi(s) = \sum_{a} \pi(a|s)\left[\sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s')\right], \quad \forall s \in \mathcal{S}.$$
A brief version:
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s'), \quad \text{where } r_\pi(s) \triangleq \sum_{a}\pi(a|s)\sum_{r}p(r|s,a)\,r, \quad p_\pi(s'|s) \triangleq \sum_{a}\pi(a|s)\,p(s'|s,a).$$
A briefer version:
$$v_\pi(s) = \mathbb{E}\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right].$$
We can write it in matrix form:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^\top$, $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^\top$,
and $P_\pi$, with $[P_\pi]_{ss'} = p_\pi(s'|s)$, is the state-transition matrix.
Assuming there are four states, the matrix form reads:
$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix} + \gamma \begin{bmatrix} p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\ p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\ p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\ p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$
Two examples:


The Bellman equation reflects the relationships among the state values. It is a set of equations, one for each state, not just a single equation.
We can derive the state values by solving this set of equations.
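As a tiny worked illustration (the two-state setup and numbers are assumed, not from the tutorial): suppose under policy $\pi$ the agent moves deterministically from $s_1$ to $s_2$ with reward $0$, stays in $s_2$ with reward $1$, and $\gamma = 0.9$. The Bellman equation then gives one equation per state:
$$v_\pi(s_1) = 0 + 0.9\, v_\pi(s_2), \qquad v_\pi(s_2) = 1 + 0.9\, v_\pi(s_2),$$
so $v_\pi(s_2) = 10$ and $v_\pi(s_1) = 9$.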
Ways to solve state value
Given a policy, finding the corresponding state values is called policy evaluation! It is a fundamental problem in RL and the foundation for finding better policies. Thus, it is important to understand how to solve the Bellman equation.
- closed-form solution
From the matrix form $v_\pi = r_\pi + \gamma P_\pi v_\pi$, we can get:
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
The closed-form solution is rarely used in practice because it requires computing the inverse matrix, which is expensive for large state spaces.
- iterative solution
An iterative solution is:
$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \ldots$$
This algorithm produces a sequence $\{v_0, v_1, v_2, \ldots\}$. We can show that $v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ as $k \to \infty$.
We will introduce it in detail in the following chapter. A small numerical sketch of both approaches is given after this list.
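Here is a minimal NumPy sketch of both approaches; the two-state reward vector, transition matrix, and discount factor are the assumed toy numbers from the worked example above, not from the tutorial.

```python
import numpy as np

# Assumed toy problem: 2 states, with r_pi, P_pi, and gamma chosen for illustration.
gamma = 0.9
r_pi = np.array([0.0, 1.0])                 # r_pi(s) for each state
P_pi = np.array([[0.0, 1.0],                # P_pi[s, s'] = p_pi(s'|s)
                 [0.0, 1.0]])

# Closed-form solution: v_pi = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)  # [9. 10.]
print(v)         # converges to the same values
```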
Here are two examples showing how state values are leveraged to evaluate policies.


Action value
From state value to action value:
State value: the average return the agent can get starting from a state.
Action value: the average return the agent can get starting from a state and taking an action.
Why do we care about action values? Because we want to know which action is better. This point will become clearer in the following lectures, where we will frequently use action values.
The action value depends on both the state and the action, so the definition is:
$$q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a].$$
$q_\pi(s, a)$ depends on the state-action pair $(s, a)$ as well as the policy $\pi$.
The action value is also composed of the immediate reward and the future reward. The difference between the action value and the state value is that the action value evaluates the return after a specific action has been taken: it depends on both the state and the action, while the state value is based on the state only and must therefore consider all possible actions, the state transitions they cause, and the rewards that follow.
The state value can also be seen as the average of the action values weighted by the policy:
$$v_\pi(s) = \sum_{a} \pi(a|s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s').$$
State values and action values can be derived from each other according to the above formulas.
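As a quick numerical check, here is a small NumPy sketch that converts between state values and action values; it reuses the assumed two-state toy example from above, and all policy, transition, and reward tables are illustrative assumptions.

```python
import numpy as np

# Assumed toy MDP (illustrative): 2 states, 2 actions, deterministic rewards.
gamma = 0.9
pi = np.array([[1.0, 0.0],        # pi[s, a] = pi(a|s)
               [0.0, 1.0]])
P = np.array([[[0.0, 1.0],        # P[s, a, s'] = p(s'|s, a)
               [1.0, 0.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])
R = np.array([[0.0, -1.0],        # R[s, a] = expected immediate reward
              [0.0,  1.0]])

# Solve the Bellman equation for v_pi under this policy.
r_pi = np.einsum("sa,sa->s", pi, R)      # r_pi(s) = sum_a pi(a|s) R(s,a)
P_pi = np.einsum("sa,sax->sx", pi, P)    # p_pi(s'|s) = sum_a pi(a|s) p(s'|s,a)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Action values from state values: q(s,a) = R(s,a) + gamma * sum_s' p(s'|s,a) v(s')
q = R + gamma * np.einsum("sax,x->sa", P, v)

# State values back from action values: v(s) = sum_a pi(a|s) q(s,a)
v_check = np.einsum("sa,sa->s", pi, q)

print(v)        # state values
print(q)        # action values, including actions the policy does not take
print(v_check)  # matches v
```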
Example for action value:

Note that action values are defined for all actions, including actions that the current policy would not take.