Chapter 1 Basic Concepts of Reinforcement Learning
Reinforcement learning (RL) can be illustrated by a grid-world example.
We place an agent in an environment; the goal of the agent is to find a good route to a target cell. Every cell the agent can occupy is a state. At each state, the agent takes one action according to a certain policy. The goal of RL is to find a good policy that guides the agent through a sequence of actions: starting from the start cell, moving from one state to another, and finally reaching the target.
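As an illustration, here is a minimal sketch of such a grid world in Python. It assumes a hypothetical 3x3 grid with the target in the bottom-right cell; the names (`GRID_SIZE`, `TARGET`, `step`) and the uniformly random policy are purely illustrative, not part of the original example.

```python
import random

# Hypothetical 3x3 grid world: the agent starts at (0, 0) and wants to reach TARGET.
GRID_SIZE = 3
TARGET = (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move one cell in the chosen direction; bumping into the boundary keeps the agent in place."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE:
        return (r, c)
    return state

# A (very poor) policy: pick an action uniformly at random until the target is reached.
state, route = (0, 0), [(0, 0)]
while state != TARGET:
    state = step(state, random.choice(list(ACTIONS)))
    route.append(state)
print("Route found:", route)
```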

Markov decision process
Reinforcement learning uses the Markov decision process (MDP) framework to model the interaction between a learning agent and its environment. The grid-world example above is in fact a simple instance of a Markov decision process: it reflects how the agent interacts with the environment.
The key concepts involved are:
State
A state is the status of the agent with respect to the environment. The set of all states is the state space $ S = \{s_i\} $.
Action
An action is what the agent does at a certain state. The agent reaches a new state after taking an action. Similarly, the set of all actions available at a state is the action space, denoted $ A(s_i) = \{a_i\} $.
Policy
The policy, denoted $ \pi $, tells the agent which actions to take at each state: it gives the probability of taking each action at a certain state. In mathematical form, a policy can be displayed as a table; in programming, it can be represented by an array or a matrix.
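For instance, a tabular policy for a small grid world could be stored as a matrix of action probabilities, one row per state. This is only a sketch with made-up numbers; the state indexing and the action order are assumptions.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

# policy[s, a] = probability of taking action a at state s (9 states, 4 actions).
policy = np.full((9, 4), 0.25)        # start from a uniform policy
policy[0] = [0.0, 0.5, 0.0, 0.5]      # e.g. at state 0, only move down or right
assert np.allclose(policy.sum(axis=1), 1.0)   # every row must sum to 1

def sample_action(state_index, rng=np.random.default_rng(0)):
    """Draw an action at the given state according to the policy's probabilities."""
    return rng.choice(ACTIONS, p=policy[state_index])

print(sample_action(0))   # 'down' or 'right'
```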

Reward
The reward guides the agent toward our target. The agent seeks more accumulated reward, so positive rewards encourage the corresponding behaviour while negative rewards penalize it.
The reward depends on the current state and action, not on the next state.
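Below is a sketch of such a reward function for the grid world, where the signal depends only on the current state and the chosen action (in this deterministic grid, the cell an action leads to is itself determined by the state-action pair). The specific values (+1 for entering the target, -1 for hitting the boundary) are illustrative assumptions.

```python
def reward(state, action, grid_size=3, target=(2, 2)):
    """r(s, a): +1 if the action enters the target, -1 if it bumps into the
    boundary, 0 otherwise (illustrative values)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < grid_size and 0 <= c < grid_size):
        return -1        # penalized: the move is blocked by the boundary
    if (r, c) == target:
        return +1        # encouraged: the move enters the target
    return 0             # neutral otherwise

print(reward((2, 1), "right"))   # 1, entering the target at (2, 2)
print(reward((0, 0), "up"))      # -1, bumping into the boundary
```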
Probability Distribution
Two probability distributions are involved:
- State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' \mid s, a)$.
- Reward probability: at state $s$, taking action $a$, the probability of obtaining reward $r$ is $p(r \mid s, a)$.
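For finite state, action, and reward sets, both distributions can be stored as lookup tables. The sketch below uses nested dictionaries with made-up states and numbers, purely to show the shape of the data.

```python
# p_transition[s][a][s'] = p(s' | s, a);  p_reward[s][a][r] = p(r | s, a)
p_transition = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8}},
    "s2": {"a1": {"s2": 1.0}},
}
p_reward = {
    "s1": {"a1": {0: 0.9, 1: 0.1}},
    "s2": {"a1": {0: 1.0}},
}

# Sanity check: every conditional distribution must sum to 1.
for table in (p_transition, p_reward):
    for s, per_action in table.items():
        for a, dist in per_action.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9
```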
Markov Property
Memoryless property: the transition to the next state depends only on the current state and action, not on the earlier states and actions.
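In the $p(\cdot \mid s, a)$ notation above, a standard statement of this property is:

$$ p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t), \qquad p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t). $$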
Other concepts
Trajectory and Return
A trajectory is a state-action-reward chain, e.g.,

$$ s_1 \xrightarrow{a_1,\ r_1} s_2 \xrightarrow{a_2,\ r_2} s_3 \xrightarrow{a_3,\ r_3} \cdots $$
The return of a trajectory is the sum of all the rewards collected along it. The return can be infinite if the trajectory never terminates, e.g., if the agent keeps transiting from the target back to the target and accumulating rewards forever.
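As a small sketch, the return of a finite trajectory can be computed by summing its rewards; the trajectory below is made up for illustration.

```python
# A trajectory stored as a list of (state, action, reward) steps (illustrative).
trajectory = [
    ("s1", "a1", 0),
    ("s2", "a2", 0),
    ("s3", "a3", 1),   # the last step enters the target and earns +1
]
undiscounted_return = sum(r for _, _, r in trajectory)
print(undiscounted_return)   # 1
```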
Discount Rate
The discount rate $\gamma \in [0, 1)$ turns the return into the discounted return $r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$, where $r_t$ are the rewards collected along the trajectory.
Its roles: 1) the sum of the rewards becomes finite instead of infinite; 2) it balances the near-future and far-future rewards:
If $\gamma$ is close to 0, the discounted return is dominated by the rewards obtained in the near future.
If $\gamma$ is close to 1, the discounted return is dominated by the rewards obtained in the far future.
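A short sketch of the discounted return: for an endless stream of +1 rewards (e.g. the agent keeps collecting reward at the target), the discounted sum converges to $1/(1-\gamma)$ instead of diverging. The reward stream and the $\gamma$ values below are illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the given reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1] * 1000                        # a long stream of +1 rewards
print(discounted_return(rewards, 0.9))      # ~10  = 1 / (1 - 0.9): far-future rewards still matter
print(discounted_return(rewards, 0.1))      # ~1.1 = 1 / (1 - 0.1): dominated by the near future
```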
Episode
When interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial).
An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks; tasks whose interaction never terminates are called continuing tasks.
We can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks:
Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $r = 0$.
Option 2: Treat the target state as a normal state covered by the policy. The agent can still leave the target state and gains $r = +1$ each time it enters the target state.
This tutorial course considers option 2, so the target state does not need to be distinguished from the others and can be treated as a normal state.
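To make the two options concrete, here is a sketch on a tiny one-dimensional chain of states 0..3 with target state 3; the layout, the actions (-1/+1), and the helper names are assumptions for illustration only.

```python
def step_option1(state, action, target=3):
    """Option 1: the target is absorbing - once reached, the agent never
    leaves and all subsequent rewards are 0."""
    if state == target:
        return state, 0
    nxt = min(max(state + action, 0), 3)
    return nxt, (1 if nxt == target else 0)

def step_option2(state, action, target=3):
    """Option 2: the target is a normal state - the agent may leave it and
    earns +1 every time it enters the target."""
    nxt = min(max(state + action, 0), 3)
    return nxt, (1 if nxt == target else 0)

print(step_option1(3, +1))   # (3, 0): stuck at the target, no more reward
print(step_option2(3, -1))   # (2, 0): free to leave the target
print(step_option2(2, +1))   # (3, 1): re-entering the target earns +1 again
```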