
Chapter 1 Basic Concepts of Reinforcement Learning

Reinforcement Learning (RL) can be illustrated by the grid-world example.

We place an agent in an environment; the goal of the agent is to find a good route to the target. Every cell/grid the agent occupies can be seen as a state. The agent can take one action at each state according to a certain policy. The goal of RL is to find a good policy that guides the agent to take a sequence of actions, travelling from the start, moving from one state to another, and finally reaching the target.

[figure: grid world]

Markov decision process

Reinforcement learning uses the MDP framework to model the interaction between a learning agent and its environment. The grid-world example above is a simple instance of a Markov decision process: it reflects how the agent interacts with the environment.

The key concepts are:

State

State is the status of the agent with respect to the environment. Its set is the state space $ S = \{ s_i \} $.

Action

Action is what the agent does at a certain state. The agent reaches a new state after taking an action. Similarly, the set of actions available at a state is the action space, denoted $ A(s_i) = \{ a_i \} $.

Policy

The policy, denoted $ \pi $, tells the agent which action to take at each state: it gives the probability of each action being taken at a certain state. In mathematical form, we can use a tabular representation to display a policy; in programming, a policy can be represented by an array or matrix (see the sketch below the figure).

[figure: tabular representation]
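
As a minimal illustrative sketch (not from the course itself), a tabular policy for a small grid world can be stored as a matrix whose rows are states and columns are actions. The 3x3 grid, the five-action set, and the probabilities below are assumptions made for the example.

```python
import numpy as np

# Assumed setup: a 3x3 grid world with states s1..s9 and five actions per state.
states = [f"s{i}" for i in range(1, 10)]            # state space S
actions = ["up", "down", "left", "right", "stay"]   # action space A(s)

# pi[i, j] = probability of taking action j in state i; each row sums to 1.
pi = np.zeros((len(states), len(actions)))
pi[:, actions.index("right")] = 0.5   # e.g. move right with probability 0.5 ...
pi[:, actions.index("down")]  = 0.5   # ... and move down with probability 0.5

rng = np.random.default_rng(0)

def sample_action(state_index: int) -> str:
    """Draw one action at the given state according to the tabular policy."""
    return rng.choice(actions, p=pi[state_index])

print(sample_action(0))  # e.g. "down" or "right"
```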

Reward

Reward guides the agent toward our target. The agent wants more reward, i.e., it maximizes the accumulated reward; since rewards can be positive (encouragement) or negative (punishment), maximizing reward also means minimizing the punishment received along the way.

Reward depends on the current state and action, not on the next state.

Probability Distribution

Two probability distributions are involved (see the sketch after this list):

  • State transition probability: at state $ s $, taking action $ a $, the probability of transitioning to state $ s' $ is $ p(s'|s,a) $.
  • Reward probability: at state $ s $, taking action $ a $, the probability of obtaining reward $ r $ is $ p(r|s,a) $.
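
A minimal sketch of how these two distributions could be stored in code, assuming a deterministic grid world; the state names, action names, and reward values are illustrative assumptions, not a specific example from the course.

```python
# Hypothetical distributions for a couple of state-action pairs.
# State transition probability p(s'|s, a): which state the agent lands in.
p_next_state = {
    ("s1", "right"): {"s2": 1.0},   # moving right from s1 always reaches s2
}

# Reward probability p(r|s, a): which reward the agent receives.
p_reward = {
    ("s1", "right"): {0.0: 1.0},    # this move yields reward 0 with certainty
    ("s5", "stay"):  {1.0: 1.0},    # staying at the target yields reward +1
}

# Both are conditional distributions, so the values for each (s, a) sum to 1.
assert sum(p_next_state[("s1", "right")].values()) == 1.0
assert sum(p_reward[("s1", "right")].values()) == 1.0
```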

Markov Property

Memoryless property: the next state and reward depend only on the current state and action, not on the earlier states and actions.
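
Written as equations, with the same conditional-probability notation as above, the Markov property reads:

$$ p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t), \quad p(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = p(r_{t+1} \mid s_t, a_t) $$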

Other concepts

Trajectory and Return

A trajectory is a state-action-reward chain:

[figure: state-action-reward chain]

The return of this trajectory is the sum of all the rewards collected along it. For the finite trajectory $ s_1 \to s_2 \to s_3 \to s_5 $:

$$ \text{return} = 0 + 0 + 0 + 1 = 1 $$

If the trajectory is infinite, e.g. the agent keeps re-entering the target, $ s_1 \to s_2 \to s_3 \to s_5 \to s_5 \to s_5 \to \dots $, the return diverges, which is one motivation for the discount rate introduced next.
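
A tiny sketch of this computation (the reward values are the ones from the example above):

```python
# Rewards collected along the finite trajectory s1 -> s2 -> s3 -> s5.
rewards = [0, 0, 0, 1]
print(sum(rewards))  # return = 1
```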

Discount Rate

The discount rate is $ \gamma \in [0, 1) $.
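
With this discount rate, the discounted return of a trajectory with rewards $ r_1, r_2, r_3, \dots $ weights later rewards by increasing powers of $ \gamma $:

$$ G = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} $$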

The discount rate plays two roles: 1) it makes the discounted return finite instead of infinite; 2) it balances far- and near-future rewards (see the numeric sketch after this list):

  • If $ \gamma $ is close to 0, the value of the discounted return is dominated by the rewards obtained in the near future.

  • If $ \gamma $ is close to 1, the value of the discounted return is dominated by the rewards obtained in the far future.
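
The numeric sketch below (illustrative only; the truncation horizon is an assumption) applies the discounted return to the infinite trajectory above, whose reward sequence is 0, 0, 0, 1, 1, 1, ...

```python
def discounted_return(rewards, gamma):
    """Discounted return of a (truncated) reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Truncate the infinite reward stream 0, 0, 0, 1, 1, 1, ... after 1000 steps;
# the dropped tail contributes almost nothing once gamma < 1.
rewards = [0, 0, 0] + [1] * 997

print(discounted_return(rewards, gamma=0.1))  # ~0.0011: near-future rewards dominate
print(discounted_return(rewards, gamma=0.9))  # ~7.29:   far-future rewards still matter
# With gamma = 1 the sum would grow without bound as the trajectory lengthens.
```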

Episode

When interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode (or a trial), e.g. $ s_1 \to s_2 \to s_3 \to s_5 $.

An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.

We can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks:

  • Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $ r = 0 $.

  • Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains $ r = +1 $ each time it enters the target state.

This tutorial course adopts Option 2, so the target state is not distinguished from the others and can be treated as a normal state (a small sketch of both options follows).
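
As an illustrative sketch of the two options, reusing the hypothetical dictionary format from the probability-distribution sketch above (the action names and the target state s5 are assumptions):

```python
actions = ["up", "down", "left", "right", "stay"]

# Option 1: s5 becomes an absorbing state -- every action keeps the agent
# in s5 and every subsequent reward is 0.
p_next_state_opt1 = {("s5", a): {"s5": 1.0} for a in actions}
p_reward_opt1     = {("s5", a): {0.0: 1.0} for a in actions}

# Option 2 (used in this course): s5 remains a normal state; the policy may
# still leave it, and each re-entry into s5 yields r = +1 again.
p_reward_opt2 = {("s5", "stay"): {1.0: 1.0},   # staying re-enters the target: r = +1
                 ("s5", "up"):   {0.0: 1.0}}   # leaving the target: ordinary reward
```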
