
Chapter 5 Monte Carlo Learning

In this chapter, we introduce a model-free approach to deriving the optimal policy.

Here, model-free means that we do not rely on a mathematical model of the environment to obtain state values or action values. For example, in policy evaluation we solve the Bellman equation to obtain state values, which is a model-based procedure. In the model-free setting we no longer use that equation; instead, we rely on mean estimation.

The following example may illustrate this better.

Example: flipping a coin

The result (either heads or tails) is denoted by a random variable $X$.

  • if the result is heads, then $X = +1$.

  • if the result is tails, then $X = -1$.

The aim is to compute $\mathbb{E}[X]$.

The model-based approach is to calculate the expectation directly from the definition:

$$\mathbb{E}[X] = \sum_{x} p(x)\,x = 0.5 \times 1 + 0.5 \times (-1) = 0$$

Problem: it may be impossible to know the precise distribution!!

The model-free approach is based on sampling.

We flip the coin many times, and then calculate the average of the outcomes.

Suppose we get a sample sequence $\{x_1, x_2, \dots, x_N\}$. Then the mean can be approximated as:

$$\mathbb{E}[X] \simeq \bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j$$

This is the idea of Monte Carlo estimation. By the Law of Large Numbers, the estimate becomes accurate as the number of samples grows.
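As a minimal sketch (plain Python; the fair-coin probability below is only used to generate samples, the estimator itself never sees it):

```python
import random

def monte_carlo_mean(num_samples: int, p_head: float = 0.5) -> float:
    """Estimate E[X] for the coin-flip example by averaging samples."""
    total = 0.0
    for _ in range(num_samples):
        x = 1 if random.random() < p_head else -1  # one sample of X
        total += x
    return total / num_samples  # sample mean approximates E[X]

print(monte_carlo_mean(100))      # rough estimate
print(monte_carlo_mean(100_000))  # much closer to the true value 0
```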

Monte Carlo (MC) Basic Algorithm

The policy iteration involves two processes, namely policy evaluation and policy improvement.

As mentioned before, the difference between the model-based and model-free settings lies in policy evaluation, where how to obtain the action values is of paramount importance.

Figure: model-based and model-free approaches for calculating the action value

Here we use the second expression, i.e., we estimate the expectation of the return of every state-action pair and take it as the action value.
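In symbols, this is the usual Monte Carlo estimate of an action value (the superscript $(i)$ indexing episodes is my own notation, not from the figure):

$$q_{\pi_k}(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N} \sum_{i=1}^{N} g^{(i)}(s, a),$$

where $g^{(i)}(s, a)$ is the return of the $i$th episode that starts from $(s, a)$ and then follows $\pi_k$.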

Here is a formal statement and the corresponding pseudocode.

Given an initial policy $\pi_0$, there are two steps at the $k$th iteration.

  1. Step 1 Policy Evaluation: This step is to obtain $q_{\pi_k}(s, a)$ for all $(s, a)$. Specifically, for each state-action pair $(s, a)$, run an infinite number of (or sufficiently many) episodes. The average of their returns is used to approximate $q_{\pi_k}(s, a)$.

  2. Step 2 Policy Improvement: As in policy iteration, select in every state the action with the largest estimated action value, i.e., $\pi_{k+1}(s) = \arg\max_{a} q_{\pi_k}(s, a)$.

This is exactly the same as the policy iteration algorithm, except that we estimate $q_{\pi_k}(s, a)$ directly instead of solving for $v_{\pi_k}(s)$.

Figure: MC Basic (a model-free variant of policy iteration)
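The following is a minimal Python sketch of MC Basic, assuming a hypothetical environment object with `num_states`, `num_actions`, and a `step(state, action) -> (next_state, reward)` method (these names are illustrative, not from the text):

```python
import numpy as np

def mc_basic(env, gamma=0.9, num_episodes=30, episode_len=100, num_iters=20):
    """Policy iteration where policy evaluation is done by Monte Carlo averaging."""
    n_s, n_a = env.num_states, env.num_actions
    policy = np.zeros(n_s, dtype=int)               # arbitrary initial deterministic policy

    for _ in range(num_iters):
        # Policy evaluation: estimate q_{pi_k}(s, a) for every (s, a).
        q = np.zeros((n_s, n_a))
        for s in range(n_s):
            for a in range(n_a):
                returns = []
                for _ in range(num_episodes):       # "sufficiently many" episodes from (s, a)
                    g, discount = 0.0, 1.0
                    state, action = s, a
                    for _ in range(episode_len):    # truncated episode as a finite approximation
                        next_state, reward = env.step(state, action)
                        g += discount * reward
                        discount *= gamma
                        state, action = next_state, policy[next_state]
                    returns.append(g)
                q[s, a] = np.mean(returns)
        # Policy improvement: greedy with respect to the estimated action values.
        policy = np.argmax(q, axis=1)
    return policy, q
```

The nested loops make explicit why MC Basic is expensive: every state-action pair needs its own batch of episodes at every iteration.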

Q: Why does MC Basic estimate action values instead of state values?

  • Because state values cannot be used to improve policies directly: turning $v_\pi$ into a better policy requires the model. When the model is not available, we should estimate action values directly.

  • Since policy iteration is convergent, MC Basic is also guaranteed to converge, given sufficiently many episodes for each estimate.

Monte Carlo (MC) Exploring Starts Algorithm

In the MC Basic algorithm, we have to start from every state-action pair and collect many samples for each estimate, which is inefficient. Note that an episode also visits many other state-action pairs, such as $(s_2, a_4)$, $(s_2, a_3)$, and $(s_5, a_1)$. These visits can also be used to estimate the corresponding action values. In particular, we can decompose the episode into multiple subepisodes:

Figure: subepisodes

Compared with MC Basic, MC Exploring Starts makes fuller use of the data. The method works as follows:

Given an episode, it also considers the other state-action pairs visited along the way. Each visit can be regarded as the start of a subepisode and used to estimate the corresponding action value.

For the data that appears in one episode, there are two ways to use it (see the sketch after this list):

  • first-visit: only the first occurrence of a state-action pair in the episode is used in the average.

  • every-visit: every occurrence of a state-action pair is used in the average, no matter how many times it appears.
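A small sketch of the two strategies (plain Python; the `(state, action, reward)` episode format and the helper name are assumptions for illustration):

```python
def collect_returns(episode, gamma=0.9, first_visit=True):
    """Map each (state, action) pair in one episode to its sampled returns.

    `episode` is a list of (state, action, reward) tuples in time order.
    first_visit=True keeps only the first occurrence of each pair;
    first_visit=False (every-visit) keeps all occurrences.
    """
    # Return following each time step, computed backwards.
    g, returns_at = 0.0, [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        g = episode[t][2] + gamma * g
        returns_at[t] = g

    samples, seen = {}, set()
    for t, (s, a, _) in enumerate(episode):
        if first_visit and (s, a) in seen:
            continue                      # skip repeated visits in first-visit mode
        seen.add((s, a))
        samples.setdefault((s, a), []).append(returns_at[t])
    return samples

episode = [(0, 1, 0.0), (2, 0, 1.0), (0, 1, 0.0), (3, 2, 5.0)]  # toy data
print(collect_returns(episode, first_visit=False)[(0, 1)])       # two samples for (0, 1)
```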

As for when to update the policy, there are also two methods:

  • The first method is, in the policy evaluation step, to collect all the episodes starting from a state-action pair and then use the average return to approximate the action value.

This is the one adopted by the MC Basic algorithm.

The problem with this method is that the agent has to wait until all episodes have been collected.

  • The second method uses the return of a single episode to approximate the action value. That means we improve the policy right after each episode rather than waiting for all episodes. The estimate is less accurate at first, but we must make sure that all state-action pairs are still considered (by starting episodes from every state-action pair).

In this way, we can improve the policy episode-by-episode.

In fact, this strategy falls into the scope of generalized policy iteration introduced in the last chapter. That is, we can still update the policy even if the value estimate is not sufficiently accurate.

Figure: MC Exploring Starts (an efficient variant of MC Basic)
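A sketch of one episode of MC Exploring Starts under these choices (first-visit, policy improved right after the episode). The environment interface and the running sum/count tables are assumptions for illustration:

```python
import random
import numpy as np

def exploring_starts_episode(env, policy, q, ret_sum, ret_cnt, gamma=0.9, episode_len=100):
    """Generate one episode from a random (state, action) start, then update q and policy."""
    # Exploring start: every state-action pair has a chance to be the starting point.
    s = random.randrange(env.num_states)
    a = random.randrange(env.num_actions)

    # Generate the episode following the current deterministic policy.
    episode = []
    for _ in range(episode_len):
        s_next, r = env.step(s, a)
        episode.append((s, a, r))
        s, a = s_next, policy[s_next]

    # First occurrence time of each pair, used for the first-visit rule.
    first_time = {}
    for t, (s_t, a_t, _) in enumerate(episode):
        first_time.setdefault((s_t, a_t), t)

    # Walk backwards, accumulating the return, and update estimates episode-by-episode.
    g = 0.0
    for t in reversed(range(len(episode))):
        s_t, a_t, r_t = episode[t]
        g = r_t + gamma * g
        if first_time[(s_t, a_t)] == t:                 # first-visit update only
            ret_sum[s_t, a_t] += g
            ret_cnt[s_t, a_t] += 1
            q[s_t, a_t] = ret_sum[s_t, a_t] / ret_cnt[s_t, a_t]
            policy[s_t] = int(np.argmax(q[s_t]))        # improve the policy immediately
```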

Q: If a state-action pair does not appear in the first episode, how can its average action value be calculated?

Monte Carlo (MC) $\varepsilon$-greedy Algorithm

A policy is called soft if the probability of taking every action in every state is positive.

With a soft policy, a few episodes that are sufficiently long can visit every state-action pair sufficiently many times. We then no longer need a large number of episodes starting from every state-action pair, so the requirement of exploring starts (starting from every state-action pair) can be removed.

The difference between exploring starts and $\varepsilon$-greedy is only that the policy changes from deterministic to stochastic. That is, to integrate $\varepsilon$-greedy policies into MC learning, we only need to change the policy improvement step from greedy to $\varepsilon$-greedy.

Figure: $\varepsilon$-greedy policy

Figure: $\varepsilon$-greedy algorithm
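A minimal sketch of that modified improvement step (the function name is mine; it implements the standard $\varepsilon$-greedy form where every action keeps probability $\varepsilon/|\mathcal{A}|$ and the greedy action receives the remaining mass):

```python
import numpy as np

def epsilon_greedy_improvement(q_s: np.ndarray, epsilon: float) -> np.ndarray:
    """Return pi(.|s): an epsilon-greedy distribution built from q(s, .) for one state."""
    n_a = len(q_s)
    probs = np.full(n_a, epsilon / n_a)       # every action keeps probability eps/|A|
    probs[np.argmax(q_s)] += 1.0 - epsilon    # the greedy action gets the remaining mass
    return probs

# With epsilon = 0.2 and 5 actions, the greedy action gets 1 - 0.2 + 0.2/5 = 0.84
# and every other action gets 0.04.
print(epsilon_greedy_improvement(np.array([0.1, 0.5, 0.2, 0.0, 0.3]), epsilon=0.2))
```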

It does not require exploring starts, but it still requires every state-action pair to be visited, just in a different way.

There are two parameters, namely $\varepsilon$ and the length of each episode. Both affect the algorithm's performance.

We can see this from the examples below:

Figure: examples

The advantage of $\varepsilon$-greedy is its stronger exploration ability, which is why the exploring-starts condition is not required. However, because of its stochastic nature it cannot guarantee policy optimality in general. What we can hope for is that, if in every state we take the action with the largest probability under the obtained $\varepsilon$-greedy policy, the resulting deterministic policy is the same as the optimal one. In that case we say the $\varepsilon$-greedy policy is consistent with the optimal policy.

Figure: iteration

Figure: consistency

We can see that when $\varepsilon$ is large, the obtained policy is not consistent with the optimal policy.

So in practice, $\varepsilon$ should not be set too large. Once we obtain a consistent $\varepsilon$-greedy policy, we can convert it into a deterministic (fully greedy) one.
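That final step is trivial in code, assuming the learned action values are stored in a `num_states × num_actions` array:

```python
import numpy as np

def to_greedy(q: np.ndarray) -> np.ndarray:
    """Extract the deterministic greedy policy from the learned action values."""
    return np.argmax(q, axis=1)   # best action index for every state
```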
