TD learning refers to a wide range of algorithms.
The TD algorithm can solve the Bellman equation of a given policy π without a model.
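As a rough illustration (not the exact algorithm from the book), below is a minimal TD(0) sketch for estimating $v_\pi$ from sampled transitions. The environment interface (`reset`, `step`, `num_states`) and the `policy` function are assumptions made just for this sketch.

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Estimate v_pi with TD(0): v(s) <- v(s) + alpha * (r + gamma * v(s') - v(s)).

    Assumes an env with reset()/step() and a policy mapping state -> action.
    """
    v = np.zeros(env.num_states)           # value estimate for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                   # sample an action from the given policy
            s_next, r, done = env.step(a)   # interact with the environment (no model needed)
            # the TD target uses the sampled next state instead of the full model
            td_target = r + gamma * v[s_next]
            v[s] += alpha * (td_target - v[s])
            s = s_next
    return v
```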
Stochastic Approximation (SA) refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems. Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in the sense that it does not require knowing the expression of the objective function or its derivative.
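As a small example of the idea, an SA-style (Robbins-Monro) iteration can estimate the mean of a random variable using only samples of it; the sampler `sample_w` below is an assumed placeholder.

```python
import numpy as np

def sa_mean_estimate(sample_w, num_iters=10000):
    """Robbins-Monro style iteration: w_{k+1} = w_k - a_k * (w_k - x_k).

    Only samples x_k of the random variable are needed, not its distribution.
    """
    w = 0.0
    for k in range(1, num_iters + 1):
        x_k = sample_w()           # draw one sample
        a_k = 1.0 / k              # diminishing step size satisfying the SA conditions
        w = w - a_k * (w - x_k)    # move the estimate toward the new sample
    return w

# Example usage: estimate the mean of a Gaussian without using its parameters directly.
rng = np.random.default_rng(0)
print(sa_mean_estimate(lambda: rng.normal(loc=3.0, scale=1.0)))
```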
In this chapter we will introduce a model-free approach for deriving the optimal policy.
Here, model-free means that we do not rely on a mathematical model of the environment to obtain state values or action values. For example, in policy evaluation we used the Bellman equation to obtain state values, which is model-based. In the model-free setting we no longer use that equation; instead, we rely on mean estimation methods.
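As a hedged sketch of what "mean estimation" means here, the snippet below averages sampled returns to estimate action values in a Monte Carlo fashion; the `episodes` data format is an assumption for illustration.

```python
def mc_action_value(episodes, gamma=0.9):
    """Monte Carlo mean estimation of q_pi(s, a).

    `episodes` is assumed to be a list of trajectories, each a list of
    (state, action, reward) tuples generated by following the policy pi.
    """
    returns_sum, returns_cnt = {}, {}
    for episode in episodes:
        g = 0.0
        # iterate backwards so g accumulates the discounted return from each step onward
        for s, a, r in reversed(episode):
            g = r + gamma * g
            returns_sum[(s, a)] = returns_sum.get((s, a), 0.0) + g
            returns_cnt[(s, a)] = returns_cnt.get((s, a), 0) + 1
    # the action value estimate is simply the sample mean of the observed returns
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}
```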
In the last chapter, we studied the Bellman Optimality Equation (BOE). In this chapter we will introduce three model-based, iterative algorithms for solving the BOE to derive the optimal policy: value iteration, policy iteration, and truncated policy iteration.
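To make the idea concrete, here is a minimal value iteration sketch; the model arrays `P` and `R` and their shapes are assumptions made for this illustration, not the book's code.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Solve the BOE by value iteration.

    Assumed model representation:
      P[a, s, s'] : probability of moving to s' when taking action a in state s
      R[a, s]     : expected immediate reward for taking action a in state s
    """
    num_states = P.shape[1]
    v = np.zeros(num_states)
    while True:
        # q[a, s] = R[a, s] + gamma * sum_{s'} P[a, s, s'] * v[s']
        q = R + gamma * (P @ v)
        v_new = q.max(axis=0)           # greedy improvement: maximize over actions
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=0)           # the greedy policy w.r.t. the converged values
    return v_new, policy
```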
We know that the ultimate goal of RL is to find the optimal policy. In this chapter we will show how to obtain the optimal policy through the Bellman Optimality Equation.
The state value can be used to evaluate whether a policy is good or not: if

$$v_{\pi_1}(s) \ge v_{\pi_2}(s), \quad \forall s \in \mathcal{S},$$

then policy $\pi_1$ is better than policy $\pi_2$.
In this chapter we will introduce two key concepts and one important formula.
I recommend reading the motivating examples in the tutorial. Here I will skip this part and directly introduce the concepts.
Before delving into the content, we need to review the previous key concepts.
This blog is mainly a notebook for Mathematical Foundations of Reinforcement Learning by Shiyu Zhao from WindyLab, Westlake University.
You can find more about the book and related tutorial videos at this link.
Reinforcement Learning (RL) can be illustrated with the grid world example.
We place an agent in an environment; the goal of the agent is to find a good route to the target. Every cell/grid the agent can occupy is a state. The agent takes one action at each state according to a certain policy. The goal of RL is to find a good policy that guides the agent to take a sequence of actions, traveling from the start, moving from one state to another, and finally reaching the target.
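To make the terms above concrete, here is a toy grid world sketch; the grid size, reward values, and target cell are made up for illustration and are not taken from the book.

```python
class GridWorld:
    """A toy grid world: each cell is a state and the agent moves with 4 actions.

    The 3x3 layout, rewards, and target cell here are illustrative choices only.
    """
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=3, target=(2, 2)):
        self.size = size
        self.target = target
        self.num_states = size * size

    def step(self, state, action):
        """Return (next_state, reward). Hitting the boundary keeps the agent in place."""
        r, c = divmod(state, self.size)
        dr, dc = self.ACTIONS[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < self.size and 0 <= nc < self.size):
            nr, nc = r, c                       # bounce back from the boundary
        next_state = nr * self.size + nc
        reward = 1.0 if (nr, nc) == self.target else 0.0
        return next_state, reward

# A deliberately naive policy: move right until the last column, then move down.
def policy(state, size=3):
    r, c = divmod(state, size)
    return 3 if c < size - 1 else 1  # 3 = right, 1 = down
```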