
An introduction to Reinforcement Learning




The homepage for this blog series and related posts can be found here.

From Thomas Simonini's Deep Reinforcement Learning Course Part 1: An introduction to Reinforcement Learning.


The Reinforcement Learning Process

This RL loop outputs a sequence of state, action and reward.

The goal of the agent is to maximize the expected cumulative reward.

This is the central idea of the reward hypothesis: all goals can be described by the maximization of the expected cumulative reward.

\[G_t = R_{t+1} + R_{t+2} + \dots\]

Equivalently:

\[G_t = \sum_{k=0}^T R_{t+k+1}\]

Because rewards that arrive later count for less, we introduce a discount factor γ with a value between 0 and 1:

The discounted cumulative reward is then:

\[G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}, \quad \text{where } \gamma \in [0,1)\]
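To make the formula concrete, here is a minimal sketch (my own illustration, not code from the course) that computes the discounted return from a finite list of observed rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite list of rewards."""
    g = 0.0
    # Iterate backwards so each step adds its reward and discounts everything after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: rewards collected after time t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```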

Episodic or Continuing tasks

Episodic task

We have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and New States.

Continuing tasks

These are tasks that continue forever (there is no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

The agent keeps running until we decide to stop it.

Monte Carlo vs TD Learning methods

Monte Carlo

When the episode ends (the agent reaches a “terminal state”), the agent looks at the total cumulative reward to see how well it did. In the Monte Carlo approach, rewards are only received at the end of the game.

Then, we start a new game with the added knowledge. The agent makes better decisions with each iteration.

[Figure: Monte Carlo update]

[Figure: mouse-and-cat maze example]

By running more and more episodes, the agent will learn to play better and better.
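As a rough sketch of the idea (the tabular setting and variable names are my own, not from the original article), a Monte Carlo update pushes the value of each visited state toward the return that actually followed it:

```python
from collections import defaultdict

def monte_carlo_update(V, episode, alpha=0.1, gamma=0.99):
    """episode: list of (state, reward) pairs collected until the terminal state,
    where reward is the reward received after leaving that state."""
    g = 0.0
    # Work backwards so g holds the return G_t that followed each state.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        # Move V(S_t) a small step toward the observed return G_t.
        V[state] += alpha * (g - V[state])
    return V

V = defaultdict(float)
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]  # reward of 1 at the end of the game
monte_carlo_update(V, episode)
```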

Temporal Difference Learning: learning at each time step

TD Learning, on the other hand, will not wait until the end of the episode to update its estimate of the maximum expected future reward: it updates its value estimate V for the non-terminal state S_t encountered at that step.

This method is called TD(0) or one step TD (update the value function after any individual step).

[Figure: Monte Carlo and TD learning update equations]

TD methods only wait until the next time step to update the value estimates. At time t+1 they immediately form a TD target using the observed reward R_{t+1} and the current estimate V(S_{t+1}).

The TD target is an estimate: in fact, you update the previous estimate V(S_t) towards a one-step target.
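A minimal sketch of the TD(0) update (my own illustrative code, assuming a tabular value function stored in a dict):

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One-step TD(0) update: move V(S_t) toward the target R_{t+1} + gamma * V(S_{t+1})."""
    v_next = 0.0 if done else V.get(next_state, 0.0)  # terminal states have value 0
    td_target = reward + gamma * v_next
    td_error = td_target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V

V = {}
td0_update(V, state="s0", reward=0.0, next_state="s1", done=False)
```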

Exploration/Exploitation trade-off

Exploration is finding more information about the environment, while exploitation is using the information we already have to maximize the reward.
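One common way to balance the two is an ε-greedy rule; the sketch below is my own illustration, not code from the course:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore), otherwise pick the best known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                         # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])        # exploit

action = epsilon_greedy([0.2, 0.5, 0.1])
```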

Three approaches to Reinforcement Learning

Value Based

In value-based RL, the goal is to optimize the value function V(s).

The value function is a function that tells us the maximum expected future reward the agent will get at each state.

[Figure: value function equation]
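One standard way to write the value of a state s under a policy π (my reconstruction, not a copy of the original figure) is:

\[v_\pi(s) = E_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]\]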

Policy Based

In policy-based RL, we want to directly optimize the policy function π(s) without using a value function.

[Figure: policy-based approach]

We have two types of policy: deterministic (for a given state, the policy always returns the same action) and stochastic (the policy outputs a probability distribution over actions):

[Figure: deterministic and stochastic policy equations]
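In standard notation (again my reconstruction rather than the original figure), the two forms are:

\[a = \pi(s) \quad \text{(deterministic)}\]

\[\pi(a \mid s) = P[A_t = a \mid S_t = s] \quad \text{(stochastic)}\]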

Model Based

In model-based RL, we model the environment. This means we create a model of the behavior of the environment.

The problem is that each environment needs a different model representation.
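As a toy illustration (my own, not from the course), a tabular model can simply record which next state and reward followed each state-action pair, so the agent can later simulate steps without touching the real environment:

```python
import random
from collections import defaultdict

class TabularModel:
    """Records observed transitions of the environment for later planning."""
    def __init__(self):
        self.transitions = defaultdict(list)  # (state, action) -> [(next_state, reward), ...]

    def update(self, state, action, next_state, reward):
        self.transitions[(state, action)].append((next_state, reward))

    def sample(self, state, action):
        """Simulate one step by replaying a previously observed outcome."""
        return random.choice(self.transitions[(state, action)])
```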
