
An introduction to Policy Gradients with Cartpole and Doom



The home page for this blog series and related posts can be found here.

From Thomas Simonini's Deep Reinforcement Learning Course, Part 4: An introduction to Policy Gradients with Cartpole and Doom.


In value-based learning, a policy exists only because of the action-value estimates: we act by picking the action with the highest estimated value.

In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we learn the policy function that maps states to actions directly, selecting actions without consulting a value function.

Why use Policy-Based methods?

Two types of policy

A policy can be either deterministic or stochastic.

A deterministic policy is a policy that maps each state to a single action: you give it a state and the function returns the action to take.

[Figure: a deterministic policy]

A stochastic policy, on the other hand, outputs a probability distribution over actions.

[Figure: a stochastic policy]

Most of the time we’ll use this second type of policy.
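To make the distinction concrete, here is a small sketch (not from the course code; the toy linear scorer and random weights are made up for illustration) showing both kinds of policy over a 4-dimensional state and 2 actions, as in CartPole:

import numpy as np

rng = np.random.default_rng(0)

def action_scores(state, weights):
    # Toy linear scorer: one score per action (illustrative only).
    return state @ weights

def deterministic_policy(state, weights):
    # Always returns the single action with the highest score.
    return int(np.argmax(action_scores(state, weights)))

def stochastic_policy(state, weights):
    # Turns the scores into a probability distribution (softmax) and samples from it.
    scores = action_scores(state, weights)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

state = np.array([0.1, -0.2, 0.05, 0.3])   # e.g. a CartPole observation
weights = rng.normal(size=(4, 2))          # 4 state features, 2 actions

print(deterministic_policy(state, weights))   # the same action on every call
print(stochastic_policy(state, weights))      # a sampled action plus its distribution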

Advantages

The problem with value-based methods is that they can have a big oscillation while training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

With policy gradients, we just follow the gradient to find the best parameters, so we see a smooth update of our policy at each step.

[Figure: smooth convergence of policy gradients vs. oscillation of value-based methods]

The problem with Deep Q-learning is that its predictions assign a score (the maximum expected future reward) to each possible action, at each time step, given the current state.

But what if we have an infinite number of possible actions? A policy-based method sidesteps this by outputting a probability distribution over actions directly, rather than a score for every single action.

Learning a stochastic policy also has two further consequences.

The first is that we don't need to implement an exploration/exploitation trade-off by hand. Because the policy outputs a probability distribution over actions, sampling from it handles the exploration/exploitation trade-off without hard coding it.

The second is that we get rid of the problem of perceptual aliasing. Perceptual aliasing is when two states seem to be (or actually are) the same, but require different actions.

[Figure: a perceptual aliasing example]

Disadvantages

A lot of the time, policy gradient methods converge on a local maximum rather than on the global optimum.

However, we’ll see there are solutions to this problem.

Our policy π has parameters θ and outputs a probability distribution over the actions available in a given state:

π(s, a, θ) = P[take action a in state s]

Policy search is therefore an optimization problem: we must find the parameters θ that maximize a score function J(θ):

θ* = argmax_θ E_π[Σ_t γ^t R_t]

There are two steps: first, measure the quality of a policy π with a policy score function J(θ); second, use policy gradient ascent to find the parameters θ that improve π.

First Step: the Policy Score function J(θ)

There are three ways to define the score function, and they work equally well; the choice depends only on the environment and the objectives you have.

First, in an episodic environment, we can use the start value. Calculate the mean of the return from the first time step (G1). This is the cumulative discounted reward for the entire episode.

J1(θ) = E_π[G1] = E_π[R1 + γR2 + γ²R3 + …] = E_π[V(S1)]

Second, in a continuous environment, we can use the average value, because we can't rely on a specific start state.

J_avgV(θ) = E_π[V(s)] = Σ_s d(s) V(s)

Each state value is now weighted by d(s), the probability of occurrence of the respective state, because some states are visited more often than others.

Third, we can use the average reward per time step. The idea here is that we want to get the most reward per time step.

J_avgR(θ) = E_π[r] = Σ_s d(s) Σ_a π(s, a, θ) R(s, a)
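As a quick illustration of the first option, here is a small sketch (the helper is hypothetical, not part of the course code) that computes G1, the cumulative discounted reward of one episode; averaging this over many sampled episodes gives an estimate of the start value J1(θ):

def start_value_return(rewards, gamma=0.95):
    # G1 = R1 + γ·R2 + γ²·R3 + ...  for one episode's reward sequence.
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

print(start_value_return([1.0, 1.0, 1.0, 0.0, 5.0]))  # cumulative discounted reward of one episode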

Second Step: Policy gradient ascent

The idea of gradient ascent is to take a step in the direction of the steepest increase of the score, and repeat:

θ ← θ + α ∇θ J(θ)

The score function can be written as an expectation over the trajectories τ produced by the policy:

J(θ) = E_π[R(τ)]

Expanding it over states and actions makes the difficulty explicit:

J(θ) = Σ_s d(s) Σ_a π(s, a, θ) R(s, a)

where d(s) is the distribution of states visited under the policy. Taking the gradient of J(θ) therefore seems to require differentiating d(s), which depends on θ through the environment's dynamics.

But how do we find the effect of the policy on the state distribution? The dynamics of the environment are unknown.

As a consequence, we face a problem: how do we estimate the gradient of J(θ) with respect to the policy parameters θ, when that gradient depends on the unknown effect of policy changes on the state distribution?

The solution is the likelihood ratio (log-derivative) trick, which is at the heart of the Policy Gradient Theorem. Writing the score as a sum over trajectories and using ∇θ log P = ∇θ P / P:

∇θ J(θ) = ∇θ Σ_τ P(τ; θ) R(τ)
        = Σ_τ ∇θ P(τ; θ) R(τ)
        = Σ_τ P(τ; θ) ∇θ log P(τ; θ) R(τ)
        = E_π[∇θ log P(τ; θ) R(τ)]

The probability of a trajectory factors into environment transition probabilities, which do not depend on θ, and policy probabilities, which do. The transition terms vanish when we take ∇θ log, so only the policy terms remain:

∇θ J(θ) = E_π[∇θ log π(s, a, θ) R(τ)]

This gradient tells us how to shift the policy distribution, through its parameters θ, to achieve a higher score.

R(τ) acts like a scalar score: if R(τ) is high, it means that on average we took actions that lead to high rewards, and we push up the probabilities of the actions we took; if R(τ) is low, we push those probabilities down.
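In practice, we estimate this expectation from sampled actions and let an automatic differentiation framework compute the gradient by minimizing the negated term inside it. A minimal sketch, assuming PyTorch and a tiny softmax policy (all names and values here are illustrative):

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 2), nn.Softmax(dim=-1))  # tiny softmax policy (illustrative)
state = torch.randn(4)         # stand-in observation
episode_return = 10.0          # R(τ): the scalar return of the trajectory this action belonged to

dist = torch.distributions.Categorical(policy(state))
action = dist.sample()

# Minimizing -log π(a|s, θ) · R(τ) with gradient descent performs gradient ascent on J(θ).
loss = -dist.log_prob(action) * episode_return
loss.backward()   # each parameter's .grad now holds -∇θ log π(a|s, θ) R(τ) for this sample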

Monte Carlo Policy Gradients

We use Monte Carlo because our tasks can be divided into episodes.

Initialize θ
for each episode τ = S0, A0, R1, S1, …, ST:
    for t ← 1 to T-1:
        Δθ = α ∇θ log π(St, At, θ) Gt
        θ = θ + Δθ

For each episode:
    At each time step within that episode:
        Compute the log probability of the action chosen by our policy function and multiply it by the score (the return Gt).
        Update the weights.
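Putting the loop together, here is a minimal, self-contained sketch of Monte Carlo policy gradients (REINFORCE) on CartPole. It assumes gymnasium and torch are installed; the network size, learning rate, and episode count are illustrative choices, not the course's exact settings:

import gymnasium as gym
import torch
import torch.nn as nn

GAMMA = 0.95          # discount factor
LEARNING_RATE = 1e-3  # α in the pseudocode

# Stochastic policy network: observation -> probability distribution over the 2 actions.
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=LEARNING_RATE)

env = gym.make("CartPole-v1")

for episode in range(1000):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:                                  # collect one full episode
        obs = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(policy(obs))
        action = dist.sample()                       # sample from π(.|s, θ)
        log_probs.append(dist.log_prob(action))      # log π(St, At, θ)
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Compute Gt (the discounted return from each time step) by scanning rewards backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Minimizing -Σ log π(St, At, θ) Gt is gradient ascent on J(θ).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if episode % 50 == 0:
        print(f"episode {episode}: length = {len(rewards)}")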

Because we only calculate the reward at the end of the episode, we average over all the actions: even if some of the actions taken were very bad, a high episode score will make all of them look equally good.

So to have a correct policy, we need a lot of samples… which results in slow learning.
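A common mitigation, used in many policy gradient implementations, is to compute a discounted return for each time step and normalize these returns within the episode, so that actions associated with below-average returns are actively pushed down instead of merely pushed up less. A small sketch (the helper name is illustrative):

import numpy as np

def discount_and_normalize_rewards(episode_rewards, gamma=0.95):
    # Per-step discounted returns Gt, normalized to zero mean and unit variance.
    discounted = np.zeros(len(episode_rewards))
    running = 0.0
    for t in reversed(range(len(episode_rewards))):
        running = episode_rewards[t] + gamma * running
        discounted[t] = running
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

# Actions taken when the relative return was below average now receive a negative weight.
print(discount_and_normalize_rewards([1.0, 1.0, 1.0, 1.0, 1.0]))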

How to improve our Model?

Code: play Cartpole

Code: play Doom