
An intro to Advantage Actor Critic methods: let’s play Sonic the Hedgehog!



The homepage for this blog series and related posts can be found here.

From Thomas Simonini's Deep Reinforcement Learning Course, Part 5: An intro to Advantage Actor Critic methods: let's play Sonic the Hedgehog!


Since the beginning of this course, we've studied two different reinforcement learning methods: value-based methods (Q-learning, Deep Q-learning) and policy-based methods (REINFORCE with policy gradients).

But both of these methods have big drawbacks.

That's why, today, we'll study a hybrid method: Actor Critic. We'll be using two neural networks: a Critic that measures how good the action taken is (value-based), and an Actor that controls how our agent behaves (policy-based).

The quest for a better learning model

The problem with Policy Gradients

Because the Policy Gradient method is Monte Carlo (we wait until the end of the episode to calculate the reward), we need a lot of samples to reach an optimal policy. This produces slow learning, because it takes a lot of time to converge.

Introducing Actor Critic

Instead of waiting until the end of the episode as we do in Monte Carlo REINFORCE, we make an update at each step (TD Learning).

Δθ = α ∇_θ(log π_θ(s, a)) R(t)   (Monte Carlo REINFORCE update)

Because we do an update at each time step, we can't use the total rewards R(t). Instead, we need to train a Critic model that approximates the value function (remember that the value function calculates the expected future reward given a state and an action). This value function replaces the reward term in the policy gradient, which is otherwise only computed at the end of the episode.

How Actor Critic works

You try some action randomly. The Critic observes your action and provides feedback.

Learning from this feedback, you’ll update your policy and be better at playing that game.

Actor: a policy function π(s, a, θ) that controls how our agent acts.
Critic: a value function q̂(s, a, w) that measures how good these actions are.

Because we have two models (Actor and Critic) that must be trained, we have two sets of weights (θ for our Actor and w for our Critic) that must be optimized separately:

Policy update (Actor):  Δθ = α ∇_θ(log π_θ(s, a)) q̂_w(s, a)
Value update (Critic):  Δw = β (R(s, a) + γ q̂_w(s', a') − q̂_w(s, a)) ∇_w q̂_w(s, a)
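
To make the two sets of weights concrete, here is a minimal sketch (my own illustration in PyTorch, not the implementation linked below; the network sizes and names are hypothetical, and for simplicity the Critic here estimates a state value V(s) rather than q̂(s, a)): two small networks, each with its own optimizer.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only: a 4-dimensional state, 2 discrete actions.
STATE_DIM, N_ACTIONS = 4, 2

# Actor: the policy π(s, a, θ); its parameters play the role of θ.
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                      nn.Linear(64, N_ACTIONS))

# Critic: the value estimate; its parameters play the role of w.
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                       nn.Linear(64, 1))

# Two separate optimizers, because θ and w are optimized separately.
actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)
```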

A2C and A3C

Introducing the Advantage function to stabilize learning

Value-based methods have high variability.

To reduce this problem, we spoke about using the advantage function instead of the value function.

A(s, a) = Q(s, a) − V(s)

This function tells us how much better taking that action at that state is compared to the average value of that state. In other words, it calculates the extra reward we get if we take this action: the reward beyond the expected value of that state.

The catch is that implementing this advantage function requires two value functions, Q(s, a) and V(s). Fortunately, we can use the TD error as a good estimator of the advantage function.

A(s, a) ≈ r + γ V(s') − V(s)
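
Continuing the earlier sketch, a single-transition update could use the TD error both as the Critic's training signal and as the advantage that weights the Actor's policy gradient (again my own illustration; GAMMA, the function name, and the state-value Critic are assumptions, not the article's code):

```python
from torch.distributions import Categorical

GAMMA = 0.99  # discount factor (hypothetical value)

def actor_critic_update(state, action, reward, next_state, done):
    """One update on a single transition, using the TD error as the advantage estimate."""
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)

    # TD target and TD error: A(s, a) ≈ r + γ V(s') − V(s).
    value = critic(state)
    with torch.no_grad():
        td_target = reward + GAMMA * critic(next_state) * (1.0 - float(done))
    advantage = td_target - value

    # Critic step (w): move V(s) toward the TD target.
    critic_loss = advantage.pow(2).mean()
    critic_optimizer.zero_grad()
    critic_loss.backward()
    critic_optimizer.step()

    # Actor step (θ): increase log π(a|s) in proportion to the advantage.
    # The advantage is detached so the Critic's error does not backpropagate here.
    log_prob = Categorical(logits=actor(state)).log_prob(torch.as_tensor(action))
    actor_loss = -(log_prob * advantage.detach()).mean()
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
```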

Two different strategies: Asynchronous or Synchronous

In A3C, we don't use experience replay, as it requires a lot of memory. Instead, we asynchronously execute different agents in parallel on multiple instances of the environment. Each worker (a copy of the network) updates the global network asynchronously.

The only difference in A2C is that we update the global network synchronously: we wait until all workers have finished their training and calculated their gradients, average them, and then update the global network.
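
To make "wait for all workers, average their gradients, then update the global network once" concrete, here is a hedged sketch continuing the PyTorch example (the function names and the `worker_batches` structure are hypothetical, and only the Actor update is shown for brevity):

```python
def worker_gradients(batch):
    """Gradients of the actor loss on one worker's batch of experience."""
    states, actions, advantages = batch          # tensors collected by that worker
    log_probs = Categorical(logits=actor(states)).log_prob(actions)
    loss = -(log_probs * advantages).mean()
    return torch.autograd.grad(loss, list(actor.parameters()))

def synchronous_update(worker_batches):
    """A2C-style update: wait for every worker, average the gradients, update once."""
    all_grads = [worker_gradients(batch) for batch in worker_batches]
    actor_optimizer.zero_grad()
    for param, grads in zip(actor.parameters(), zip(*all_grads)):
        param.grad = torch.stack(grads).mean(dim=0)   # average over workers
    actor_optimizer.step()
```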

Choosing A2C or A3C?

Because of the asynchronous nature of A3C, some workers play with an older version of the parameters, so the aggregated update is not optimal. A2C waits for each actor to finish its segment of experience before updating the global parameters, which makes training more cohesive and faster.

[Figure: comparison of the A3C (asynchronous) and A2C (synchronous) training architectures.]

Implementing an A2C agent that plays Sonic the Hedgehog

A2C in practice

When the runner takes a step (using the single-step model), it performs a step in each of the n environments. This outputs a batch of experience.

Then we compute the gradient all at once using train_model and our batch of experience.

Finally, we update the step model with the new weights.

Remember that computing the gradient all at once is the same thing as collecting the data, calculating the gradient for each worker, and then averaging. Why? Because summing the derivatives (summing of gradients) is the same thing as taking the derivative of the sum. The latter (one gradient over the whole batch) is more elegant and makes better use of the GPU.
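
For comparison, here is the "compute the gradient all at once" version of the earlier sketch, under the assumption that every worker contributes a batch of the same size (so the mean over the concatenated batch equals the average of the per-worker means):

```python
def batched_update(worker_batches):
    """Equivalent to synchronous_update above, but one gradient over the concatenated batch."""
    states = torch.cat([batch[0] for batch in worker_batches])
    actions = torch.cat([batch[1] for batch in worker_batches])
    advantages = torch.cat([batch[2] for batch in worker_batches])

    log_probs = Categorical(logits=actor(states)).log_prob(actions)
    # The gradient of a sum equals the sum of the gradients, so this matches
    # averaging per-worker gradients (given equal batch sizes).
    loss = -(log_probs * advantages).mean()

    actor_optimizer.zero_grad()
    loss.backward()                              # one backward pass over the whole batch (GPU-friendly)
    actor_optimizer.step()
```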

A2C with Sonic the Hedgehog

Source of the code