Policy-based Agents
Contact me
Blog -> https://cugtyt.github.io/blog/index
Email -> cugtyt@qq.com
GitHub -> Cugtyt@GitHub
The homepage of this blog series and related posts can be found here.
This post is based on Arthur Juliani's Simple Reinforcement Learning with Tensorflow series, Part 2 - Policy-based Agents.
Full Reinforcement Agent
In this part we will build a simple agent that is capable of taking in an observation of the world and taking actions which provide the optimal reward not just in the present, but over the long run.
Environments which pose the full problem to an agent are referred to as Markov Decision Processes (MDPs).
Markov Decision Process
An MDP consists of a set of all possible states S, from which our agent at any time experiences some state s, and a set of all possible actions A, from which our agent at any time takes some action a. Given a state-action pair (s, a), the transition probability to a new state s' is defined by T(s, a), and the reward r is given by R(s, a). As such, at any time in an MDP, an agent is given a state s, takes action a, and receives a new state s' and reward r.
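Written compactly, the definition above is the tuple below. The discount factor γ is not part of the paragraph above, but it is included here because the code that follows uses gamma = 0.99 to weight future rewards, and the quantity the agent actually tries to maximize is the discounted return G_t:

$$\text{MDP} = (S,\; A,\; T,\; R,\; \gamma), \qquad G_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}$$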
Cart-Pole Task
Compared with the previous algorithms we need to make a few changes. To update the agent with more than one experience at a time, we collect experiences in a buffer and then use them to update the agent all at once every so often. These sequences of experience are also known as experience rollouts; a toy sketch of the idea follows, and the actual agent is defined in the code after it.
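A minimal sketch of the rollout idea, using placeholder values rather than a real environment or network (everything here is illustrative only):

import numpy as np

experience_buffer = []                      # one (state, action, reward) tuple per time step
for step in range(200):                     # one hypothetical episode
    state = np.zeros(4)                     # placeholder observation
    action, reward = 0, 1.0                 # placeholder action and reward
    experience_buffer.append((state, action, reward))

# Only once the whole rollout has been collected do we perform a single update,
# using every step of the episode at the same time.
states, actions, rewards = (np.array(x) for x in zip(*experience_buffer))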
import tensorflow as tf
import tensorflow.contrib.slim as slim
import numpy as np
import gym
import matplotlib.pyplot as plt
%matplotlib inline
try:
    xrange = xrange
except:
    xrange = range
env = gym.make('CartPole-v0')
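Before defining the agent, it is worth checking what the environment actually provides. For CartPole-v0 the observation is a 4-dimensional vector and there are 2 discrete actions, which is what the network's input and output sizes will need to match:

print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole velocity
print(env.action_space.n)           # 2: push the cart left or right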
"""The Policy-Based Agent"""
gamma = 0.99
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
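A quick sanity check, with the expected values worked out by hand for gamma = 0.99: each entry becomes the reward at that step plus the discounted sum of everything that follows it.

print(discount_rewards(np.array([1.0, 1.0, 1.0])))
# -> approximately [2.9701, 1.99, 1.0]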
class agent():
    def __init__(self, lr, s_size, a_size, h_size):
        # These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        hidden = slim.fully_connected(self.state_in, h_size, biases_initializer=None, activation_fn=tf.nn.relu)
        self.output = slim.fully_connected(hidden, a_size, activation_fn=tf.nn.softmax, biases_initializer=None)
        self.chosen_action = tf.argmax(self.output, 1)

        # The next six lines establish the training procedure. We feed the reward and chosen action into the network
        # to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)
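        # Explanatory note: self.output has shape [batch, a_size]. Flattening it and
        # indexing with row * a_size + action picks out pi(a_t | s_t), the probability
        # the network assigned to the action that was actually taken. The loss below is
        # then -mean(log pi(a_t | s_t) * R_t), where R_t is the discounted reward fed
        # into reward_holder: the standard policy-gradient loss.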
        self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)
        self.loss = -tf.reduce_mean(tf.log(self.responsible_outputs) * self.reward_holder)

        tvars = tf.trainable_variables()
        self.gradient_holders = []
        for idx, var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32, name=str(idx) + '_holder')
            self.gradient_holders.append(placeholder)

        self.gradients = tf.gradients(self.loss, tvars)

        optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders, tvars))
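The class above only builds the graph; the training loop that drives it is not shown in this section. The sketch below shows, under my own assumptions, how the pieces are intended to fit together: run gradients on each finished episode, accumulate the results in a local gradient buffer, and every few episodes feed that buffer into gradient_holders and run update_batch. The variable names and hyperparameters here (h_size=8, lr=1e-2, update_frequency=5, total_episodes=2000) are illustrative choices, not values taken from this section.

tf.reset_default_graph()
myAgent = agent(lr=1e-2, s_size=4, a_size=2, h_size=8)   # sizes match CartPole-v0

update_frequency = 5      # assumed: apply the accumulated gradients every 5 episodes
total_episodes = 2000     # assumed episode budget

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # One zeroed buffer entry per trainable variable.
    gradBuffer = [grad * 0 for grad in sess.run(tf.trainable_variables())]

    for i in range(total_episodes):
        s = env.reset()
        ep_states, ep_actions, ep_rewards = [], [], []
        done = False
        while not done:
            # Sample an action from the probabilities the network outputs for this state.
            a_dist = sess.run(myAgent.output, feed_dict={myAgent.state_in: [s]})
            p = a_dist[0].astype(np.float64)
            p /= p.sum()                         # guard against float32 rounding
            a = int(np.random.choice(len(p), p=p))
            s1, r, done, _ = env.step(a)         # old gym API: (obs, reward, done, info)
            ep_states.append(s)
            ep_actions.append(a)
            ep_rewards.append(r)
            s = s1

        # Compute the policy gradient for this rollout and add it to the buffer.
        feed = {myAgent.state_in: np.vstack(ep_states),
                myAgent.action_holder: np.array(ep_actions),
                myAgent.reward_holder: discount_rewards(np.array(ep_rewards))}
        grads = sess.run(myAgent.gradients, feed_dict=feed)
        for idx, grad in enumerate(grads):
            gradBuffer[idx] += grad

        # Every few episodes, apply the accumulated gradients and reset the buffer.
        if i % update_frequency == 0 and i != 0:
            sess.run(myAgent.update_batch,
                     feed_dict=dict(zip(myAgent.gradient_holders, gradBuffer)))
            gradBuffer = [grad * 0 for grad in gradBuffer]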