From Arthur Juliani's Simple Reinforcement Learning with Tensorflow series, Part 2 - Policy-based Agents
Full reinforcement agent
A full reinforcement learning agent can take in an observation of the world and take actions that provide the optimal reward not just in the present, but over the long run.
Environments which pose the full problem to an agent are referred to as Markov Decision Processes (MDPs).
Markov Decision Process
An MDP consists of a set of all possible states S, from which our agent at any time experiences a state s, and a set of all possible actions A, from which our agent at any time takes an action a. Given a state-action pair (s, a), the transition probability to a new state s’ is defined by T(s, a), and the reward r is given by R(s, a). As such, at any time in an MDP, an agent is given a state s, takes action a, and receives a new state s’ and reward r.
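The definition above can be made concrete with a tiny hand-written MDP. This is only a minimal sketch: the two states, two actions, probabilities, and rewards below are invented for illustration and have nothing to do with the Cart-Pole task. `T` maps each (s, a) pair to a distribution over next states, `R` maps each (s, a) pair to a reward, and the hypothetical helper `step` samples one transition.

```python
import random

# Hypothetical two-state MDP, invented purely for illustration.
STATES = ["cold", "hot"]
ACTIONS = ["wait", "heat"]

# T[(s, a)] maps each possible next state s' to its probability.
T = {
    ("cold", "wait"): {"cold": 1.0},
    ("cold", "heat"): {"cold": 0.2, "hot": 0.8},
    ("hot", "wait"): {"cold": 0.5, "hot": 0.5},
    ("hot", "heat"): {"hot": 1.0},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ("cold", "wait"): 0.0,
    ("cold", "heat"): -1.0,
    ("hot", "wait"): 1.0,
    ("hot", "heat"): 0.0,
}

def step(state, action, rng):
    """Sample s' from T(s, a) and return (s', r)."""
    next_states = list(T[(state, action)].keys())
    probs = list(T[(state, action)].values())
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[(state, action)]

# The agent is given s, takes a, and receives s' and r, over and over.
rng = random.Random(0)
state = "cold"
trajectory = []
for _ in range(5):
    action = rng.choice(ACTIONS)  # a random policy, just to drive the loop
    next_state, reward = step(state, action, rng)
    trajectory.append((state, action, reward, next_state))
    state = next_state
```

Each entry of `trajectory` is one (s, a, r, s’) tuple, which is exactly the experience a learning agent would be trained on.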
Cart-Pole Task
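import tensorflow as tf  # written against the TensorFlow 1.x graph API (tf.placeholder, tf.train)
import tensorflow.contrib.slim as slim  # tf.contrib only exists in TensorFlow 1.x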
import numpy as np
import gym
import matplotlib.pyplot as plt
%matplotlib inline
try:
    xrange = xrange  # Python 2: xrange already exists
except NameError:
    xrange = range  # Python 3: fall back to range
env = gym.make('CartPole-v0')
"""The Policy-Based Agent"""
gamma = 0.99
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(xrange(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
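To see what `discount_rewards` produces, it helps to run the same backward recurrence, G_t = r_t + gamma * G_{t+1}, on a short episode. The sketch below is a plain-Python mirror of the function above, not the original code: the name `discount_rewards_list` and the three-step reward sequence are assumptions made for illustration, and lists are used instead of NumPy arrays so the snippet has no dependencies.

```python
def discount_rewards_list(rewards, gamma=0.99):
    """Plain-list version of the backward discounting loop."""
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    # Walk from the last step to the first, carrying the discounted tail.
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

# A made-up episode with a reward of 1 at each of three steps.
returns = discount_rewards_list([1.0, 1.0, 1.0])
# returns is approximately [2.9701, 1.99, 1.0]
```

Earlier steps receive larger values because they are credited with every reward that followed them, each scaled down by one more factor of gamma.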
class agent():
    def __init__(self, lr, s_size, a_size, h_size):
        # These lines establish the feed-forward part of the network. The agent takes a state and produces an action.
        self.state_in = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        hidden = slim.fully_connected(self.state_in, h_size, biases_initializer=None, activation_fn=tf.nn.relu)
        self.output = slim.fully_connected(hidden, a_size, activation_fn=tf.nn.softmax, biases_initializer=None)
        self.chosen_action = tf.argmax(self.output, 1)

        # The next six lines establish the training procedure. We feed the reward and chosen action into the network
        # to compute the loss, and use it to update the network.
        self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)

        # Flat index of the chosen action's probability for each row of the batch.
        self.indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), self.indexes)

        self.loss = -tf.reduce_mean(tf.log(self.responsible_outputs) * self.reward_holder)

        tvars = tf.trainable_variables()
        # One placeholder per trainable variable; update_batch applies whatever gradients are fed into them.
        self.gradient_holders = []
        for idx, var in enumerate(tvars):
            placeholder = tf.placeholder(tf.float32, name=str(idx) + '_holder')
            self.gradient_holders.append(placeholder)

        self.gradients = tf.gradients(self.loss, tvars)

        optimizer = tf.train.AdamOptimizer(learning_rate=lr)
        self.update_batch = optimizer.apply_gradients(zip(self.gradient_holders, tvars))
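The index arithmetic behind `self.indexes` and `self.responsible_outputs` is easier to follow on concrete numbers. The sketch below reproduces the same steps with plain Python lists rather than TensorFlow ops; the softmax outputs, actions, and rewards are made-up values, and the variable names are chosen only to echo the class above.

```python
import math

# Hypothetical softmax outputs for a batch of 3 states and 2 actions.
output = [
    [0.7, 0.3],
    [0.4, 0.6],
    [0.9, 0.1],
]
actions = [0, 1, 1]         # action actually taken in each state (made up)
rewards = [1.0, 2.0, -1.0]  # discounted return credited to each action (made up)

n_rows, n_cols = len(output), len(output[0])

# Same arithmetic as self.indexes: row * number_of_actions + chosen action.
indexes = [row * n_cols + a for row, a in zip(range(n_rows), actions)]

# Same effect as tf.gather(tf.reshape(output, [-1]), indexes):
# flatten the matrix row by row, then pick one probability per row.
flat = [p for row in output for p in row]
responsible_outputs = [flat[i] for i in indexes]

# Same formula as self.loss: -mean(log(probability of chosen action) * reward).
loss = -sum(math.log(p) * r for p, r in zip(responsible_outputs, rewards)) / n_rows
```

Here `indexes` comes out as `[0, 3, 5]`, so `responsible_outputs` holds the probability the network assigned to the action that was actually taken in each row. Minimizing the loss raises the log-probability of actions paired with positive rewards and lowers it for actions paired with negative ones.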