Monte Carlo Methods
Part 0: Explore BlackjackEnv
The environment state consists of 3 pieces of information:
- the player's current sum $\in \{0, 1, \ldots, 31\}$
- the dealer's face-up card $\in \{1, \ldots, 10\}$
- whether the player has a usable ace (no $= 0$, yes $= 1$)

The agent has two possible actions:
- STICK = 0
- HIT = 1
import gym

env = gym.make('Blackjack-v0')
print(env.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space)  # Discrete(2)
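To see what a single observation looks like (the exact values below will vary from run to run), we can reset the environment; the tuple holds the player's current sum, the dealer's face-up card, and the usable-ace flag:

state = env.reset()
print(state)  # e.g. (14, 9, False): player sum 14, dealer shows 9, no usable ace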
Let's play a few games with random actions:
for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        # take a random action
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break
Part 1: MC Prediction
We start from a fixed stochastic policy in which the player almost always STICKs when the current sum exceeds 18. Specifically, if the sum is greater than 18, the player selects STICK with 80% probability; if the sum is 18 or less, the player selects HIT with 80% probability.
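As a quick sanity check of this policy (a minimal sketch, not part of the original code; the sample size is an arbitrary choice), we can draw many actions for a fixed player sum and confirm that the empirical frequencies are close to 80/20:

import numpy as np

# empirically check the 80/20 policy for a player sum above 18
n_samples = 10000
state_sum = 20  # player's current sum, greater than 18
probs = [0.8, 0.2] if state_sum > 18 else [0.2, 0.8]
actions = np.random.choice(np.arange(2), size=n_samples, p=probs)
print('STICK frequency:', np.mean(actions == 0))  # should be close to 0.8
print('HIT frequency:', np.mean(actions == 1))    # should be close to 0.2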
Input:
- bj_env: the Blackjack environment instance

Output:
- episode: a list of (state, action, reward) tuples, corresponding to $(S_0, A_0, R_1, \ldots, S_{T-1}, A_{T-1}, R_{T})$, where $T$ is the final time step. In particular, episode[i] returns $(S_i, A_i, R_{i+1})$.
import numpy as np

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        # the fixed stochastic policy described above
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        # select an action according to the policy
        action = np.random.choice(np.arange(2), p=probs)
        # take a step and record the transition for the episode
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode
Try it out:
for i in range(3):
    print(generate_episode_from_limit_stochastic(env))
# output:
# [((17, 7, False), 0, -1.0)]
# [((20, 8, False), 0, 1.0)]
# [((16, 5, True), 1, 0), ((16, 5, False), 1, -1)]
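To make the return computation in the next part concrete, here is a small worked sketch (not from the original notebook) that computes the discounted return $G_i = \sum_{k \ge 0} \gamma^{k} R_{i+k+1}$ for each step of the third sample episode above; with $\gamma = 1$, both state-action pairs receive a return of $-1$.

# worked example: per-step returns for the third sample episode above
episode = [((16, 5, True), 1, 0), ((16, 5, False), 1, -1)]
gamma = 1.0
rewards = [r for (_, _, r) in episode]
for i, (state, action, _) in enumerate(episode):
    # discounted sum of all rewards from step i onward
    G = sum(gamma**k * r for k, r in enumerate(rewards[i:]))
    print(state, action, '-> return', G)
# (16, 5, True) 1 -> return -1.0
# (16, 5, False) 1 -> return -1.0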
Now implement MC prediction. Input:
- env: the environment instance
- num_episodes: the number of episodes to generate
- generate_episode: a function that returns one episode
- gamma: the discount rate, a value between 0 and 1 (default 1)

Output:
- Q: a dictionary where Q[s][a] is the estimated action value for state s and action a
import sys
from collections import defaultdict

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes + 1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        # generate an episode
        episode = generate_episode(env)
        # obtain the states, actions, and rewards
        states, actions, rewards = zip(*episode)
        # prepare for discounting
        discounts = np.array([gamma**i for i in range(len(rewards) + 1)])
        # update the sum of the returns, number of visits, and action-value
        # function estimates for each state-action pair in the episode
        for i, state in enumerate(states):
            returns_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1 + i)])
            N[state][actions[i]] += 1.0
            Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
    return Q
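A minimal usage sketch, assuming the Blackjack-v0 setup above; the episode count is an arbitrary choice. Since the policy is known, the state-value function can be recovered by weighting Q[s][a] with the policy probabilities from generate_episode_from_limit_stochastic:

# usage sketch: estimate Q under the 80/20 stochastic policy
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# recover the state-value function of that policy:
# V(s) = sum_a pi(a|s) * Q(s, a), with pi(STICK|s) = 0.8 when the player's sum exceeds 18
V = {s: (0.8 * q[0] + 0.2 * q[1]) if s[0] > 18 else (0.2 * q[0] + 0.8 * q[1])
     for s, q in Q.items()}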