
Monte Carlo Methods




The homepage of this blog series and related posts can be found here.
Content source


Part 0: Explore BlackjackEnv

The environment's state consists of 3 pieces of information:

- the player's current sum (0-31)
- the dealer's single showing card (1-10, where 1 stands for an ace)
- whether or not the player has a usable ace (True/False)

The agent can take one of two actions:

- STICK = 0
- HIT = 1

import gym

# create the Blackjack environment (the setup cell is not shown in the post;
# 'Blackjack-v0' is the standard Gym environment id assumed here)
env = gym.make('Blackjack-v0')

print(env.observation_space) # Tuple(Discrete(32), Discrete(11), Discrete(2))
print(env.action_space) # Discrete(2)

Let's play a few episodes with random actions:

for i_episode in range(3):
    state = env.reset()
    while True:
        print(state)
        # take a random action
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        if done:
            print('End game! Reward: ', reward)
            print('You won :)\n') if reward > 0 else print('You lost :(\n')
            break

Part 1: MC Prediction

We start from a fixed policy under which the player almost always STICKs when the current sum exceeds 18. Specifically, if the sum is greater than 18, the player chooses STICK with 80% probability; if the sum is 18 or less, the player chooses HIT with 80% probability.

Input:

- bj_env: an instance of OpenAI Gym's Blackjack environment

Output:

- episode: a list of (state, action, reward) tuples

import numpy as np

def generate_episode_from_limit_stochastic(bj_env):
    episode = []
    state = bj_env.reset()
    while True:
        # fixed stochastic policy: STICK with prob 0.8 if sum > 18, else HIT with prob 0.8
        probs = [0.8, 0.2] if state[0] > 18 else [0.2, 0.8]
        # select an action according to the policy
        action = np.random.choice(np.arange(2), p=probs)
        # take a step in the environment and record the transition
        next_state, reward, done, info = bj_env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

Let's test it:

for i in range(3):
    print(generate_episode_from_limit_stochastic(env))

# output:
# [((17, 7, False), 0, -1.0)]
# [((20, 8, False), 0, 1.0)]
# [((16, 5, True), 1, 0), ((16, 5, False), 1, -1)]
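
Before writing the MC prediction code, it helps to make the return explicit: the return from a time step is the discounted sum of the rewards collected from that step onward. The snippet below is a minimal sketch (not part of the original notebook; the variable names are my own) that computes the return G_0 for the third sample episode printed above.

# Minimal sketch: discounted return of the third sample episode above
sample_episode = [((16, 5, True), 1, 0), ((16, 5, False), 1, -1)]
gamma = 1.0
rewards = [r for (_, _, r) in sample_episode]
G_0 = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G_0)  # -1.0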

Now we implement MC prediction of the action-value function. Input:

- env: an instance of an OpenAI Gym environment
- num_episodes: the number of episodes to generate
- generate_episode: a function that returns an episode of interaction
- gamma: the discount rate (default 1.0)

Output:

- Q: a dictionary where Q[s][a] is the estimated value of taking action a in state s

import sys
from collections import defaultdict

def mc_prediction_q(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # monitor progress
        if i_episode % 1000 == 0:
            print("\rEpisode {}/{}.".format(i_episode, num_episodes), end="")
            sys.stdout.flush()
        # generate an episode
        episode = generate_episode(env)
        # obtain the states, actions, and rewards
        states, actions, rewards = zip(*episode)
        # prepare for discounting
        discounts = np.array([gamma**i for i in range(len(rewards)+1)])
        # update the sum of the returns, number of visits, and action-value 
        # function estimates for each state-action pair in the episode
        for i, state in enumerate(states):
            returns_sum[state][actions[i]] += sum(rewards[i:]*discounts[:-(1+i)])
            N[state][actions[i]] += 1.0
            Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
    return Q
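
A typical call looks like the sketch below; the number of episodes is an illustrative choice of mine, not a value prescribed by the post.

# Estimate Q under the stochastic policy defined above
# (500000 episodes is an illustrative choice)
Q = mc_prediction_q(env, 500000, generate_episode_from_limit_stochastic)

# Inspect the action-value estimates for one state:
# player sum 20, dealer shows 7, no usable ace
print(Q[(20, 7, False)])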