Exercise Monte Carlo with Frozenlake from gym API
Table of Contents
Introduction
In this exercise you will learn techniques based on Monte Carlo estimators to solve reinforcement learning problems in which you don't know the environmental behavior. This stands in contrast to the gridworld examble seen before, where the full behavior of the environment was known and could be modeled. Here we have to learn based on an episode by episode strategy and estimate the state-action values over many episodes to find an optimal/good policy. As an examble for this we consider the frozenlake environment provided by the gym API. The fozenlake environment is represented by a 4x4 grid consisting of a start grid , some hole grids and one goal grid. As in the gridworld examble the agent can move, up, down, right and left. As further difficulty the grid is also slippery, meaning that an action might not lead to the desired direction. The rewards are set as follows, 0 for each transition not entering the goal grid and +1 for reaching the goal grid.
Requirements
Knowledge
To solve this notebook you should aquire knowledge about Monte Carlo methods, agent-environment interaction, state-action values and policy improvement.
Prerequisites
Read SUT18
chapter 5 to gain knowledge about the mentioned topics and terms.
SUT18
is the standard literature for reinforcment learning and the basis for this and following notebooks.
Python Modules
import gym
import numpy as np
import matplotlib.pyplot as plt
Frozenlake environment
Below you will find some basic commands on how to work with the frozenlake environment.
# creating the environment
env = gym.make('FrozenLake-v0')
print("Sizes")
print("-----")
print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)
env.reset() #reset the environment the set agent to start state
env.render() #display the agents current state on the grid
The abbreviations stand for
S
: StartF
: Frozen (safe)H
: HoleG
: Goal.
The actions are numerated as left (0), right (1), down (2), up (3).
A generic random walk until a terminal state is reached (done == True) is implemented below.
Such a walk, until a terminal state is reached, represents one episode.
env.reset()
for i in range(10):
random_action = env.action_space.sample() #samples a random action
print("Action:",random_action)
new_state, reward, done, info = env.step(random_action) #agent takes action (random_action)
"""
new_state: new state after action (random_action) taken in current state
reward: reward obtained after taken action (random_action) in current state and entering new state (new_state)
done: bool flag, true if goal or hole is reached
info: slippery probability, baseline is 1/3
"""
env.render() # display current agent state
if done:
break
The states are internally enumerate as follows.
env.reset()
env.render()
print(" -- -- -- --")
print("|1 |2 |3 |4 |")
print(" -- -- -- --")
print("|5 |6 |7 |8 |")
print(" -- -- -- --")
print("|9 |10|11|12|")
print(" -- -- -- --")
print("|13|14|15|16|")
print(" -- -- -- --")
Terminal states are 6,8,12,13 (Holes) and 16 (Goal). The start state is always 1.
Testing a policy
The problem is said to be solved, a good policy found, if over a sufficient number of episodes (>100) a mean reward of >0.7 is reached. The code below tests this for a given policy.
def test_performance(policy, nb_episodes=100):
sum_returns = 0
for i in range(nb_episodes):
state = env.reset()
done = False
while not done:
action = policy(state)
state, reward, done, info = env.step(action)
if done:
sum_returns += reward
return sum_returns/nb_episodes
Lets set and test a random policy. Since later we will work with a$ \epsilon-policy $ we always will start by defining a policy as a dictionary (state-action pairs) or array (index-value pairs) and than wrap the dictionary or array into a function.
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random policy
policy = lambda s: policy_dict[s]
# calling dictionary with [] and function with ()
print("Mean reward:",test_performance(policy))
We see that the number is far lower than 0.7 and therefor the policy is not good/ optimal.
Exercises
Monte Carlo (every visit) with$ \epsilon $ policy
The goal of the following exercies is to set up a learning algorihtm using a Monte Carlo method based on the every visit estimate for the state-action values combined with an$ \epsilon $ policy.
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random
policy = lambda s: policy_dict[s]
Task 1
Since Monte Carlo methods are based on learning from episodes write a function random_episode
that generates an eposide given the frozenlake environment and a policy.
def random_episode(env,policy):
"""
Args:
env: given enviroment, here frozenlake
policy: certain policy for the agent
Return:
episode: list of (state,action, reward) tribles
"""
return NotImplementedError
Task 2
As said before we will work with a policy dictionary or array wraped in a function. Write a function epsilon_policy
that wraps a policy dictionary or array into an$ \epsilon - policy $ function $ \epsilon-greedy-policy $).
def epsilon_policy(policy_dict, epsilon=0.2, env=env):
"""
Args:
policy_dict: policy dictionary
epsilon: epsilon paramter, decision boundary for exploration vs explotation
env: enviroment
Return:
epsilon_greedy_policy
"""
def epsilon_greedy_policy(s):
"""
Args:
s: state s
Return:
a: action based on an epsilon greedy method using the policy dict
"""
return NotImplementedError
return epsilon_greedy_policy
Tie-breaking argmax
As mentioned in the algorithms in SUT18
it is important to use a tie-breaking argmax. The numpy argmax naturally will only find the first biggest number in an array and therefore given a certain q-table prefers specific actions just based on their position, respectively numeration inside that q-table. To avoid this problem use the random_argmax_axis1
function given below for your implementation instead of the numpy argmax to improve learning.
def random_argmax_axis1(b):
""" a random tie-breaking argmax"""
return np.argmax(np.random.random(b.shape) * (b.T==b.max(axis=1)).T, axis=1)
Task 3
SUT18
provides many Monte Carlo based learning algorithm techniques including, exploring starts (ES),$ \epsilon $-greedy, first-visit and every-visit estimation. Why are ES and first-visit not well suited for the frozenlake environment?
Task 4
Write a function MC_epsilon_greedy_every_visit
that performs policy improvement based on Monte Carlo every vist estimation for the state-action values and use the epsilon_policy
to wrap the policy dictionary/ array.
Caution:
If you follow the pseudo code algorithms presented in SUT18
it is important you run reverse through your generated episodes, see Appendix task 5.
def MC_epsilon_greedy_every_visit(env,nb_eps,policy_dict,gamma=0.95):
"""
Args:
env: enviroment
nb_eps: number of episodes used to train
policy_dict: policy dictionary
Kwargs:
gamma: discount rate
Returns:
policy: policy as a function
greedy_policy: policy as array, final policy to test
Q_sa: q-table
returns_Q_sa: dictionary for states-actions paris with cumulative return and number of visits information
"""
# state-action values Q(s,a) for 16 states and 4 actions, q-table
Q_sa = np.zeros((16,4))
# set dictionary for (s,a) pairs [keys], cumulative return and number of visits (return_sum,nb_visits) [values]
returns_Q_sa = dict()
# get epsilon policy
policy = epsilon_policy(policy_dict)
#for i in range(nb_eps):
# get episode
#for e in reversed(episode):
# calculated reward and number of visits
# add to dictionary
# calculate q-table
# find current best policy with argmax
# set greedy policy
return NotImplemented
Training
Execute the cell below to train the agent.
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
print(greedy_policy)
Test performance after training
Since each performance test might lead to a different value accordingly to the random nature of our frozenlake environment due to its slippery condition, it is usefull to find the mean over many tests. Reaching over 0.7 means your learning was succesful.
policy = lambda s: greedy_policy[s]
mean_performance_list = []
mean_performance = 0
n = 300
for i in range(n):
performance = test_performance(policy)
mean_performance_list.append(performance)
mean_performance += performance
print(mean_performance/n)
plt.hist(mean_performance_list,bins=30)
plt.grid()
plt.xlabel("Policy performance")
plt.show()
If you have some time and want to convince yourself that because of the random nature of the enviroment different policies can lead to a good performance run the code below. This code will repeatively train and evaluate the policy performance.
for i in range (10):
# Training
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
print(greedy_policy)
# Testing
policy = lambda s: greedy_policy[s]
n = 300
mean_performance = 0
for i in range(n):
performance = test_performance(policy)
mean_performance += performance
print(mean_performance/n)
print()
Appendix
Calculating cumulative reward
The cumulativ reward is defined as
$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + ... $.
Task 5
Given a list of random returns$ R $ calculate the cumulative return$ G $ using one loop running normally through the list and one running reversed through the list. Whats the difference.
rand_return_list = np.random.randint(20,size=(20))
print(rand_return_list)
gamma = 0.9
#code here
for r in reversed(rand_return_list):
# code here
print(G)
# code here
for r in rand_return_list:
# code here
print(G)
Literature
Licenses
Notebook License (CC-BY-SA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
exercise-monte-carlo-frozenlake-gym
Oliver Fischer
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2019 Oliver Fischer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.