Exercise Monte Carlo with Frozenlake from gym API

Introduction

In this exercise you will learn techniques based on Monte Carlo estimators to solve reinforcement learning problems in which you do not know the behavior of the environment. This stands in contrast to the gridworld example seen before, where the full behavior of the environment was known and could be modeled. Here we have to learn episode by episode and estimate the state-action values over many episodes to find an optimal/good policy. As an example we consider the frozenlake environment provided by the gym API. The frozenlake environment is a 4x4 grid consisting of a start cell, some hole cells and one goal cell. As in the gridworld example the agent can move up, down, right and left. As an additional difficulty the grid is slippery, meaning that an action might not lead into the desired direction. The rewards are set as follows: 0 for each transition not entering the goal cell and +1 for reaching the goal cell.

Requirements

Knowledge

To solve this notebook you should acquire knowledge about Monte Carlo methods, agent-environment interaction, state-action values and policy improvement.

Prerequisites

Read SUT18 chapter 5 to gain knowledge about the mentioned topics and terms. SUT18 is the standard literature for reinforcement learning and the basis for this and the following notebooks.

Python Modules

import gym
import numpy as np
import matplotlib.pyplot as plt

Frozenlake environment

Below you will find some basic commands on how to work with the frozenlake environment.

# creating the environment
env = gym.make('FrozenLake-v0')
print("Sizes")
print("-----")
print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)
env.reset() # reset the environment and set the agent to the start state
env.render() # display the agent's current position on the grid

The abbreviations stand for

  • S: Start
  • F: Frozen (safe)
  • H: Hole
  • G: Goal.

The actions are numbered as left (0), down (1), right (2), up (3).
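
For readability, the four action integers can be given names. This is only a convenience definition introduced here (a sketch, not part of the gym API); the integer values follow the encoding stated above.

# convenience names for the action encoding described above
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3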

A random walk of at most 10 steps, stopping as soon as a terminal state is reached (done == True), is implemented below.

Such a walk, until a terminal state is reached, represents one episode.

env.reset()
for i in range(10):
    random_action = env.action_space.sample() #samples a random action
    print("Action:",random_action)
    new_state, reward, done, info = env.step(random_action) #agent takes action (random_action)
    
    """
        new_state: new state after taking action (random_action) in the current state
        reward: reward obtained after taking action (random_action) in the current state and entering the new state (new_state)
        done: bool flag, True if a goal or hole is reached
        info: additional information, here the probability of the sampled transition (1/3 on the slippery grid)
    
    """
    env.render() # display current agent state
    if done:
        break

The states are internally enumerated as follows.

env.reset()
env.render()

print(" -- -- -- --")
print("|1 |2 |3 |4 |")
print(" -- -- -- --")
print("|5 |6 |7 |8 |")
print(" -- -- -- --")
print("|9 |10|11|12|")
print(" -- -- -- --")
print("|13|14|15|16|")
print(" -- -- -- --")

Terminal states are 5, 7, 11 and 12 (holes) and 15 (goal). The start state is always 0, which is the state returned by env.reset().
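
A quick sanity check of this numbering (the expected values in the comments assume the standard 4x4 FrozenLake-v0 map):

# sanity check of the internal state numbering
state = env.reset()
print("Start state:", state)                          # expected: 0
print("Number of states:", env.observation_space.n)   # expected: 16
print("Number of actions:", env.action_space.n)       # expected: 4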

Testing a policy

The problem is considered solved, i.e., a good policy has been found, if a mean reward greater than 0.7 is reached over a sufficiently large number of episodes (at least 100). The code below tests this for a given policy.

def test_performance(policy, nb_episodes=100):
    sum_returns = 0
    for i in range(nb_episodes):
        state  = env.reset()
        done = False
        while not done:
            action = policy(state)
            state, reward, done, info = env.step(action)
            if done:
                sum_returns += reward
    return sum_returns/nb_episodes

Let's set and test a random policy. Since we will later work with an $ \epsilon $-greedy policy, we will always start by defining a policy as a dictionary (state-action pairs) or an array (index-value pairs) and then wrap the dictionary or array into a function.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random policy
policy = lambda s: policy_dict[s]

# calling dictionary with [] and function with ()

print("Mean reward:",test_performance(policy))

We see that the mean reward is far lower than 0.7 and therefore the policy is not good/optimal.

Exercises

Monte Carlo (every visit) with $ \epsilon $-greedy policy

The goal of the following exercises is to set up a learning algorithm using a Monte Carlo method based on the every-visit estimate of the state-action values, combined with an $ \epsilon $-greedy policy.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random
policy = lambda s: policy_dict[s]

Task 1

Since Monte Carlo methods are based on learning from episodes, write a function random_episode that generates an episode, given the frozenlake environment and a policy.

def random_episode(env,policy):
    
    """
    Args:
        env: given environment, here frozenlake
        policy: certain policy for the agent 
        
    Return:
        episode: list of (state, action, reward) triples
    
    """
    
    raise NotImplementedError()
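
If you want to compare against something concrete, here is a minimal sketch of one possible implementation (a sketch, not the reference solution); it only relies on the gym step API shown above and the (state, action, reward) ordering from the docstring:

def random_episode_sketch(env, policy):
    """Sketch: roll out one episode and record (state, action, reward) triples."""
    episode = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                          # ask the policy for an action
        new_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))         # reward for this transition
        state = new_state
    return episode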

Task 2

As said before, we will work with a policy dictionary or array wrapped in a function. Write a function epsilon_policy that wraps a policy dictionary or array into an $ \epsilon $-greedy policy function (epsilon_greedy_policy).

def epsilon_policy(policy_dict, epsilon=0.2, env=env):
    
    """
    Args:
        policy_dict: policy dictionary
        epsilon: epsilon parameter, decision boundary for exploration vs. exploitation
        env: environment
        
    Return: 
        epsilon_greedy_policy 
    
    """
    
    def epsilon_greedy_policy(s):
        """
        Args:
            s: state s
            
        Return:
            a: action based on an epsilon greedy method using the policy dict
        
        """
        raise NotImplementedError()
      
    
    return epsilon_greedy_policy
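
For orientation, a minimal sketch of the wrapping could look as follows (one assumption here is that exploration simply means sampling a uniformly random action from env.action_space):

def epsilon_policy_sketch(policy_dict, epsilon=0.2, env=env):
    """Sketch: wrap a policy dict/array into an epsilon-greedy policy function."""
    def epsilon_greedy_policy(s):
        if np.random.random() < epsilon:
            return env.action_space.sample()   # explore: random action
        return policy_dict[s]                  # exploit: action proposed by the policy
    return epsilon_greedy_policy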

Tie-breaking argmax

As mentioned in the algorithms in SUT18, it is important to use a tie-breaking argmax. The numpy argmax always returns the first occurrence of the maximum in an array and therefore, given a certain q-table, prefers specific actions purely because of their position (numbering) inside that q-table. To avoid this problem, use the random_argmax_axis1 function given below instead of the numpy argmax to improve learning.

def random_argmax_axis1(b):
    """ a random tie-breaking argmax"""
    return np.argmax(np.random.random(b.shape) * (b.T==b.max(axis=1)).T, axis=1)
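
A small demonstration of the difference, using a hypothetical q-table with ties in every row:

# np.argmax always picks the first maximum, random_argmax_axis1 picks a random one
q_demo = np.array([[1.0, 1.0, 0.0, 1.0],
                   [0.5, 0.5, 0.5, 0.0]])
print(np.argmax(q_demo, axis=1))     # always [0 0]
print(random_argmax_axis1(q_demo))   # e.g. [3 1], varies between calls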

Task 3

SUT18 provides many Monte Carlo based learning techniques, including exploring starts (ES), $ \epsilon $-greedy, first-visit and every-visit estimation. Why are ES and first-visit not well suited for the frozenlake environment?

Task 4

Write a function MC_epsilon_greedy_every_visit that performs policy improvement based on Monte Carlo every-visit estimation of the state-action values, and use epsilon_policy to wrap the policy dictionary/array.

Caution: If you follow the pseudocode algorithms presented in SUT18, it is important that you iterate in reverse through your generated episodes; see Task 5 in the Appendix.

def MC_epsilon_greedy_every_visit(env,nb_eps,policy_dict,gamma=0.95):
    """
    Args:
        env: environment
        nb_eps: number of episodes used to train
        policy_dict: policy dictionary
    Kwargs:
        gamma: discount rate
    
    Returns:
        policy: policy as a function
        greedy_policy: policy as array, final policy to test 
        Q_sa: q-table
        returns_Q_sa: dictionary for state-action pairs with cumulative return and number of visits information
    
    """
    # state-action values Q(s,a) for 16 states and 4 actions, q-table
    Q_sa = np.zeros((16,4))
    
    # set dictionary for (s,a) pairs [keys], cumulative return and number of visits (return_sum,nb_visits) [values]
    returns_Q_sa = dict()
    
    # get epsilon policy 
    policy = epsilon_policy(policy_dict)
    
    #for i in range(nb_eps):
        # get episode
        
        #for e in reversed(episode):
            # calculate the return and the number of visits
            # add to dictionary
            # calculate q-table
            
        
        
        # find current best policy with argmax

        # set greedy policy 
        
    raise NotImplementedError()
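
If you are unsure how the bookkeeping inside the episode loop might look, here is one possible sketch of the every-visit update for a single episode (a sketch, not the reference solution); episode is assumed to be the list of (state, action, reward) triples from Task 1, and the names follow the skeleton above:

# sketch of the per-episode update
G = 0
for (s, a, r) in reversed(episode):
    G = gamma * G + r                                  # return following (s, a)
    return_sum, nb_visits = returns_Q_sa.get((s, a), (0.0, 0))
    returns_Q_sa[(s, a)] = (return_sum + G, nb_visits + 1)
    Q_sa[s, a] = (return_sum + G) / (nb_visits + 1)    # every-visit average

# greedy improvement with tie breaking, then re-wrap into an epsilon-greedy policy
greedy_policy = random_argmax_axis1(Q_sa)
policy = epsilon_policy(greedy_policy)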

Training

Execute the cell below to train the agent.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
print(greedy_policy)

Test performance after training

Since each performance test might lead to a different value due to the random nature of our frozenlake environment (its slippery condition), it is useful to take the mean over many tests. Reaching a value above 0.7 means your learning was successful.

policy = lambda s: greedy_policy[s]
mean_performance_list = []
mean_performance = 0
n = 300
for i in range(n):
    performance = test_performance(policy)
    mean_performance_list.append(performance)
    mean_performance += performance
    
print(mean_performance/n)
plt.hist(mean_performance_list,bins=30)
plt.grid()
plt.xlabel("Policy performance")
plt.show()

If you have some time and want to convince yourself that, because of the random nature of the environment, different policies can lead to a good performance, run the code below. This code will repeatedly train and evaluate the policy performance.

for i in range(10):
    # Training
    policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
    policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
    print(greedy_policy)
    
    # Testing
    policy = lambda s: greedy_policy[s]
    n = 300
    mean_performance = 0
    for i in range(n):
        performance = test_performance(policy)
        mean_performance += performance
    
    print(mean_performance/n)
    print()

Appendix

Calculating cumulative reward

The cumulative reward (return) is defined as

$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots $.
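
Equivalently, the return satisfies the recursion $ G_t = R_{t+1} + \gamma G_{t+1} $, which is exactly why processing an episode in reverse (as in the caution of Task 4) is convenient: the return of the later time step is already available when the earlier one is updated.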

Task 5

Given a list of random rewards $ R $, calculate the cumulative return $ G $ using one loop running forwards through the list and one running in reverse through the list. What is the difference?

rand_return_list = np.random.randint(20,size=(20))
print(rand_return_list)
gamma = 0.9

#code here
for r in reversed(rand_return_list):
    # code here
print(G)    

# code here
for r in rand_return_list:
   # code here
print(G)
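
For comparison, one possible way to fill in the two loops (a sketch; the reversed loop exploits the recursion $ G_t = R_{t+1} + \gamma G_{t+1} $, while the forward loop has to keep track of the powers of $ \gamma $ explicitly):

# reversed loop: accumulate from the back, no explicit powers of gamma needed
G = 0
for r in reversed(rand_return_list):
    G = gamma * G + r
print(G)

# forward loop: each reward has to be discounted by its own power of gamma
G = 0
for i, r in enumerate(rand_return_list):
    G += gamma**i * r
print(G)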

Literature

  • SUT18: Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. Second edition, MIT Press, 2018.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

exercise-monte-carlo-frozenlake-gym

Oliver Fischer

is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2019 Oliver Fischer

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.