# Exercise Monte Carlo with Frozenlake from gym API

## Introduction

In this exercise you will learn techniques based on Monte Carlo estimators to solve reinforcement learning problems in which you don't know the environmental behavior. This stands in contrast to the gridworld examble seen before, where the full behavior of the environment was known and could be modeled. Here we have to learn based on an episode by episode strategy and estimate the state-action values over many episodes to find an optimal/good policy. As an examble for this we consider the frozenlake environment provided by the gym API. The fozenlake environment is represented by a 4x4 grid consisting of a start grid , some hole grids and one goal grid. As in the gridworld examble the agent can move, up, down, right and left. As further difficulty the grid is also slippery, meaning that an action might not lead to the desired direction. The rewards are set as follows, 0 for each transition not entering the goal grid and +1 for reaching the goal grid.

## Requirements

### Knowledge

To solve this notebook you should aquire knowledge about Monte Carlo methods, agent-environment interaction, state-action values and policy improvement.

### Prerequisites

Read SUT18 chapter 5 to gain knowledge about the mentioned topics and terms. SUT18 is the standard literature for reinforcment learning and the basis for this and following notebooks.

## Python Modules

import gym
import numpy as np
import matplotlib.pyplot as plt

## Frozenlake environment

Below you will find some basic commands on how to work with the frozenlake environment.

# creating the environment
env = gym.make('FrozenLake-v0')
print("Sizes")
print("-----")
print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)
env.reset() #reset the environment the set agent to start state
env.render() #display the agents current state on the grid

The abbreviations stand for

• S: Start
• F: Frozen (safe)
• H: Hole
• G: Goal.

The actions are numerated as left (0), right (1), down (2), up (3).

A generic random walk until a terminal state is reached (done == True) is implemented below.

Such a walk, until a terminal state is reached, represents one episode.

env.reset()
for i in range(10):
random_action = env.action_space.sample() #samples a random action
print("Action:",random_action)
new_state, reward, done, info = env.step(random_action) #agent takes action (random_action)

"""
new_state: new state after action (random_action) taken in current state
reward: reward obtained after taken action (random_action) in current state and entering new state (new_state)
done: bool flag, true if goal or hole is reached
info: slippery probability, baseline is 1/3

"""
env.render() # display current agent state
if done:
break

The states are internally enumerate as follows.

env.reset()
env.render()

print(" -- -- -- --")
print("|1 |2 |3 |4 |")
print(" -- -- -- --")
print("|5 |6 |7 |8 |")
print(" -- -- -- --")
print("|9 |10|11|12|")
print(" -- -- -- --")
print("|13|14|15|16|")
print(" -- -- -- --")

Terminal states are 6,8,12,13 (Holes) and 16 (Goal). The start state is always 1.

## Testing a policy

The problem is said to be solved, a good policy found, if over a sufficient number of episodes (>100) a mean reward of >0.7 is reached. The code below tests this for a given policy.

def test_performance(policy, nb_episodes=100):
sum_returns = 0
for i in range(nb_episodes):
state  = env.reset()
done = False
while not done:
action = policy(state)
state, reward, done, info = env.step(action)
if done:
sum_returns += reward
return sum_returns/nb_episodes

Lets set and test a random policy. Since later we will work with a$\epsilon-policy$ we always will start by defining a policy as a dictionary (state-action pairs) or array (index-value pairs) and than wrap the dictionary or array into a function.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random policy
policy = lambda s: policy_dict[s]

# calling dictionary with [] and function with ()

print("Mean reward:",test_performance(policy))

We see that the number is far lower than 0.7 and therefor the policy is not good/ optimal.

# Exercises

## Monte Carlo (every visit) with$\epsilon$ policy

The goal of the following exercies is to set up a learning algorihtm using a Monte Carlo method based on the every visit estimate for the state-action values combined with an$\epsilon$ policy.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random
policy = lambda s: policy_dict[s]

Since Monte Carlo methods are based on learning from episodes write a function random_episode that generates an eposide given the frozenlake environment and a policy.

def random_episode(env,policy):

"""
Args:
env: given enviroment, here frozenlake
policy: certain policy for the agent

Return:
episode: list of (state,action, reward) tribles

"""

return NotImplementedError

As said before we will work with a policy dictionary or array wraped in a function. Write a function epsilon_policy that wraps a policy dictionary or array into an$\epsilon - policy$ function $\epsilon-greedy-policy$).

def epsilon_policy(policy_dict, epsilon=0.2, env=env):

"""
Args:
policy_dict: policy dictionary
epsilon: epsilon paramter, decision boundary for exploration vs explotation
env: enviroment

Return:
epsilon_greedy_policy

"""

def epsilon_greedy_policy(s):
"""
Args:
s: state s

Return:
a: action based on an epsilon greedy method using the policy dict

"""
return NotImplementedError

return epsilon_greedy_policy

### Tie-breaking argmax

As mentioned in the algorithms in SUT18 it is important to use a tie-breaking argmax. The numpy argmax naturally will only find the first biggest number in an array and therefore given a certain q-table prefers specific actions just based on their position, respectively numeration inside that q-table. To avoid this problem use the random_argmax_axis1 function given below for your implementation instead of the numpy argmax to improve learning.

def random_argmax_axis1(b):
""" a random tie-breaking argmax"""
return np.argmax(np.random.random(b.shape) * (b.T==b.max(axis=1)).T, axis=1)

SUT18 provides many Monte Carlo based learning algorithm techniques including, exploring starts (ES),$\epsilon$-greedy, first-visit and every-visit estimation. Why are ES and first-visit not well suited for the frozenlake environment?

Write a function MC_epsilon_greedy_every_visit that performs policy improvement based on Monte Carlo every vist estimation for the state-action values and use the epsilon_policy to wrap the policy dictionary/ array.

Caution: If you follow the pseudo code algorithms presented in SUT18 it is important you run reverse through your generated episodes, see Appendix task 5.

def MC_epsilon_greedy_every_visit(env,nb_eps,policy_dict,gamma=0.95):
"""
Args:
env: enviroment
nb_eps: number of episodes used to train
policy_dict: policy dictionary
Kwargs:
gamma: discount rate

Returns:
policy: policy as a function
greedy_policy: policy as array, final policy to test
Q_sa: q-table
returns_Q_sa: dictionary for states-actions paris with cumulative return and number of visits information

"""
# state-action values Q(s,a) for 16 states and 4 actions, q-table
Q_sa = np.zeros((16,4))

# set dictionary for (s,a) pairs [keys], cumulative return and number of visits (return_sum,nb_visits) [values]
returns_Q_sa = dict()

# get epsilon policy
policy = epsilon_policy(policy_dict)

#for i in range(nb_eps):
# get episode

#for e in reversed(episode):
# calculated reward and number of visits
# calculate q-table

# find current best policy with argmax

# set greedy policy

return NotImplemented

## Training

Execute the cell below to train the agent.

policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)

print(greedy_policy)

## Test performance after training

Since each performance test might lead to a different value accordingly to the random nature of our frozenlake environment due to its slippery condition, it is usefull to find the mean over many tests. Reaching over 0.7 means your learning was succesful.

policy = lambda s: greedy_policy[s]
mean_performance_list = []
mean_performance = 0
n = 300
for i in range(n):
performance = test_performance(policy)
mean_performance_list.append(performance)
mean_performance += performance

print(mean_performance/n)
plt.hist(mean_performance_list,bins=30)
plt.grid()
plt.xlabel("Policy performance")
plt.show()

If you have some time and want to convince yourself that because of the random nature of the enviroment different policies can lead to a good performance run the code below. This code will repeatively train and evaluate the policy performance.

for i in range (10):
# Training
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
print(greedy_policy)

# Testing
policy = lambda s: greedy_policy[s]
n = 300
mean_performance = 0
for i in range(n):
performance = test_performance(policy)
mean_performance += performance

print(mean_performance/n)
print()

## Appendix

### Calculating cumulative reward

The cumulativ reward is defined as

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + ...$.

Given a list of random returns$R$ calculate the cumulative return$G$ using one loop running normally through the list and one running reversed through the list. Whats the difference.

rand_return_list = np.random.randint(20,size=(20))
print(rand_return_list)
gamma = 0.9

#code here
for r in reversed(rand_return_list):
# code here
print(G)

# code here
for r in rand_return_list:
# code here
print(G)

## Literature

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

exercise-monte-carlo-frozenlake-gym

Oliver Fischer