# Exercise Monte Carlo with Frozenlake from gym API

## Table of Contents

## Introduction

In this exercise you will learn techniques based on Monte Carlo estimators to solve reinforcement learning problems in which you don't know the environmental behavior. This stands in contrast to the gridworld examble seen before, where the full behavior of the environment was known and could be modeled. Here we have to learn based on an episode by episode strategy and estimate the state-action values over many episodes to find an optimal/good policy. As an examble for this we consider the frozenlake environment provided by the gym API. The fozenlake environment is represented by a 4x4 grid consisting of a start grid , some hole grids and one goal grid. As in the gridworld examble the agent can move, up, down, right and left. As further difficulty the grid is also slippery, meaning that an action might not lead to the desired direction. The rewards are set as follows, 0 for each transition not entering the goal grid and +1 for reaching the goal grid.

## Requirements

### Knowledge

To solve this notebook you should aquire knowledge about Monte Carlo methods, agent-environment interaction, state-action values and policy improvement.

### Prerequisites

Read `SUT18`

chapter 5 to gain knowledge about the mentioned topics and terms.
`SUT18`

is the standard literature for reinforcment learning and the basis for this and following notebooks.

## Python Modules

```
import gym
import numpy as np
import matplotlib.pyplot as plt
```

## Frozenlake environment

Below you will find some basic commands on how to work with the frozenlake environment.

```
# creating the environment
env = gym.make('FrozenLake-v0')
```

```
print("Sizes")
print("-----")
print("Action space: ", env.action_space)
print("Observation space: ", env.observation_space)
```

```
env.reset() #reset the environment the set agent to start state
env.render() #display the agents current state on the grid
```

The abbreviations stand for

`S`

: Start`F`

: Frozen (safe)`H`

: Hole`G`

: Goal.

The actions are numerated as **left (0), right (1), down (2), up (3)**.

A generic random walk until a terminal state is reached (done == True) is implemented below.

Such a walk, until a terminal state is reached, represents one episode.

```
env.reset()
for i in range(10):
random_action = env.action_space.sample() #samples a random action
print("Action:",random_action)
new_state, reward, done, info = env.step(random_action) #agent takes action (random_action)
"""
new_state: new state after action (random_action) taken in current state
reward: reward obtained after taken action (random_action) in current state and entering new state (new_state)
done: bool flag, true if goal or hole is reached
info: slippery probability, baseline is 1/3
"""
env.render() # display current agent state
if done:
break
```

The states are internally enumerate as follows.

```
env.reset()
env.render()
print(" -- -- -- --")
print("|1 |2 |3 |4 |")
print(" -- -- -- --")
print("|5 |6 |7 |8 |")
print(" -- -- -- --")
print("|9 |10|11|12|")
print(" -- -- -- --")
print("|13|14|15|16|")
print(" -- -- -- --")
```

Terminal states are 6,8,12,13 (Holes) and 16 (Goal). The start state is always 1.

## Testing a policy

The problem is said to be solved, a good policy found, if over a sufficient number of episodes (>100) a mean reward of >0.7 is reached. The code below tests this for a given policy.

```
def test_performance(policy, nb_episodes=100):
sum_returns = 0
for i in range(nb_episodes):
state = env.reset()
done = False
while not done:
action = policy(state)
state, reward, done, info = env.step(action)
if done:
sum_returns += reward
return sum_returns/nb_episodes
```

Lets set and test a random policy. Since later we will work with a$ \epsilon-policy $ we always will start by defining a policy as a dictionary (state-action pairs) or array (index-value pairs) and than wrap the dictionary or array into a function.

```
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random policy
policy = lambda s: policy_dict[s]
# calling dictionary with [] and function with ()
print("Mean reward:",test_performance(policy))
```

We see that the number is far lower than 0.7 and therefor the policy is not good/ optimal.

# Exercises

## Monte Carlo (every visit) with$ \epsilon $ policy

The goal of the following exercies is to set up a learning algorihtm using a **Monte Carlo method** based on the **every visit** estimate for the state-action values combined with an$ \epsilon $ policy.

```
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} #random
policy = lambda s: policy_dict[s]
```

### Task 1

Since Monte Carlo methods are based on learning from episodes write a function `random_episode`

that generates an eposide given the frozenlake environment and a policy.

```
def random_episode(env,policy):
"""
Args:
env: given enviroment, here frozenlake
policy: certain policy for the agent
Return:
episode: list of (state,action, reward) tribles
"""
return NotImplementedError
```

### Task 2

As said before we will work with a policy dictionary or array wraped in a function. Write a function `epsilon_policy`

that wraps a policy dictionary or array into an$ \epsilon - policy $ function $ \epsilon-greedy-policy $).

```
def epsilon_policy(policy_dict, epsilon=0.2, env=env):
"""
Args:
policy_dict: policy dictionary
epsilon: epsilon paramter, decision boundary for exploration vs explotation
env: enviroment
Return:
epsilon_greedy_policy
"""
def epsilon_greedy_policy(s):
"""
Args:
s: state s
Return:
a: action based on an epsilon greedy method using the policy dict
"""
return NotImplementedError
return epsilon_greedy_policy
```

### Tie-breaking argmax

As mentioned in the algorithms in `SUT18`

it is important to use a tie-breaking argmax. The numpy argmax naturally will only find the first biggest number in an array and therefore given a certain q-table prefers specific actions just based on their position, respectively numeration inside that q-table. To avoid this problem use the `random_argmax_axis1`

function given below for your implementation instead of the numpy argmax to improve learning.

```
def random_argmax_axis1(b):
""" a random tie-breaking argmax"""
return np.argmax(np.random.random(b.shape) * (b.T==b.max(axis=1)).T, axis=1)
```

### Task 3

`SUT18`

provides many Monte Carlo based learning algorithm techniques including, exploring starts (ES),$ \epsilon $-greedy, first-visit and every-visit estimation. Why are ES and first-visit not well suited for the frozenlake environment?

### Task 4

Write a function `MC_epsilon_greedy_every_visit`

that performs policy improvement based on Monte Carlo every vist estimation for the state-action values and use the `epsilon_policy`

to wrap the policy dictionary/ array.

**Caution:**
If you follow the pseudo code algorithms presented in `SUT18`

it is important you run reverse through your generated episodes, see Appendix task 5.

```
def MC_epsilon_greedy_every_visit(env,nb_eps,policy_dict,gamma=0.95):
"""
Args:
env: enviroment
nb_eps: number of episodes used to train
policy_dict: policy dictionary
Kwargs:
gamma: discount rate
Returns:
policy: policy as a function
greedy_policy: policy as array, final policy to test
Q_sa: q-table
returns_Q_sa: dictionary for states-actions paris with cumulative return and number of visits information
"""
# state-action values Q(s,a) for 16 states and 4 actions, q-table
Q_sa = np.zeros((16,4))
# set dictionary for (s,a) pairs [keys], cumulative return and number of visits (return_sum,nb_visits) [values]
returns_Q_sa = dict()
# get epsilon policy
policy = epsilon_policy(policy_dict)
#for i in range(nb_eps):
# get episode
#for e in reversed(episode):
# calculated reward and number of visits
# add to dictionary
# calculate q-table
# find current best policy with argmax
# set greedy policy
return NotImplemented
```

## Training

Execute the cell below to train the agent.

```
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
```

`print(greedy_policy)`

## Test performance after training

Since each performance test might lead to a different value accordingly to the random nature of our frozenlake environment due to its slippery condition, it is usefull to find the mean over many tests. Reaching over 0.7 means your learning was succesful.

`policy = lambda s: greedy_policy[s]`

```
mean_performance_list = []
mean_performance = 0
n = 300
for i in range(n):
performance = test_performance(policy)
mean_performance_list.append(performance)
mean_performance += performance
print(mean_performance/n)
```

```
plt.hist(mean_performance_list,bins=30)
plt.grid()
plt.xlabel("Policy performance")
plt.show()
```

If you have some time and want to convince yourself that because of the random nature of the enviroment different policies can lead to a good performance run the code below. This code will repeatively train and evaluate the policy performance.

```
for i in range (10):
# Training
policy_dict = {0:1, 1:2, 2:1, 3:0, 4:1, 6:1, 8:2, 9:0, 10:1, 13:2, 14:2} # initialize random policy
policy,greedy_policy,Q_sa,returns_Q_sa = MC_epsilon_greedy_every_visit(env,100000,policy_dict=policy_dict,gamma=0.95)
print(greedy_policy)
# Testing
policy = lambda s: greedy_policy[s]
n = 300
mean_performance = 0
for i in range(n):
performance = test_performance(policy)
mean_performance += performance
print(mean_performance/n)
print()
```

## Appendix

### Calculating cumulative reward

The cumulativ reward is defined as

$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + ... $.

### Task 5

Given a list of random returns$ R $ calculate the cumulative return$ G $ using one loop running normally through the list and one running reversed through the list. Whats the difference.

```
rand_return_list = np.random.randint(20,size=(20))
print(rand_return_list)
gamma = 0.9
#code here
for r in reversed(rand_return_list):
# code here
print(G)
# code here
for r in rand_return_list:
# code here
print(G)
```

## Literature

## Licenses

### Notebook License (CC-BY-SA 4.0)

*The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).*

exercise-monte-carlo-frozenlake-gym

Oliver Fischer

is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2019 Oliver Fischer

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.