## Introduction

In this notebook you will learn the basic principles of reinforcement learning and see how it differs from supervised learning. Classical supervised learning is about learning a certain task, like classification or regression, from a set of training examples with features and labels; during training you already know what to aim for. Reinforcement learning, on the other hand, is more "learning by doing": at the start you do not know what to aim for, but instead you learn from an action-reward system. Each action you perform returns a reward, and your task is to maximize the reward over time.

A good and easy-to-understand task for studying this interplay between action and reward is learning how to play bandit machines, or a multi-armed bandit machine. Each arm represents a distribution of rewards, and your task will be to learn which arm to choose in order to maximize the reward over many tries.

## Requirements

### Knowledge

• basic terms: action, reward, value function
• $\epsilon$-greedy method, incremental implementation

### Prerequisites

To solve this notebook, read [SUT18] up to chapter 2.4. [SUT18] is the standard literature on reinforcement learning and the basis for this and the following notebooks.

### Python Modules

import numpy as np
import types
import matplotlib.pyplot as plt

%matplotlib inline

## Exercises

The following exercises are based on chapters 2.3 and 2.4 in [SUT18], although the underlying problem is much older. In general, it is about using resources to either:

• acquire new knowledge ("exploration")
• or optimize your actions based on existing knowledge ("exploitation").

Imagine you are in a casino with ten one-armed bandits in front of you. You have a limited number of trials (e.g. 1,000), and at the beginning the only thing you know is that each bandit yields a different reward. The rewards might be normally distributed, as in the picture below.

• Choice among $k$ (here 10) different options, or actions.
• After each choice you receive a numerical reward $R_t$ (from a stationary probability distribution, here normal distributions) depending on the action you selected.
• Objective: maximize the expected total reward over some time period (e.g. over 1000 action selections).

### The Bandits

Before we start to solve our objective, we first need to create some bandits.

Write a function get_bandit_function which returns a function bandit_fct representing the bandit. Given an action (i.e. using a bandit arm), bandit_fct returns a reward drawn from that arm's reward distribution. The means of the 10 reward distributions should be drawn from a normal distribution with mean zero and unit variance; each arm's reward distribution is then a normal distribution with the drawn mean and unit variance.
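One possible implementation is sketched below. This is a minimal sketch, not the required solution; the closure-based structure is an assumption, and any callable that returns a normally distributed reward per arm would do.

```python
import numpy as np

def get_bandit_function(bandit_means, sigma=1):
    # Return a bandit function: given an arm index (action), it draws a
    # reward from a normal distribution around that arm's mean.
    def bandit_fct(action):
        return np.random.normal(bandit_means[action], sigma)
    return bandit_fct
```

Calling `get_bandit_function(np.random.normal(0, 1, 10))` then yields a 10-armed bandit whose arm means are themselves drawn from a standard normal distribution.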

np.random.seed(43)
bandit_means = np.random.normal(10,1,10)
bandit_means
#####################################
######## Code to generate picture ###
#####################################
plt.plot([0, 11], [0, 0], linestyle='--')
plt.plot(np.arange(10)+1, bandit_means, 'ro', label=r'$Bandit \ means \ \mu$')
plt.errorbar(np.arange(10)+1, bandit_means, yerr=np.ones(10), fmt='none', label=r'Standard deviation $\sigma$')
plt.grid()
plt.title('Reward normal distributions of the bandits')
plt.ylim(min(bandit_means)-2, max(bandit_means)+2)
plt.xlabel('Bandit number')
plt.ylabel('Expected reward')
plt.legend()
plt.show()

### Solution


def get_bandit_function(bandit_means, sigma=1):
    raise NotImplementedError()
### get a bandit (e.g. lambda function)
bandit_fct = get_bandit_function(bandit_means, sigma=1)
### make a call
print('Test action 4')
print(bandit_fct(3))  # indices start at 0, so this is actually action no. 4
### This should not throw an exception if your implementation is correct
nb_samples = 100000
action_to_test = 3

assert isinstance(bandit_fct, types.FunctionType)
print('Test 1 passed')

mean_action_ = np.empty(nb_samples)
for i in range(nb_samples):
    mean_action_[i] = bandit_fct(action_to_test)
np.testing.assert_almost_equal(mean_action_.mean(),bandit_means[action_to_test], decimal=2)
print('Test 2 passed')

### Notation

#### The Problem

On each of an infinite sequence of time steps, $t=1,2,3,\ldots$, you choose an action $A_t$ from $k$ possibilities and receive a real-valued reward $R_t$. The reward depends only on the action taken and is identically and independently distributed (i.i.d.).

The value of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that action $a$ is selected:

$q_*(a) \doteq \mathbb E [ R_t \mid A_t = a], \quad \forall a \in \{1,2,\ldots,k\}$

These values (true values) are unknown. The distribution is unknown.

Nevertheless, you must maximize your total reward.

You must both try actions to learn their values (explore) and prefer those that appear best (exploit).

#### Exploration / Exploitation Dilemma

Suppose you form estimates (action value estimates):

$Q_t(a) \approx q_*(a), \quad \forall a$

Define the greedy action at time $t$ as

$\DeclareMathOperator*{\argmax}{arg\,max}$

$A^∗_t := \argmax_a Q_t(a)$

• If $A_t=A^∗_t$ then you are exploiting.

• If $A_t \neq A^∗_t$ then you are exploring.

You can't do both at the same time, but you need to do both.

You can never stop exploring, but maybe you should explore less with time. Or maybe not.

#### Action Value Methods

Methods that learn action-value estimates and nothing else.

For example, estimate action-values as sample averages:

$Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$

The sample-average estimates converge to the true values if the action is taken an infinite number of times:

$\lim_{N_t(a) \rightarrow \infty} Q_t(a) = q_*(a)$

with $N_t(a)$ the number of times action $a$ has been taken by time $t$.
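As an illustration of this convergence (with an assumed true value $q_*(a)=1.5$, not part of the exercise), the sample average of simulated rewards approaches the true value:

```python
import numpy as np

rng = np.random.default_rng(0)
q_star = 1.5                                   # assumed true action value
rewards = rng.normal(q_star, 1.0, size=100_000)

# Q[t-1] is the sample-average estimate after t pulls of this action
Q = np.cumsum(rewards) / np.arange(1, rewards.size + 1)
```

Early entries of `Q` fluctuate strongly, while the last entries lie close to `q_star`.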

#### Averaging

To calculate the estimate $Q_{n_a}(a)$ of an action $a$, we average its first $n_a - 1$ rewards:

$Q_{n_a}(a) \doteq \frac{R_1 + R_2 + \ldots + R_{n_a-1}}{n_a-1}$

This would require us to store all rewards. To save memory, we can instead compute the so-called running average, storing only the latest $Q_{n_a}(a)$ and $n_a$:

$Q_{n_a+1}(a) \doteq Q_{n_a}(a) + \frac{1}{n_a} \left[ R_{n_a} - Q_{n_a}(a) \right]$
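A quick check (on an assumed simulated reward stream) that the running average reproduces the batch sample average without storing all rewards:

```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(0.5, 1.0, size=1000)  # assumed reward stream for one action

Q = 0.0
for n, r in enumerate(rewards, start=1):
    Q = Q + (r - Q) / n                    # incremental (running) average

assert np.isclose(Q, rewards.mean())       # matches the batch average
```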

### The epsilon-greedy action selection

In greedy action selection, you always exploit.

In $\epsilon$-greedy, you are usually greedy, but with probability $\epsilon$ you instead pick an action at random (possibly the greedy action again).

This is perhaps the simplest way to balance exploration and exploitation.
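The selection step can be sketched like this (a minimal sketch; function and variable names are illustrative, not the exercise's required signatures):

```python
import numpy as np

def select_action(Qs, epsilon, rng):
    # With probability epsilon: explore, i.e. pick a uniformly random arm
    # (possibly the greedy one again); otherwise exploit the best estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(Qs)))
    return int(np.argmax(Qs))

rng = np.random.default_rng(0)
Qs = np.array([0.1, 0.9, 0.3])
actions = [select_action(Qs, 0.1, rng) for _ in range(1000)]
```

With $\epsilon=0.1$, at least 90% of the selections are greedy (arm 1 here), the rest uniformly random.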

Implement the action-value method with the incremental implementation and $\epsilon$-greedy action selection for the 10-armed bandit problem.

a) implement the update rule for the action values Q: update_Qs(action, reward, Qs, Ns)

b) write a function that plays a bandit and returns the Q values, rewards, and actions taken: play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps=1000, nb_arms=10)

c) to "analyze" your algorithm, perform multiple runs, always using a new 10-armed bandit, and calculate the average reward and the percentage of optimal choices at each step over these runs; visualize your results for different $\epsilon$ values.

Hint: for the last part use at least 500 runs to get nicely shaped plots.

### Solution a)

def update_Qs(action, reward, Qs, Ns):
    raise NotImplementedError
### Needed for the next cell to validate your implementation

def init_Ns_Qs_Rs():
    np.random.seed(44)
    Ns = [0 for i in range(10)]
    Qs = [0 for i in range(10)]
    Rs = [np.random.randint(0, 10, size=np.random.randint(0, 10, size=1)) for i in range(10)]
    return Ns, Qs, Rs

Ns, Qs, Rs = init_Ns_Qs_Rs()
print(Ns)
print(Qs)
print(Rs)
### This cell should not throw an exception

Ns, Qs, Rs = init_Ns_Qs_Rs()

for action, rewards in enumerate(Rs):
    print('action: ', action, 'rewards: ', rewards)
    for r in rewards:
        update_Qs(action, r, Qs, Ns)

print(Ns)
print(Qs)

assert Ns[0] == 4
assert Ns[1] == 4
assert Ns[2] == 6
assert Ns[3] == 5
assert Ns[4] == 7
assert Ns[5] == 8
assert Ns[6] == 2
assert Ns[7] == 2
assert Ns[8] == 8
assert Ns[9] == 5
assert Qs[0] == 1.75
assert Qs[1] == 6.25
assert Qs[2] == 5.166666666666667
assert Qs[3] == 5.4
assert Qs[4] == 3.8571428571428577
assert Qs[5] == 4.249999999999999
assert Qs[6] == 1.5
assert Qs[7] == 5.5
assert Qs[8] == 5.125
assert Qs[9] == 3.8

### Solution b)

def play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps=1000, nb_arms=10):
    """
    Parameters
    ----------

    epsilon: float
        probability of choosing a random action (exploration)

    optimistic: float
        initial value for the Q-values (for all actions)

    nb_steps: int
        number of steps, i.e. how often the bandit is played

    """
    raise NotImplementedError

    return Qs, rewards, actions

### Solution c)

######################
### Your code here ###
######################

### Gradient Bandit Algorithm

Another approach to the bandit problem is the Gradient Bandit Algorithm presented in chapter 2.8 of [SUT18].

Instead of selecting only the best-considered bandit with probability $1 - \epsilon$, or any bandit uniformly with probability $\epsilon$, we can assign a preference $H_t(a)$ to each bandit, which we dynamically adjust after each time step $t$.

These $H_t(a)$ are then used to compute the probability of choosing action (bandit) $A_t$ at the current time step $t$, with the softmax function:

$P(A_t = a) = \frac{\exp(H_t(a))}{\sum^k_{b=1}\exp(H_t(b))} = \pi_t(a)$

After we receive the reward $R_t$ for action $A_t$, we update $H_t(A_t)$:

$H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \bar R_t)(1-\pi_t(A_t))$

and all other actions $a \neq A_t$ with:

$H_{t+1}(a) = H_t(a) - \alpha(R_t - \bar R_t)\pi_t(a)$

$\bar R_t$ is the overall average reward so far and can be used as a baseline.

Initially, at $t=0$, all $H_{t=0}(a)$ are $0$.
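The two update equations above can be sketched as follows. This is a minimal sketch: the argument order of update_H here is illustrative and may differ from the signature required in the exercise below.

```python
import numpy as np

def softmax(H):
    # Convert preferences H into probabilities pi_t(a);
    # subtracting the maximum improves numerical stability.
    e = np.exp(H - H.max())
    return e / e.sum()

def update_H(action, R_avg, R_t, H, alpha=0.1):
    # Gradient-bandit preference update: the taken action moves by
    # +alpha * (R_t - R_avg) * (1 - pi[action]); every other action a
    # moves by -alpha * (R_t - R_avg) * pi[a].
    pi = softmax(H)
    H_new = H - alpha * (R_t - R_avg) * pi
    H_new[action] += alpha * (R_t - R_avg)
    return H_new
```

If the received reward exceeds the baseline, the preference of the taken action increases and all others decrease, and vice versa.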

Implement the Gradient Bandit Algorithm for the 10-armed bandit problem. The steps are largely analogous to the previous exercise.

• implement the softmax function for the numerical preference H
• implement the update rule for the preference H
• write a function play_bandit_grad which plays the bandit machine according to the Gradient Bandit Algorithm
• write a function get_percentage_optimal_action that returns the percentage of optimal actions taken at each step; test different $\alpha$'s with and without the baseline (without the baseline, $\bar R_t$ is not updated and stays $0$). The plot should show that using the baseline has a significant impact over not using it.

Tip: for the last part use at least 100 runs to get nicely shaped plots.

### Solution

def softmax(H):
    raise NotImplementedError

def update_H(action, R_avg, R_t, H, alpha=0.1):
    """
    Parameters
    ----------

    action: integer
        current action taken

    R_avg: float
        average reward until time step t

    R_t: float
        current reward at time step t

    H: array of floats
        numerical preferences

    alpha: float
        step size parameter

    """
    raise NotImplementedError

bandits_generation_mean = 4.
bandits_means = np.random.randn(10) + bandits_generation_mean
optimal_action = np.argmax(bandits_means)
def play_bandit_grad(optimal_action, nb_steps=1000,
                     bandit_fct=bandit_fct,
                     use_R_avg_baseline=True, alpha=0.1):
    """
    Parameters
    ----------

    optimal_action: integer
        optimal action which can be taken

    nb_steps: integer
        number of steps a bandit is played

    bandit_fct: function
        function that simulates a bandit play

    use_R_avg_baseline: bool
        if True, R_avg is updated at each step as a running average

    alpha: float
        step size parameter

    """
    H = np.zeros(10)
    R_avg = 0.
    is_optimal_action = np.zeros(nb_steps)  # history of optimal steps taken

    return H, R_avg, is_optimal_action
nb_steps = 1000

def get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=True):
    """
    Parameters
    ----------

    n: integer
        number of runs, i.e. how many bandit machines are played

    alpha: float
        step size parameter

    use_R_avg_baseline: bool
        if True, R_avg is updated at each step as a running average

    """
    percentage_optimal_action = np.zeros(nb_steps)

    return percentage_optimal_action / n
percentage_optimal_action_a01 = get_percentage_optimal_action(n=100, alpha=0.1)
percentage_optimal_action_a01_no_baseline = get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=False)
percentage_optimal_action_a04 = get_percentage_optimal_action(n=100, alpha=0.4)
percentage_optimal_action_a04_no_baseline = get_percentage_optimal_action(n=100, alpha=0.4, use_R_avg_baseline=False)
plt.figure(figsize=(7,7))
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01, label=r'$\alpha=0.1$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04, label=r'$\alpha=0.4$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01_no_baseline, label=r'$\alpha=0.1$, without baseline')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04_no_baseline, label=r'$\alpha=0.4$, without baseline')
plt.legend(loc=4)
plt.xlabel("steps")
plt.ylabel("% optimal action")
plt.grid()
plt.show()

## Literature

[SUT18] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. Second edition, MIT Press, 2018.

### Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise Bandits 10-armed testbed

Christian Herta, Oliver Fischer