Exercise  The 10 Armed Bandit Problem
Table of Contents
Introduction
In this notebook you will learn the basic principles of reinforcment learning and see how it differs from supervised learning. Classicle supervised learning is purely based on learning a certain task like classification or regression given a set of training variables, features and labels. So while training you already know what to aim for. Reinforcment learning on the other hand is more "learning by doing". So at start you do not know what to aim for but instead you learn based on an action reward system. Each action you perform gives back a reward and your task is it to maximize the reward over time.
A quite good and easy to understand task to understand this interplay bewteen action and reward is to learn how to play bandit machines or a multi armend bandit machine. Each arm represents a distribution of rewards and your task will it be to learn which arm to choose in order to maximize the reward over many tries.
Requirements
Knowledge
 basic terms: action, reward, value function *$ \epsilongreedy $ method, incremental implementation
Prerequisites
To solve this notebook read SUT18
until chapter 2.4. SUT18
is the standard literature for reinforcment learning and the basis for this and following notebooks.
Python Modules
import numpy as np
import types
import matplotlib.pyplot as plt
%matplotlib inline
Exercises
The following exercises are based on chapter 2.3 and 2.4 in [SUT18]. Although the problem to be solved is way older. In general, it is about using ressources to either:
 acquire knew knowledge ("exploration")
 or to optimize your actions based on existing knowledge ("exploitation").
Imagine you are in a casino seeing ten one armed bandits in front of you.
You have a limited number of trials (e.g. 1.000) and at the beginning the only thing you know, is that each bandit gives another reward. The reward might be normally distributed, like in the picture below:
 Choice among k (here 10) different options, or actions.
 After each choice you receive a numerical reward $ R_t $ (from a stationary probability distribution, here normal distributions) depending on the action you selected.
 Objective: maximize the expected total reward over some time period, (e.g. over 1000 action selections).
The Bandits
Before we start to solve our objective, we first need to create some bandits.
Task 1
Write a function get_bandit_function
which returns a function bandit_fct
representing the bandit. bandit_fct
returns the reward ,based on a reward distribution, given for a certain action (using a bandit arm). The means for all 10 bandit_fct
distributions should be selected from a normal distribution with mean zero and unit variance and all bandit_fct
distributions are normal distributions with these selected mean values and also unit variance.
np.random.seed(43)
bandit_means = np.random.normal(10,1,10)
bandit_means
#####################################
######## Code to generate picture ###
#####################################
plt.plot([0,11],[0,0],linestyle='')
plt.plot(np.arange(10)+1,bandit_means,'ro',label='$Bandit \ means \ \mu$')
plt.errorbar(np.arange(10)+1,bandit_means,yerr=np.ones(10),fmt='none', label='Standard deviation $\ \sigma$')
plt.grid()
plt.title('Reward normal distributions of the bandits')
plt.ylim(min(bandit_means)2, max(bandit_means)+2)
plt.xlabel('Bandit number')
plt.ylabel('Expected reward')
plt.legend()
plt.show()
Solution
$ \DeclareMathOperator*{\argmax}{arg\,max} % in your preamble \DeclareMathOperator*{\argmin}{arg\,min} % in your preamble $
def get_bandit_function(bandit_means, sigma = 1):
raise NotImplementedError()
### get a bandit (e.g. lambda function)
bandit_fct = get_bandit_function(bandit_means, sigma=1)
### make a call
print('Test action 4')
print(bandit_fct(3)) # index start at 0 so this is actually action nr.4
### This should not throw an exception if your implementation is correct
nb_samples = 100000
action_to_test = 3
assert isinstance(bandit_fct, types.FunctionType)
print('Test 1 passed')
mean_action_ = np.ndarray(nb_samples)
for i in range(nb_samples):
mean_action_[i] = bandit_fct(action_to_test)
np.testing.assert_almost_equal(mean_action_.mean(),bandit_means[action_to_test], decimal=2)
print('Test 2 passed')
Notation
The Problem
On each of an infinite sequence of time steps, $ t=1,2,3,… $, you choose an action $ A_t $ from $ k $ possibilities, and receive a realvalued reward $ R_t $ The reward depends only on the action taken; it is indentically and independently distributed (i.i.d.)
The value then of an arbitrary action $ a $ , denoted $ q_∗(a) $ , is the expected reward given that action$ a $ is selected:
$ q_*(a) \dot = \mathbb E [ R_t  A_t = a], \forall a \in {1,2,\ldots,k} $
These values (true values) are unknown. The distribution is unknown.
Nevertheless, you must maximize your total reward.
You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
Exploration / Exploitation Dilemma
Suppose you form estimates (action value estimates):
$ Q_t(a) \approx q_∗(a) \forall a $
Define the greedy action at time $ t $ as
$ \DeclareMathOperator*{\argmax}{arg\,max} % in your preamble \DeclareMathOperator*{\argmin}{arg\,min} % in your preamble $
$ A^∗_t := \argmax_a Q_t(a) $

If $ A_t=A^∗_t $ then you are exploiting.

If $ A_t \neq A^∗_t $ then you are exploring.
You can’t do both, but you need to do both
You can never stop exploring, but maybe you should explore less with time. Or maybe not.
Action Value Methods
Methods that learn ActionValue estimates and nothing else.
For example, estimate actionvalues as sample averages:
$ Q_t(a) \dot{=} \frac{\text{sum of rewards when a taken prior to } t}{\text{number of times a taken prior to } t} = \frac{\sum_{i=1}^{t1} R_i \cdot \mathbb{1}_{A_{i=a}}}{\sum_{i=1}^{t1} \mathbb{1}_{A_{i=a}}} $
The sampleaverage estimates converge to true values. If the action is taken an infinite number of times
$ \lim_{N_t(a) \rightarrow \infty} Q_t(a) = q_*(a) $
with$ N_t(a) $, the number of times action $ a $ has been taken by time $ t $.
Averaging
To calculate the estimate$ Q_{n_a}(a) $ of an action$ a $, we calculate the average after$ n_a+1 $ rewards:
$ Q_{n_a}(a) \dot{=} \frac{R_1 + R_2 + \ldots + R_{n_a+1}}{n_a1} $
This would require us to store all rewards. To save memory, we can calculate the so called running average, only storing the latest$ Q_{n_a}(a) $ and$ n_a $:
$ Q_{n_a+1}(a) \dot{=} Q_{n_a}(a) + \frac{1}{n_a} [ R_{n_a}  Q_{n_a}(a) ] $
The epsilongreedy action selection
In greedy action selection, you always exploit.
In $ \epsilon $ greedy, you are usually greedy, but with probability $ \epsilon $ you instead pick an action at random (also possibly the greedy action again).
This is perhaps the simplest way to balance exploration and exploitation
Task 2
Implement the actionvalue method with the incremental implementation,$ \epsilon $greedy algorithm, for the 10 armend bandit problem.
a) implement the update rule for the actionvalues Q, update_Qs(action, reward, Qs, Ns)
b) write a function that plays a bandit and returns Q values, rewards and actions done
play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps = 1000, nb_arms=10):
c) to "analyze" your algorithm perform multiple runs, always using a new 10 armed bandit, and calculate the average reward and percentage of optimal choices for each step over these multiple runs, visualize your results for different$ \epsilon $ values. Your plots should look like the following:
Hint:
For the last part use at least 500 runs for nicely shaped plots
Solution a)
def update_Qs(action, reward, Qs, Ns):
raise NotImplementedError
### Needed fo the next cell to validate your implementation
def init_Ns_Qs_Rs():
np.random.seed(44)
Ns = [0 for i in range(10)]
Qs = [0 for i in range(10)]
Rs = [np.random.randint(0,10, size=np.random.randint(0,10, size=1)) for i in range(10)]
return Ns, Qs, Rs
Ns, Qs, Rs = init_Ns_Qs_Rs()
print(Ns)
print(Qs)
print(Rs)
### This cell should not throw an exception
Ns, Qs, Rs = init_Ns_Qs_Rs()
for action, rewards in enumerate(Rs):
print('action: ', action, 'rewards: ', rewards)
for r in rewards:
update_Qs(action, r, Qs, Ns)
print(Ns)
print(Qs)
assert Ns[0] is 4
assert Ns[1] is 4
assert Ns[2] is 6
assert Ns[3] is 5
assert Ns[4] is 7
assert Ns[5] is 8
assert Ns[6] is 2
assert Ns[7] is 2
assert Ns[8] is 8
assert Ns[9] is 5
assert Qs[0] == 1.75
assert Qs[1] == 6.25
assert Qs[2] ==(5.166666666666667)
assert Qs[3] == 5.4
assert Qs[4] == 3.8571428571428577
assert Qs[5] == 4.249999999999999
assert Qs[6] == 1.5
assert Qs[7] == 5.5
assert Qs[8] == 5.125
assert Qs[9] == 3.8
Solution b)
def play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps = 1000, nb_arms=10):
"""
Parameters

epsilon : float
epsilon: probability of chosing a random action (exploration)
optimistic: float
Initial value for the Qvalues (for all action)
nb_steps: int
Number of steps. How often the bandit is played
"""
raise NotImplementedError
return Qs, rewards, actions
Solution c)
######################
### Your code here ###
######################
Gradient Bandit Algorithm
Another approach to solve the Bandit problem is the Gradient Bandit Algorithm
presented in chapter 2.8 [SUT18]
Instead of selecting only the best considdered bandit with probability$ 1  \epsilon $ or uniformly any bandit with probability$ \epsilon $, we can assign a preference$ H_t(a) $ for each bandit, which we dynamically adjust after each time step$ t $.
These$ H_t(a) $ are then used to computed the probabilities which action (bandit)$ A_t $ to chose at current time step$ t $, with the softmax function:
$ P(A_t = a) = \frac{\exp(H_t(a)}{\sum^k_{b=1}\exp(H_t(b)} = \pi_t(a) $
After we receive our reward$ R_t $ with action$ A_t $ we update$ H_t(A_t) $:
$ H_{t+1}(A_t)= H_t(A_t) + \alpha(R_t  \bar R_t)(1\pi_t(At)) $
and all other actions$ a $ we did not chose with:
$ H_{t+1}(a)= H_t(A_t)  \alpha(R_t  \bar R_t)(1\pi_t(a)) $ $ \bar R_t $ is the overall average reward so far and can be used as a baseline.
Initially at$ t=0 $ all$ H_{t=0}(a) $ are$ 0 $.
Task 3
Implement the Gradient Bandit Algorithm
for the 10 armend bandit problem. The steps taken here are fairly equivalent to task 2.
 implement the
softmax
function for the numerical preferenceH
 implement the update rule for the preference
H
 write a function
play_bandit_grad
which plays the bandit machine according to theGradient Bandit Algorithm
 write a function
get_percentage_optimal_action
that returns the percentage of optimal actions taken for each step, test different alpha's with and without baseline $ \bar R_t $ is not updated and stays$ 0 $). The plot should show, that using a baseline has a significant impact over not using the baseline:
Tip: for the last part use at least 100 runs for nicely shaped plots
Solution
def softmax(H):
raise NotImplementedError
def update_H(action, R_avg, R_t, H, alpha=0.1):
"""
Parameters

action: integer
current action taken
R_avg: float
average reward until time step t
R_t: float
curent reward at time step t
H: float
numerical preference
alpha: float
step size parameter
"""
raise NotImplementedError
bandits_generation_mean = 4.
bandits_means = np.random.randn(10) + bandits_generation_mean
optimal_action = np.argmax(bandits_means)
def play_bandit_grad(optimal_action, nb_steps=1000,
bandit_fct=bandit_fct,
use_R_avg_baseline=True, alpha=0.1):
"""
Parameters

optimal_action: integer
optimal action which can be taken
nb_steps: integer
number of steps a bandit is played
bandit_fct:
function that simulates a bandit play
use_R_avg_baseline: bool
is true R_avg is updated at each steps according to moving
averages
alpha: float
step size parameter
"""
H=np.zeros(10)
R_avg = 0.
is_optimal_action = np.zeros(nb_steps) # history of optimal steps taken
return H, R_avg, is_optimal_action
nb_steps=1000
def get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=True):
percentage_optimal_action=np.zeros(nb_steps)
"""
Parameters

n: integer
number of runs, how many bandit machines are played
alpha: float
step size parameter
use_R_avg_baseline: bool
is true R_avg is updated at each steps according to moving
averages
"""
return percentage_optimal_action/n
percentage_optimal_action_a01 = get_percentage_optimal_action(n=100, alpha=0.1)
percentage_optimal_action_a01_no_baseline = get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=False)
percentage_optimal_action_a04 = get_percentage_optimal_action(n=100, alpha=0.4)
percentage_optimal_action_a04_no_baseline = get_percentage_optimal_action(n=100, alpha=0.4, use_R_avg_baseline=False)
plt.figure(figsize=(7,7))
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01, label=r'$\alpha=0.1$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04, label=r'$\alpha=0.4$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01_no_baseline, label=r'$\alpha=0.1$, without baseline')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04_no_baseline, label=r'$\alpha=0.4$, without baseline')
plt.legend(loc=4)
plt.xlabel("steps")
plt.ylabel("% optimal action")
plt.grid()
plt.show()
Literature
Licenses
Notebook License (CCBYSA 4.0)
The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).
Exercise Bandits 10 armend testbed
Christian Herta, Oliver Fischer
is licensed under a Creative Commons AttributionShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.
Code License (MIT)
The following license only applies to code cells of the notebook.
Copyright 2019 Christian Herta, Oliver Fischer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.