Exercise - The 10-Armed Bandit Problem

Introduction

In this notebook you will learn the basic principles of reinforcement learning and see how it differs from supervised learning. Classical supervised learning is about learning a task such as classification or regression from a given set of training examples, i.e., features and labels, so during training you already know what to aim for. Reinforcement learning, on the other hand, is "learning by doing": at the start you do not know what to aim for, but instead you learn through an action-reward system. Each action you perform returns a reward, and your task is to maximize the reward over time.

A good and easy-to-understand task for studying this interplay between action and reward is learning to play slot machines, i.e., a multi-armed bandit. Each arm represents a distribution of rewards, and your task will be to learn which arm to choose in order to maximize the reward over many tries.

Requirements

Knowledge

  • basic terms: action, reward, value function, $ \epsilon $-greedy method, incremental implementation

Prerequisites

To solve this notebook, read [SUT18] up to chapter 2.4. [SUT18] is the standard textbook for reinforcement learning and the basis for this and the following notebooks.

Python Modules

import numpy as np
import types
import matplotlib.pyplot as plt

%matplotlib inline

Exercises

The following exercises are based on chapters 2.3 and 2.4 of [SUT18], although the underlying problem is much older. In general, it is about using limited resources to either:

  • acquire new knowledge ("exploration")
  • or optimize your actions based on existing knowledge ("exploitation").

Imagine you are standing in a casino with ten one-armed bandits in front of you.

[Image: a row of one-armed bandit machines]

You have a limited number of trials (e.g., 1000), and at the beginning the only thing you know is that each bandit yields a different reward. The rewards might be normally distributed, as in the picture below:

[Image: normal reward distributions of the ten bandits]

  • Choice among k (here 10) different options, or actions.
  • After each choice you receive a numerical reward $ R_t $ (from a stationary probability distribution, here normal distributions) depending on the action you selected.
  • Objective: maximize the expected total reward over some time period (e.g., over 1000 action selections); a minimal interaction loop is sketched below.
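To make the interaction concrete, here is a small illustrative sketch of the action-reward loop (all names are made up for this sketch; choose_action is only a random placeholder for the strategies developed below):

### Sketch: the basic action-reward loop of the 10-armed testbed
true_means = np.random.normal(0, 1, 10)      # true values q*(a), unknown to the agent

def choose_action():
    return np.random.randint(10)             # placeholder: pick an arm at random

total_reward = 0.0
for t in range(1000):
    a = choose_action()                      # A_t
    r = np.random.normal(true_means[a], 1)   # R_t, drawn from the chosen arm's distribution
    total_reward += r
print(total_reward)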

The Bandits

Before we start working on our objective, we first need to create some bandits.

Task 1

Write a function get_bandit_function which returns a function bandit_fct representing the bandit. Given an action (pulling a bandit arm), bandit_fct returns a reward drawn from that arm's reward distribution. The means of all 10 reward distributions should be drawn from a normal distribution with mean zero and unit variance, and each reward distribution is itself a normal distribution with the selected mean and unit variance.
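As an illustration only (not necessarily the intended solution), such a bandit function can be built as a closure over the array of means; the names below are made up for this sketch:

### Sketch: a bandit function as a closure over the bandit means
def sketch_get_bandit_function(means, sigma=1):
    def bandit_fct(action):
        # reward for `action` drawn from a normal distribution N(means[action], sigma^2)
        return np.random.normal(means[action], sigma)
    return bandit_fct

sketch_means = np.random.normal(0, 1, 10)        # 10 means drawn from N(0, 1)
sketch_bandit = sketch_get_bandit_function(sketch_means)
print(sketch_bandit(0))                          # one reward sample for arm 0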

np.random.seed(43)
bandit_means = np.random.normal(0, 1, 10)  # means drawn from N(0, 1) as described in the task
bandit_means
#####################################
######## Code to generate picture ###
#####################################
plt.plot([0,11],[0,0],linestyle='--')
plt.plot(np.arange(10)+1, bandit_means, 'ro', label=r'$Bandit \ means \ \mu$')
plt.errorbar(np.arange(10)+1, bandit_means, yerr=np.ones(10), fmt='none', label=r'Standard deviation $\sigma$')
plt.grid()
plt.title('Reward normal distributions of the bandits')
plt.ylim(min(bandit_means)-2, max(bandit_means)+2)
plt.xlabel('Bandit number')
plt.ylabel('Expected reward')
plt.legend()
plt.show()

Solution


def get_bandit_function(bandit_means, sigma = 1):
    raise NotImplementedError()
### get a bandit (e.g. lambda function)
bandit_fct = get_bandit_function(bandit_means, sigma=1)
### make a call
print('Test action 4')
print(bandit_fct(3)) # index starts at 0, so this is actually action no. 4
### This should not throw an exception if your implementation is correct
nb_samples = 100000
action_to_test = 3 

assert isinstance(bandit_fct, types.FunctionType)
print('Test 1 passed')

samples = np.empty(nb_samples)
for i in range(nb_samples):
    samples[i] = bandit_fct(action_to_test)
np.testing.assert_almost_equal(samples.mean(), bandit_means[action_to_test], decimal=2)
print('Test 2 passed')

Notation

The Problem

On each of an infinite sequence of time steps, $ t=1,2,3,\ldots $, you choose an action $ A_t $ from $ k $ possibilities and receive a real-valued reward $ R_t $. The reward depends only on the action taken; it is independently and identically distributed (i.i.d.).

The value of an arbitrary action $ a $, denoted $ q_*(a) $, is the expected reward given that action $ a $ is selected:

$ q_*(a) \doteq \mathbb{E}[ R_t \mid A_t = a], \quad \forall a \in \{1,2,\ldots,k\} $

These true values are unknown, and so are the reward distributions.

Nevertheless, you must maximize your total reward.

You must both try actions to learn their values (explore) and prefer those that appear best (exploit).

Exploration / Exploitation Dilemma

Suppose you form estimates (action value estimates):

$ Q_t(a) \approx q_*(a), \quad \forall a $

Define the greedy action at time $ t $ as

$ \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $

$ A^*_t := \argmax_a Q_t(a) $

  • If $ A_t = A^*_t $ then you are exploiting.

  • If $ A_t \neq A^*_t $ then you are exploring.

You can't do both at the same time, but you need to do both.

You can never stop exploring, but maybe you should explore less with time. Or maybe not.

Action Value Methods

Methods that learn Action-Value estimates and nothing else.

For example, estimate action-values as sample averages:

$ Q_t(a) \doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}} $

The sample-average estimates converge to the true values if each action is taken an infinite number of times:

$ \lim_{N_t(a) \rightarrow \infty} Q_t(a) = q_*(a) $

with $ N_t(a) $ the number of times action $ a $ has been taken by time $ t $.
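A tiny illustration of the sample-average estimate (a sketch with a made-up action/reward history, not part of the exercise):

### Sketch: sample-average estimate Q_t(a) from an action/reward history
history_actions = [0, 1, 0, 2, 0]            # A_1, ..., A_{t-1} (made up)
history_rewards = [1.0, 0.5, 2.0, 0.0, 3.0]  # R_1, ..., R_{t-1} (made up)

def sketch_sample_average(a, actions, rewards):
    taken = [r for act, r in zip(actions, rewards) if act == a]
    return sum(taken) / len(taken) if taken else 0.0

print(sketch_sample_average(0, history_actions, history_rewards))  # (1.0 + 2.0 + 3.0) / 3 = 2.0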

Averaging

To calculate the estimate $ Q_{n_a}(a) $ of an action $ a $, we average the first $ n_a - 1 $ rewards received for that action:

$ Q_{n_a}(a) \doteq \frac{R_1 + R_2 + \ldots + R_{n_a-1}}{n_a-1} $

This would require us to store all rewards. To save memory, we can instead compute a so-called running average, storing only the latest $ Q_{n_a}(a) $ and $ n_a $:

$ Q_{n_a+1}(a) \doteq Q_{n_a}(a) + \frac{1}{n_a} \left[ R_{n_a} - Q_{n_a}(a) \right] $
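As an illustration of the running average (a sketch only, not the update_Qs you will write in Task 2), the following shows that it reproduces the plain average:

### Sketch: incremental (running) average of a reward sequence
def sketch_incremental_update(Q_old, n, reward):
    # n is the number of rewards seen for this action, including the current one
    return Q_old + (reward - Q_old) / n

Q, n = 0.0, 0
for r in [1.0, 2.0, 3.0]:
    n += 1
    Q = sketch_incremental_update(Q, n, r)
print(Q)  # 2.0, identical to the plain average (1.0 + 2.0 + 3.0) / 3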

The epsilon-greedy action selection

In greedy action selection, you always exploit.

In $ \epsilon $-greedy, you are usually greedy, but with probability $ \epsilon $ you instead pick an action uniformly at random (possibly the greedy action again).

This is perhaps the simplest way to balance exploration and exploitation.
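As a small illustration (a sketch only, assuming the Q-value estimates are kept in a numpy array):

### Sketch: epsilon-greedy selection over the current action-value estimates
def sketch_epsilon_greedy(Qs, epsilon):
    if np.random.random() < epsilon:
        return np.random.randint(len(Qs))   # explore: uniform random arm (may be the greedy one)
    return int(np.argmax(Qs))               # exploit: greedy arm

Qs_example = np.array([0.2, 1.3, 0.7])
print(sketch_epsilon_greedy(Qs_example, epsilon=0.1))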

Task 2

Implement the action-value method with the incremental implementation and $ \epsilon $-greedy action selection for the 10-armed bandit problem.

a) Implement the update rule for the action-values Q: update_Qs(action, reward, Qs, Ns)

b) Write a function play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps=1000, nb_arms=10) that plays a bandit and returns the Q-values, the rewards and the actions taken.

c) To analyze your algorithm, perform multiple runs, each with a fresh 10-armed bandit, and for every step compute the average reward and the percentage of optimal choices over these runs. Visualize your results for different $ \epsilon $ values. Your plots should look like the following:

[Images: average reward per step and % optimal actions per step for different $ \epsilon $ values]

Hint:

For the last part, use at least 500 runs to get nicely shaped plots.
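One possible way to organize the averaging over runs (a sketch only; it assumes your get_bandit_function from Task 1 and your play_a_bandit from part b) are implemented; only a function is defined, nothing is executed here). The percentage of optimal actions can be accumulated analogously from the returned actions:

### Sketch: average reward per step over many independent runs
def sketch_average_rewards(nb_runs=500, nb_steps=1000, epsilon=0.1):
    avg_rewards = np.zeros(nb_steps)
    for _ in range(nb_runs):
        means = np.random.normal(0, 1, 10)                     # a fresh 10-armed bandit
        fct = get_bandit_function(means, sigma=1)
        _, rewards, _ = play_a_bandit(epsilon, fct, nb_steps=nb_steps)
        avg_rewards += np.asarray(rewards)
    return avg_rewards / nb_runs                               # average reward at each step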

Solution a)

def update_Qs(action, reward, Qs, Ns):
    raise NotImplementedError
### Needed for the next cell to validate your implementation

def init_Ns_Qs_Rs():
    np.random.seed(44)
    Ns = [0 for i in range(10)]
    Qs = [0 for i in range(10)]
    Rs = [np.random.randint(0,10, size=np.random.randint(0,10, size=1)) for i in range(10)]
    return Ns, Qs, Rs

Ns, Qs, Rs = init_Ns_Qs_Rs()
print(Ns)
print(Qs)
print(Rs)
### This cell should not throw an exception

Ns, Qs, Rs = init_Ns_Qs_Rs()

for action, rewards in enumerate(Rs):
    print('action: ', action, 'rewards: ', rewards)
    for r in rewards:
        update_Qs(action, r, Qs, Ns)

print(Ns)
print(Qs)
        
assert Ns[0] == 4
assert Ns[1] == 4
assert Ns[2] == 6
assert Ns[3] == 5
assert Ns[4] == 7
assert Ns[5] == 8
assert Ns[6] == 2
assert Ns[7] == 2
assert Ns[8] == 8
assert Ns[9] == 5
assert Qs[0] == 1.75
assert Qs[1] == 6.25
assert Qs[2] == 5.166666666666667
assert Qs[3] == 5.4
assert Qs[4] == 3.8571428571428577
assert Qs[5] == 4.249999999999999
assert Qs[6] == 1.5
assert Qs[7] == 5.5
assert Qs[8] == 5.125
assert Qs[9] == 3.8

Solution b)

def play_a_bandit(epsilon, bandit_fct, optimistic=0., nb_steps = 1000, nb_arms=10):
    """    
    Plays one bandit for nb_steps steps using epsilon-greedy action selection.

    Parameters
    ----------
    epsilon : float
        Probability of choosing a random action (exploration).

    bandit_fct : function
        Returns a reward for a given action (arm index).

    optimistic : float
        Initial value for the Q-values (for all actions).

    nb_steps : int
        Number of steps, i.e., how often the bandit is played.

    nb_arms : int
        Number of arms of the bandit.

    Returns
    -------
    Qs, rewards, actions
    """
    raise NotImplementedError
    
    return Qs, rewards, actions

Solution c)

######################
### Your code here ###
######################

Gradient Bandit Algorithm

Another approach to solving the bandit problem is the Gradient Bandit Algorithm presented in chapter 2.8 of [SUT18].

Instead of selecting only the bandit currently considered best with probability $ 1 - \epsilon $ and any bandit uniformly with probability $ \epsilon $, we can assign a preference $ H_t(a) $ to each bandit, which we adjust dynamically after each time step $ t $.

These $ H_t(a) $ are then used to compute the probability of choosing each action (bandit) $ a $ as $ A_t $ at the current time step $ t $, with the softmax function:

$ P(A_t = a) = \frac{\exp(H_t(a))}{\sum^k_{b=1}\exp(H_t(b))} = \pi_t(a) $

After we receive the reward $ R_t $ for the chosen action $ A_t $, we update $ H_t(A_t) $:

$ H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar R_t)(1 - \pi_t(A_t)) $

and all other actions $ a \neq A_t $ that we did not choose:

$ H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar R_t) \pi_t(a) $

$ \bar R_t $ is the average of all rewards received so far and serves as a baseline.

Initially, at $ t=0 $, all preferences $ H_{t=0}(a) $ are $ 0 $.
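As an illustration of these two update formulas only (not the graded solution; the reward, baseline and step size below are made-up numbers), a sketch could look like this:

### Sketch: softmax over the preferences H and a single preference update step
def sketch_softmax(H):
    e = np.exp(H - np.max(H))             # subtract the max for numerical stability
    return e / e.sum()

H = np.zeros(10)                          # all preferences start at 0
pi = sketch_softmax(H)                    # uniform probabilities (0.1 each)
A_t = int(np.random.choice(10, p=pi))     # sample an action according to pi
R_t, R_avg, alpha = 1.5, 0.0, 0.1         # made-up reward, baseline and step size

H_new = H - alpha * (R_t - R_avg) * pi                          # update for all a != A_t
H_new[A_t] = H[A_t] + alpha * (R_t - R_avg) * (1 - pi[A_t])     # update for the chosen action A_t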

Task 3

Implement the Gradient Bandit Algorithm for the 10-armed bandit problem. The steps are largely analogous to Task 2.

  • implement the softmax function for the numerical preference H
  • implement the update rule for the preference H
  • write a function play_bandit_grad which plays the bandit machine according to the Gradient Bandit Algorithm
  • write a function get_percentage_optimal_action that returns the percentage of optimal actions taken at each step; test different $ \alpha $ values with and without the baseline (without the baseline, $ \bar R_t $ is not updated and stays $ 0 $). The plot should show that using a baseline has a significant impact compared to not using it:

[Image: % optimal action per step for the gradient bandit algorithm, with and without baseline]

Tip: for the last part, use at least 100 runs to get nicely shaped plots.

Solution

def softmax(H):
    raise NotImplementedError
def update_H(action, R_avg, R_t, H, alpha=0.1):
    """    
    Parameters
    ----------
    
    action: integer
        current action taken
        
    R_avg: float
        average reward until time step t
        
    R_t: float
        current reward at time step t
    
    H: numpy array
        numerical preferences of all actions
    
    alpha: float
        step size parameter     
   
    """
    raise NotImplementedError
bandits_generation_mean = 4.
bandits_means = np.random.randn(10) + bandits_generation_mean
optimal_action = np.argmax(bandits_means)
def play_bandit_grad(optimal_action, nb_steps=1000, 
                     bandit_fct=bandit_fct,
                    use_R_avg_baseline=True, alpha=0.1):
    """    
    Parameters
    ----------
    
    optimal_action: integer
        optimal action which can be taken
    
    nb_steps: integer
        number of steps a bandit is played
    
    bandit_fct:
        function that simulates a bandit play
    
    use_R_avg_baseline: bool
        if True, R_avg is updated at each step as a running
        average of the rewards
    
    alpha: float
        step size parameter
    
    """
    
    H=np.zeros(10)
    R_avg = 0.
    is_optimal_action = np.zeros(nb_steps) # history of optimal steps taken
    
    return H, R_avg, is_optimal_action
nb_steps=1000
def get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=True):
    """
    Parameters
    ----------
    
    n: integer
        number of runs, i.e., how many bandit machines are played
    
    alpha: float
        step size parameter
    
    use_R_avg_baseline: bool
        if True, R_avg is updated at each step as a running
        average of the rewards
    
    """
    percentage_optimal_action = np.zeros(nb_steps)
    
    return percentage_optimal_action/n
percentage_optimal_action_a01 = get_percentage_optimal_action(n=100, alpha=0.1)
percentage_optimal_action_a01_no_baseline = get_percentage_optimal_action(n=100, alpha=0.1, use_R_avg_baseline=False)
percentage_optimal_action_a04 = get_percentage_optimal_action(n=100, alpha=0.4)
percentage_optimal_action_a04_no_baseline = get_percentage_optimal_action(n=100, alpha=0.4, use_R_avg_baseline=False)
plt.figure(figsize=(7,7))
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01, label=r'$\alpha=0.1$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04, label=r'$\alpha=0.4$')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a01_no_baseline, label=r'$\alpha=0.1$, without baseline')
plt.plot(np.arange(nb_steps), percentage_optimal_action_a04_no_baseline, label=r'$\alpha=0.4$, without baseline')
plt.legend(loc=4)
plt.xlabel("steps")
plt.ylabel("% optimal action")
plt.grid()
plt.show()

Literature

[SUT18] Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction. Second edition, MIT Press, 2018.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise Bandits 10 armed testbed

Christian Herta, Oliver Fischer

is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2019 Christian Herta, Oliver Fischer

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.