Exercise - Bundesliga Game Prediction with PyMC3

Introduction

In this exercises you will define a simple model for predicting soccer games for the German "Bundesliga" (1st League) based prior games using Pyro.

Remark: In order to detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise exceptions, as long as the calculated result is not yet correct.

Requirements

Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

Python Modules

import numpy as np
import pandas as pd
import pymc3 as pm
import scipy.stats
import theano

from theano import tensor as T
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize

%matplotlib inline

Exercises

We have the data of past "Bundesliga" games (1st soccer league in Germany) from 2017 and 2018. Specifically, each datapoint contains the names of the two teams ("Verein" in german) playing agains each other and the final score.

One file contains the names of the teams, e.g. the first two lines (and header lines of bundesliga_Verein.csv:

V_ID;Name;Liga
1;FC Bayern München;1
2;FC Schalke 04;1
...

The header line V_ID;Name;Liga translates into the columns team_ID; name, league

The second file contains ~1.300 games, e.g.:

Spiel_ID;Spieltag;Datum;Uhrzeit;Heim;Gast;Tore_Heim;Tore_Gast
1;1;2017-08-18;20:30:00;1;5;3;1
2;1;2017-08-19;15:30:00;7;12;1;0
...

The header line here means game_ID;date_of_play;date;time;home_team_ID,guest_team_ID;score_home_team;score_guest_team

Based on this data, our goal is to predict the score of upcoming games.

Loading and Preprocessing of Data

First we'll loead the data from the corresponding csv-files and do some preprocessing. As the files not only contain games of the 1st league we have to filter them. We do the data loading and preprocessing using the pandas library's dataframe class.

### Load the data to lookup team-id and corresponding team-name
url_vereine_csv = "https://gitlab.com/deep.TEACHING/educational-materials/raw/dev/notebooks/data/bundesliga_Verein.csv"
### Alternative URL:
# url_vereine_csv = "https://github.com/hsro-wif-prg2/hsro-wif-prg2.github.io/raw/master/examples/src/main/resources/bundesliga_Verein.csv"
clubs = pd.read_csv(url_vereine_csv, sep=';')

### for convinience the club id should start with 0
clubs.V_ID = clubs.V_ID - 1
clubs = clubs.set_index("V_ID")

### just 1. league 
club_ids = clubs[clubs.Liga==1].index
club_ids
### Now load the data about past games from 2017 and 2018
url_spiele_csv = "https://gitlab.com/deep.TEACHING/educational-materials/raw/dev/notebooks/data/bundesliga_Spiel.csv"
### Alternative URL:
# url_spiele_csv = "https://github.com/hsro-wif-prg2/hsro-wif-prg2.github.io/raw/master/examples/src/main/resources/bundesliga_Spiel.csv"
games = pd.read_csv(url_spiele_csv, sep=';')

#del(games["Unnamed: 8"]) ### not existent anymore?

### for convinience the club id should start with 0
games.Heim = games.Heim-1
games.Gast = games.Gast-1
### Filter for games of teams in 1st league
relevant_games = games[games.Heim.isin(club_ids)]

### Have a look at the data
relevant_games
### We'll use the games from 2017 for training the model
actual_date = "2018-01-01"
relevant_games = relevant_games[games.Datum < actual_date]
print('Size of training set: ', len(relevant_games))
### We are not interested in date and time data,
### so we extract only team ids and score:

def get_goal_results(gh="Tore_Gast"):
    result = list()
    for i in relevant_games.iterrows():
        ### index 0 contains pandas-dataframe id, 
        ### which we are not interested in.
        r = i[1]
        result.append((r.Heim, r.Gast, r[gh]))
    return result

### Lists of tuples
away_goals_ = get_goal_results("Tore_Gast")
home_goals_ = get_goal_results("Tore_Heim")

### Print one example
print('First tuple of list \'away_goals_\':')
print('(id_home, id_guest, goals_guest)')
print(away_goals_[0])

print('\nFirst tuple of list \'away_goals_\':')
print('(id_home, id_guest, goals_home)')
print(home_goals_[0])

Poisson distribution

As the number of goals$ k $ a team scores is not continous but a discrete value, it can be modeled with a Poisson distribution$ P_\lambda (k) $:

Probability for outcome$ k \in \{0, 1, 2, \dots\} $

$ P_\lambda (Z=k) = \frac{\lambda^k}{k!}\, \mathrm{e}^{-\lambda} $

with parameter$ \lambda>0 $

Notes:

  • If a random variable$ Z $ has a Poisson mass distribution, we denote this by writing

$ Z \sim \text{Poi}(\lambda) $

-$ \lambda $ is also the expectation and variance of the Poisson distribution

$ E\left[ \;Z\; | \; \lambda \;\right] = \sigma^2 \left[ \;Z\; | \; \lambda \; \right] = \lambda $

  • PMFs (probability mass functions) for discrete variables sum up to$ 1.0 $
### An example for a poisson distribution
k=np.arange(0,10)
lambda_= 3.1
probs = scipy.stats.poisson.pmf(k, lambda_)
print('sum ob probabilities (might not be exactly 1.0 due to numerical precision):\n', probs.sum())

### Let us plot it
plt.figure(figsize=(8,6))
plt.plot(k, probs, 'bo', ms=6, label='poisson pmf')
plt.xlabel("k")
plt.ylabel("probability mass")
scipy.stats.poisson.pmf(k, lambda_)

Probabilistic Model

Let us assume each team$ i $ has an offence and defence strength (distribution). The average goals per game is approximatley$ 3 $ $ \approx 3 \Rightarrow \Delta \mu=1.5 $).

Since both teams score, on average each team scores$ 1.5 $ goals. So we model the offence as a normal (Gaussian) distribution with mean$ \mu=1.5 $:

$ offence_i \sim \mathcal N(\mu=1.5, \tau=1) $ $ defence_i \sim \mathcal N(\mu=0, \tau=1) $ $ \mathcal N $ is the Gaussian distribution with parameters

  • mean:$ \mu $
  • precision:$ \tau=1/\sigma^2 $ (variance:$ \sigma^2 $)
  • Gaussian distribution can either be specified with sigma$ \sigma $ or tau$ \tau $

Model:

The number of goals that team$ i $ scores against team$ j $ is Poisson distributed with

$ goals_{ij} = Poisson \left(\lambda = (offence_i-defence_j) \right) $

Further Notes:

  • We also could have chosen$ \mu_{defense} $ to be greater or smaller than$ 0 $, e.g.$ \mu_{defense}=1.0 $. But in order to model the average goals per game being$ 3 $ that would have led to$ \mu_{offense}=2.5 $. -$ \tau = 1.0 $ is also arbitrary and could be anything, as long as$ \tau_{offense} = \tau_{defense} $.

Graphical Representation of the Model

import daft
def plot_model():
    pgm = daft.PGM([6.3, 4.05], origin=[-1., -1.])
    pgm.add_node(daft.Node("mu_o", r"$\mu_o$", .5, .5, fixed=True))
    pgm.add_node(daft.Node("tau_o", r"$\tau_o$", .5, 1.5, fixed=True))    
    pgm.add_node(daft.Node("o_i", r"o$_i$", 1.5, 1))
    
    pgm.add_node(daft.Node("tau_d", r"$\tau_d$", 2., 3., fixed=True))
    pgm.add_node(daft.Node("mu_d", r"$\mu_d$", 3., 3., fixed=True))
    pgm.add_node(daft.Node("d_j", r"d$_j$", 2.5, 2.2))
    
    pgm.add_node(daft.Node("g", r"g$_{ij}$", 2.5, 1., observed=True))

    
    # Add in the edges.
    pgm.add_edge("mu_o", "o_i")
    pgm.add_edge("tau_o", "o_i")
    pgm.add_edge("mu_d", "d_j")
    pgm.add_edge("tau_d", "d_j")
    pgm.add_edge("o_i", "g")
    pgm.add_edge("d_j", "g")
    # And plates.
    pgm.add_plate(daft.Plate([2., 0.2, 1., 2.5], label=r"$j$", shift=0.))
    pgm.add_plate(daft.Plate([1., 0.5, 2.2, 1.1], label=r"$i$", shift=0.))
    pgm.render()
plot_model()

Implementation with PyMC

Let us define the model with PyMC3:

  • First we define our normal distributions for offence strengths and defence strengths (one for each team)

For eacht tuple in home_goals_:

  • We iterate through our list of tuples home_goals_

  • Doing this we generate a list of goals scored at home

  • We also generate a list for the home_value, which is a result of offence-defence.

  • Afterwards we define the poisson distribution with goals scored and the mean

    • This is important: By specifing the observed parameter, we tell PyMC3 that this is our underlying data.

Task:

Below you see the code to iterate through the list home_goals_. Your task is to add the code to do the same for away_goals_.

### Number of clubs
nb_clubs = len(club_ids)

### offence-defence could be negative, but number of goals scored cannot.
### When dealing with distributions for continous variables, 
### e.g. gaussian (normal), theres never probability of exactly 0,
### so for zero we'll use something very small instead
low = 1e-10
model = pm.Model()

with model:
    offence = pm.Normal("offence", tau=1., mu=1.5, shape=nb_clubs)
    defence = pm.Normal("defence", tau=1., mu=0., shape=nb_clubs)

    home_goals = []
    home_values = []
    ### iterate through goals scored at home
    for i,(heim, gast, goals) in enumerate(home_goals_):
        home_value = offence[heim]-defence[gast]
        ### no negative home value
        home_value = T.switch(T.lt(home_value, 0.), low, home_value) 
        home_values.append(home_value)
        home_goals.append(goals)
    home_values_ = T.stack(home_values)
    mu_home = pm.Deterministic("home_rate", home_values_)
    pm.Poisson("home_goals", observed=home_goals, mu=mu_home)
    
    ######################
    ### YOUR CODE HERE ###
    ######################

Sampling with PyMC3

Now we can simple sample with PyMC3, per default NUTS (No-U-Turn-Sampler) is used.

### number of samples
nb_samples=1000
### tune adds additional number of samples.
### after sampling these additional samples get discarded
### as the very first samples are very inaccurate
tune = nb_samples//10

with model:
    trace = pm.sample(draws=nb_samples, tune=tune) 
### alternatively this could be used to discard first samples
### don't use the first samples - already considered by tune
burn = 0
trace = trace[burn:]
trace.get_values("offence")

The first thing we could do is to calculate the mean offence and defence strength of for the individula team. If your code so far is corret the output of the follwoing two cells should look similar to the following:

Mean offence strenght per team:
FC Bayern München   2.125811790340167
FC Schalke 04   1.7037162684586846
Borussia Dortmund   2.257390525324742
RB Leipzig   1.6020905056271877
Bayer Leverkusen   2.0050107075685344
...
Mean defence strenght per team:
FC Bayern München   0.7821524828924918
FC Schalke 04   0.054772520741117325
Borussia Dortmund   -0.06449457068251914
RB Leipzig   -0.006551429185531419
Bayer Leverkusen   0.06109513176503763
...
### mean of offence strength
print('Mean offence strenght per team:')
for i in club_ids: 
    print(clubs[clubs.index==i]["Name"][i], " ", trace.get_values("offence")[:,i].mean())

Plotting Results

But calculating the mean, we lose some very valuable information: the uncertainty, which is actually the reason we do probabilistic programming. So we better do not calculate the mean, but instead approximate the distribution of the offence and defence.

A simple way to do this is e.g. using histogram. Note that these histograms do not show us the exact distribtions, but just an approximation of their shape. The absolute values on the y-scale also just depend on the number of samples we used for trace = pm.sample(draws=nb_samples, tune=tune).

The plot should look similar to the following:

internet connection needed

nb_clubs = club_ids.max() + 1
club_ids_to_show = [0,1,2]
bins=40
fig, axes = plt.subplots(nrows=len(club_ids_to_show), ncols=2)

for i in club_ids_to_show: 
    title = "Offence of " + clubs[clubs.index==i]["Name"][i]
    axes[i, 0].set_title(title)
    axes[i, 0].hist(trace.get_values("offence")[:,i], bins=bins, range=(0,4.2))
    
    axes[i, 1].hist(trace.get_values("defence")[:,i], bins=bins, range=(-2.,2.2))
    title = "Defence of " + clubs[clubs.index==i]["Name"][i]
    axes[i, 1].set_title(title)

#fig.suptitle("Offence and defence distribution of the clubs.")
fig.subplots_adjust(hspace=0.5)
fig.tight_layout()

The smoothed PDFs (probability density functions) can also be plotted using the traceplot function. The first row of plots should look similar to the following plot:

internet connection needed

for i in club_ids_to_show: 
    title = "Offence of " + clubs[clubs.index==i]["Name"][i]
    ax = pm.traceplot(trace['offence'][:,i])
    ax[0,0].set_title(title + ' - smoothed pdf')
    ax[0,1].set_title(title + ' - sampling trace')
    title = "Defence of " + clubs[clubs.index==i]["Name"][i]
    ax = pm.traceplot(trace['defence'][:,i])
    ax[0,0].set_title(title + ' - smoothed pdf')
    ax[0,1].set_title(title + ' - sampling trace')

Expected Winner

Task:

Use the model for an estimate the probability that team 1 wins again team 2 (or a tie)?

Implement the function get_probs_winner(id_team1, id_team2).

Hint:

Use sampling from the (variational) probability densities:

  • First implement get_diffs(team_i, team_j) which calculates: *$ diff_{ij} = offence_i-defence_j $

    • And$ diff_{ji} = offence_j-defence_i $
    • Example output:

      •   goals_ij, goals_ji = get_diffs(0, 17)
          print('diff 1:')
          print(goals_ij, goals_ij.shape)
          print('diff 2:')
          print(goals_ji, goals_ji.shape)
        
          diff 1:
          [1.56857698 2.07895304 2.04293421 ... 2.22588423 2.08634941 2.20760956] (4000,)
          diff 2:
          [0.1226164  0.0446999  0.78191971 ... 0.29486617 0.50708817 0.3921155 ] (4000,)
        
  • Second, implement get_probs_winner(team1, team2), which uses get_diffs(team_i, team_j) inside:

    • Sample from a poisson distribution the number of goals for each value in$ diff_{ij} $ and each value in$ diff_{ji} $ to get the number of goals scored: $ goals_{ij} = Poisson \left(\lambda = (diff_{ij}) \right) $ $ goals_{ji} = Poisson \left(\lambda = (diff_{ji}) \right) $
    • Then compare the each$ goals_{ij} $ an$ goals_{ji} $.
    • Example output:

      •   get_probs_winner(0, 17)
        
        FC Bayern München : Hamburger SV
        (prob team1, prob team2, tie)
        (0.80625, 0.03675, 0.157)
        

All functions you need were already presented in this notebook, e.g.:

  • trace.get_values("offence") to retrieve the concrete values of the samples as np-array
  • scipy.stats.poisson.pmf(k, lambda_) or np.random.poisson(lambda_) to sample
def get_diffs(team_1, team_2):
    
    raise NotImplementedError()
    
    return diff_ij, diff_ji
diff_ij, diff_ji = get_diffs(0, 17)
print('diff 1:')
print(diff_ij, diff_ij.shape)
print('diff 2:')
print(diff_ji, diff_ji.shape)
def get_probs_winner(team1, team2):
    print (clubs.at[team1, "Name"],":",clubs.at[team2, "Name"])
    
    raise NotImplementedError()
    
    print('(prob team1, prob team2, tie)')
    return p1, p2, tie
get_probs_winner(0, 17)

Distribution of Expected Goals

Task:

Use the model and the sampling trace to predict how many goals a teams scores agains another team.

What is the expected number of the goals each team scores?

Implement the corresponding python (plot) functions.

For team with index 0 (Bayern München) vs . team with index 17 (Hamburg), the plot (a poisson distribution) should look similar to the follwowing:

internet connection needed

Hints:

  • Implement get_goal_distribution(diff, max_goals=20) which:

    • takes as input parameter one of the outputs of get_diffs(team_i, team_j)
    • and outputs poisson distribution for the probabilities that the corresponing team scored 0, 1, ..., or max_goals: *$ goals_{ij} = Poisson \left(\lambda = (diff_{ij}) \right) $
    • example output:

      •   poisson_goals_1 = get_goal_distribution(diff_ij)
          print('poisson_goals_1 1:')
          print(poisson_goals_1)
        
          poisson_goals_1 1:
          [1.85886410e-01 2.99635677e-01 2.54125262e-01 1.50583060e-01
          6.99165158e-02 2.70688509e-02 9.08722901e-03 2.71762207e-03
          7.38568964e-04 1.85245582e-04 4.34180830e-05 9.60812360e-06
          2.02488505e-06 4.09324980e-07 7.98263618e-08 1.50845350e-08
          2.77028983e-09 4.95305584e-10 8.62693285e-11 1.46354975e-11]
        
  • Implement plot_goal_diffs(team_1, team_2):

    • combines the two functions get_diffs(team_i, team_j) and get_goal_distribution(diff, max_goals=20) and plots the distributions, shown in the cell above.
def get_goal_distribution(diff, max_goals=20):
    
    raise NotImplementedError()
    
    return poisson_goals
poisson_goals_1 = get_goal_distribution(diff_ij)
print('poisson_goals_1 1:')
print(poisson_goals_1)
poisson_goals_2 = get_goal_distribution(diff_ji)
print('poisson_goals_2 2:')
print(poisson_goals_2)
def plot_goal_diffs(team_1, team_2):
    
    raise NotImplementedError()
    
# probability that team 0 scores 0,1,2, ... goals against team 8
plot_goal_diffs(0, 17)

Extension of the Model

Task:

Extend the model with home advantage:

At home a team is in general a little bit stronger as away. Modify the model to take this into account. This could e.g. be modeled with another normal distribution:

for home games: $ goals_{ij} = Poisson \left(\lambda = (offence_i-defence_j + \frac{advantage}{2}) \right) $

and games as guest: $ goals_{ij} = Poisson \left(\lambda = (offence_i-defence_j - \frac{advantage}{2}) \right) $

And plot the posterior home advantage distribution:

internet connection needed

Notes:

  • Note there is no index$ i $ or$ j $ at the PDF$ advantage $. For now it is eassier to assume there is a general advantage which is equal for all teams.
######################
### YOUR CODE HERE ###
######################

Summary and Further Modifications

In this exercise you modeled a real world scenario with a bayesian model and tuned the prior probabilites with sampling giving posterior distributions. Remember that this was just one way of modelling the real world scenario, our model could also have been totally different.

A further extension for example could be to include not a single home advange distribution, but one for each team (like we also modeled individual offence and defence strenghts). Or you could also add individual away disadvantes.

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise - Bundesliga Game Prediction with PyMC3 by Christian Herta, Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.