Exercise - Probabilistic Rankings

Introduction

In this assignment, you will use the (binary) results of the 2011 ATP men's tennis singles season, 1801 games played among 107 players, to compute probabilistic rankings of the players' skills.

Remark: To detect errors in your own code, execute the notebook cells containing assert or assert_almost_equal. These statements raise an exception as long as the calculated result is not correct.
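For illustration, assert_almost_equal from numpy.testing compares floating-point results only up to a tolerance, so small rounding differences do not raise an exception, while a plain assert with == would:

```python
from numpy.testing import assert_almost_equal

# A plain equality check fails here because of floating-point rounding:
# 0.1 + 0.2 evaluates to 0.30000000000000004
assert 0.1 + 0.2 != 0.3

# assert_almost_equal passes: equal up to 7 decimal places by default
assert_almost_equal(0.1 + 0.2, 0.3)
```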

Requirements

Knowledge

To complete this exercise notebook, you should possess knowledge about the following topics.

Python Modules

import numpy as np
import pymc3 as pm
import theano

from theano import tensor as T
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize

%matplotlib inline

Theory - Bernoulli Distribution

Later in this notebook we will make use of the Bernoulli distribution (besides the normal distribution).

The Bernoulli distribution is a distribution over a discrete variable, hence it is described by a probability mass function (PMF). There are only two possible outcomes: $ n=0 $ and $ n=1 $. The outcome $ n=1 $ is often defined as "success" or "positive" and occurs with probability $ p $ (with $ 0<p<1 $), whereas $ n=0 $ is interpreted as "failure" or "negative" and occurs with probability $ q = 1 - p $.
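As a minimal sketch (plain NumPy-style Python, independent of PyMC3; the helper name is our own), the PMF can be written in one line:

```python
def bernoulli_pmf(n, p):
    """PMF of the Bernoulli distribution: p for n=1, q = 1-p for n=0."""
    return p**n * (1 - p)**(1 - n)

print(bernoulli_pmf(1, 0.2))  # 0.2 -- probability of "success"
print(bernoulli_pmf(0, 0.2))  # 0.8 -- probability of "failure"
```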

Bernoulli Example with PyMC3

We observe 10 coin tosses: $ tosses = [0,1,0,0,0,0,0,1,0,0] $

We have a slight feeling that it is not a fair coin, so we are going to build a model in PyMC3 and sample from it. There are different ways to model a coin toss experiment; we take the following approach:

  • Normal distributions for $ heads $ and $ number $:

$ heads \sim \mathcal N(\mu=0, \tau=1) \\ number \sim \mathcal N(\mu=0, \tau=1) $

  • Their difference decides the chances. We assume it might not be a fair coin, so the difference can be interpreted as a difference in the weight of each side.

  • Further, we are interested in a percentage chance for $ heads $ : $ number $, so we will use the logistic function, which maps the output to $ ]0,1[ $:

$ logistic(diff) = \frac{1}{1+\exp(-diff)} $

  • Bernoulli distribution for our observations.
### observed data
tosses = [0,1,0,0,0,0,0,1,0,0]

### PyMC3 uses Theano internally, so we define the logistic function
### with Theano
def theano_logistic(diff):
    return 1/(1+theano.tensor.exp(-diff))
model = pm.Model()

with model:
    heads = pm.Normal("heads", mu=0, sigma=1)
    number = pm.Normal("number", mu=0, sigma=1)
    
    ### Since diff is only a single Tensor we can directly use 
    ### it as argument for pm.Bernoulli()
    diff = theano_logistic(heads-number)
    prediction = pm.Bernoulli("prediction", observed=tosses, p=diff)
nb_samples = 10000
tunes = nb_samples // 10

with model:
    trace = pm.sample(draws=nb_samples, tune=tunes)  
### We can now retrieve our samples from the trace and, e.g., print the mean:
print('Mean of our sampled normal distributions:')
print(trace.get_values('heads').mean())
print(trace.get_values('number').mean())

### Or we plot the smoothed posterior distributions based on our samples
_ = pm.traceplot(trace, var_names=['heads'])
_ = pm.traceplot(trace, var_names=['number'])

Data

If you have not cloned the whole git repository, download the files:

and adjust the paths of the variables tennis_players and tennis_games.

If you have cloned the whole git repository, you can just execute the cells.

tennis_players = np.load("../../data/tennis_players.npy")
nb_tennis_players = len(tennis_players)
print(tennis_players.shape)
print(tennis_players[0:5])
tennis_games = np.load("../../data/tennis_games.npy")
print(tennis_games.shape)
print(tennis_games[0:5])

tennis_players contains a list, where the list index equals the player identity.

tennis_games is an 1801 by 2 matrix of the played games, one row per game: the first column is the identity of the player who won the game, and the second column contains the identity of the player who lost.

Usage example:

for i in [0,1,2]:
    print('Game number ', i)
    print(tennis_players[tennis_games[i,0]], ' won against ', tennis_players[tennis_games[i,1]])

Before we proceed with the exercises, we reduce the dataset from ~1800 games to the first 200 games. Processing all the data with PyMC3 can take very long, which is inconvenient while working on the tasks.

### Reducing dataset size
tennis_games = tennis_games[:200]
tennis_players_tmp = {}
for g in tennis_games:
    tennis_players_tmp[g[0]] = tennis_players[g[0]]
    tennis_players_tmp[g[1]] = tennis_players[g[1]]
tennis_players = tennis_players_tmp
nb_tennis_players = len(tennis_players.keys())
tennis_players = [tennis_players[i] for i in range(nb_tennis_players)]

Exercises

  1. Use PyMC3 to develop a ranking system.
  2. Plot the ranking according to your (learnt) model.
  3. Write a function which gets the ids of two players as input and prints (or returns) a prediction of the probabilities that player 1 resp. player 2 wins, e.g.:
> print_prediction(10, 12)   
Andy-Murray : David-Nalbandian
   0.607 : 0.393

Probabilistic Model

For our model we can assume that each player $ i $ has a skill, which we model with a normal (Gaussian) distribution:

$ skill_i \sim \mathcal N(\mu=0, \tau=1) $

If we subtract the skill of player $ j $ from the skill of player $ i $, we get another distribution, the difference of their skills:

$ diff_{ij} = skill_i - skill_j $

Since we want the probability that player $ i $ wins against player $ j $, it makes sense to put the result into a logistic function:

$ logistic(diff_{ij}) = \frac{1}{1+\exp(-diff_{ij})} \\ \\ \text{with } logistic(diff_{ij}) \in ]0,1[ $

Because it is defined between $ 0 $ and $ 1 $, the result can directly be interpreted as the probability of player $ i $ winning.

Our observations can now be modelled as a Bernoulli distribution with the parameter p set to $ logistic(diff_{ij}) $.
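A minimal numeric sketch (plain NumPy; the helper name is our own) of how a skill difference maps to a win probability:

```python
import numpy as np

def logistic(diff):
    """Map a skill difference to a win probability in ]0,1[."""
    return 1 / (1 + np.exp(-diff))

print(logistic(0.0))   # 0.5 -- equal skills, 50:50 chance
print(logistic(1.0))   # ~0.731 -- player i is more skilled
print(logistic(-1.0))  # ~0.269 -- player j is more skilled
```

Note that logistic(diff) + logistic(-diff) = 1, so the two players' win probabilities always sum to one.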

Task:

Build the model in PyMC3 and sample from it. Best advice: just start, and only look at the hints when you are stuck. Do not get confused by the hints.

Hints:

Things to consider:

  • You have more than 2 players / normal distributions (see coin toss example).
  • The list of games only includes wins.
  • You have a $ diff_{ij} = skill_i - skill_j $ for each game played.
  • When passing a list of tensors to the p argument of pm.Bernoulli(), you might need the function:

    • theano.tensor.stack()

Plot the Skill

Task:

Write a function which prints the sorted mean skill of each player, similar to the following example:

> print_scoring()
Rafael-Nadal                   1.321
Novak-Djokovic                 1.187
Jo-Wilfried-Tsonga             0.811
Juan-Martin-Del-Potro          0.631
Roger-Federer                  0.587
Florian-Mayer                  0.487
Mardy-Fish                     0.369
Adrian-Mannarino               0.332
Tomas-Berdych                  0.327
Andy-Roddick                   0.249
David-Ferrer                   0.241
James-Blake                    0.239
Marcel-Granollers              0.194
Ivan-Dodig                     0.186
Stanislas-Wawrinka             0.157
Robin-Haase                    0.149
Fernando-Verdasco              0.143
Andy-Murray                    0.119
Marcos-Baghdatis               0.112
Nikolay-Davydenko              0.110
Mikhail-Youzhny                0.043
Ernests-Gulbis                 0.040
Juan-Monaco                    0.003
Feliciano-Lopez                -0.019
Bernard-Tomic                  -0.023

Optionally, also make a bar plot, which could look similar to the following (here only 13 players are shown).

(Example bar plot image; an internet connection is needed to display it.)

######################
### YOUR CODE HERE ###
######################

Make Predictions

Task:

Write a function which gets the ids of two players as input and prints (or returns) a prediction of the probabilities that player 1 resp. player 2 wins, e.g.:

> print_prediction(10, 12)   
Andy-Murray : David-Nalbandian
   0.607 : 0.393
######################
### YOUR CODE HERE ###
######################

Literature

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

Exercise - Probabilistic Rankings by Christian Herta
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Christian Herta

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.