Something that gets to many a footy fan, is the feeling that your team has won the game in most areas expect on the scoreboard.

Thinking about this statement a little bit deeper has the following implication. That there are some areas of the game, that if you win, you tend to win the game.

We can think about this by putting it into a model.

The data we will use here is the game differentials of the various game statistics that are available using fitzRoy

The general idea is its fairly obvious if you score more than the opposition you win. Shocking I know! What we are trying to do here, is come up with a concept of we won other things but not the scoreboard, usually when we win these things we tend to win a game.

So we know thanks to fitzRoy we have access to the data on footywire, which has a few extra variables that are not on afltables.

So imagine this scenario, you don’t have access to the scores, only the in game statistics as per these pages, what variables/differentials would you look at to decide to will win?

We know that for example, looking only at free kick differential isn’t very predictive. Is there a combination of differentials that when a team wins they are more likely to win? Lets build a model to find out.

The model I will use here, is a logistic regression model with the binary outcome being win/loss.

Step One Build the dataset

library(tidyverse)

## -- Attaching packages -------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

fitzRoy::player_stats%>%
    filter(Season>2014)%>%
    select(-Player)%>%
    group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
    summarise_all(.funs = sum)%>%
group_by(Match_id)%>%
  arrange(Match_id)%>%
  mutate(diff_cp=c(-diff(CP), diff(CP)))

## # A tibble: 1,462 x 43
## # Groups:   Match_id [731]
##    Date       Season Round  Team    Status Opposition Venue Match_id    CP
##    <date>      <dbl> <chr>  <chr>   <chr>  <chr>      <chr>    <dbl> <int>
##  1 2015-04-02   2015 Round~ Carlton Home   Richmond   MCG       5964   127
##  2 2015-04-02   2015 Round~ Richmo~ Away   Carlton    MCG       5964   121
##  3 2015-04-04   2015 Round~ Gold C~ Away   Melbourne  MCG       5965   136
##  4 2015-04-04   2015 Round~ Melbou~ Home   Gold Coast MCG       5965   150
##  5 2015-04-04   2015 Round~ Essend~ Away   Sydney     ANZ ~     5966   150
##  6 2015-04-04   2015 Round~ Sydney  Home   Essendon   ANZ ~     5966   176
##  7 2015-04-04   2015 Round~ Brisba~ Home   Collingwo~ Gabba     5967   152
##  8 2015-04-04   2015 Round~ Collin~ Away   Brisbane   Gabba     5967   165
##  9 2015-04-04   2015 Round~ West C~ Away   Western B~ Etih~     5968   137
## 10 2015-04-04   2015 Round~ Wester~ Home   West Coast Etih~     5968   156
## # ... with 1,452 more rows, and 34 more variables: UP <int>, ED <int>,
## #   DE <dbl>, CM <int>, GA <int>, MI5 <int>, One.Percenters <int>,
## #   BO <int>, TOG <int>, K <int>, HB <int>, D <int>, M <int>, G <int>,
## #   B <int>, T <int>, HO <int>, GA1 <int>, I50 <int>, CL <int>, CG <int>,
## #   R50 <int>, FF <int>, FA <int>, AF <int>, SC <int>, CCL <int>,
## #   SCL <int>, SI <int>, MG <int>, TO <int>, ITC <int>, T5 <int>,
## #   diff_cp <int>

Something to remember with fitzRoy is that footywire doesn’t have tackles inside 50, meters gained etc for games previous to 2015, so this becomes our first filter

The next thing we do, is we select all the columns except for Player, the reason being is that we don’t need it and also its nice to see how to “deselect” columns as well as select them.

Remember we want data at a game level for each team and we want to be able to come up with the in game differential, this is where our next group_by comes in handy and its related to the summarise_all.

You might be thinking jeez mate that’s a lot to group by can’t you just use Match_id as they are already unique. Yes that is true, I could have, but one thing about summarise_all is that is summarises every column that is not in the group_by. You can test this out by running below, it should spit out an error message. Error in summarise_impl(.data, dots) : Evaluation error: invalid 'type' (character) of argument.

fitzRoy::player_stats%>%
    filter(Season>2014)%>%
    select(-Player)%>%
    group_by( Match_id)%>%
    summarise_all(.funs = sum)

After the summarise_all, we group by Match_id so we can find the differentials by Match_id so we can come up with as an example the contested possession differential diff_cp. I like to arrange the dataset so I can do sanity checks vs footywire.

So the next thing you might be thinking, is jeez doing that for all variables finding their differentials seems a bit tedious.

That is where we can use mutate_if, we can just mutate all columns that is.numeric and is not in the group_by which is another way we could have done original summarise_all instead we could have used summarise_if BUT match_id is numeric!

You can check this by using

str(fitzRoy::player_stats)

## 'data.frame':    76296 obs. of  43 variables:
##  $ Date          : Date, format: "2010-03-25" "2010-03-25" ...
##  $ Season        : num  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Round         : chr  "Round 1" "Round 1" "Round 1" "Round 1" ...
##  $ Venue         : chr  "MCG" "MCG" "MCG" "MCG" ...
##  $ Player        : chr  "Daniel Connors" "Daniel Jackson" "Brett Deledio" "Ben Cousins" ...
##  $ Team          : chr  "Richmond" "Richmond" "Richmond" "Richmond" ...
##  $ Opposition    : chr  "Carlton" "Carlton" "Carlton" "Carlton" ...
##  $ Status        : chr  "Home" "Home" "Home" "Home" ...
##  $ Match_id      : num  5089 5089 5089 5089 5089 ...
##  $ CP            : int  8 11 7 9 8 6 7 7 6 7 ...
##  $ UP            : int  15 10 14 10 10 12 10 6 7 5 ...
##  $ ED            : int  16 14 16 11 13 16 13 7 10 7 ...
##  $ DE            : num  66.7 60.9 76.2 57.9 68.4 88.9 76.5 50 76.9 53.8 ...
##  $ CM            : int  0 1 0 0 1 0 0 0 1 0 ...
##  $ GA            : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ MI5           : int  0 0 0 0 0 0 0 2 0 0 ...
##  $ One.Percenters: int  1 0 0 0 0 1 5 2 5 1 ...
##  $ BO            : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ TOG           : int  69 80 89 69 77 81 84 80 100 88 ...
##  $ K             : int  14 11 12 13 11 5 7 9 6 7 ...
##  $ HB            : int  10 12 9 6 8 13 10 5 7 6 ...
##  $ D             : int  24 23 21 19 19 18 17 14 13 13 ...
##  $ M             : int  3 2 5 1 6 4 2 3 4 2 ...
##  $ G             : int  0 0 1 1 0 0 0 1 0 0 ...
##  $ B             : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ T             : int  1 5 6 1 1 3 2 5 4 4 ...
##  $ HO            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GA1           : int  0 0 0 1 0 0 0 1 0 0 ...
##  $ I50           : int  2 8 4 1 2 2 1 5 0 3 ...
##  $ CL            : int  2 5 3 2 3 3 4 4 1 1 ...
##  $ CG            : int  4 4 4 3 3 1 2 0 2 0 ...
##  $ R50           : int  6 1 3 4 2 0 2 0 3 1 ...
##  $ FF            : int  2 2 1 1 0 0 1 4 1 1 ...
##  $ FA            : int  0 0 2 0 2 1 0 0 0 0 ...
##  $ AF            : int  77 85 94 65 65 62 56 77 61 56 ...
##  $ SC            : int  85 89 93 70 63 72 79 73 68 59 ...
##  $ CCL           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SCL           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SI            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ MG            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TO            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ITC           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ T5            : int  NA NA NA NA NA NA NA NA NA NA ...

So hopefully you have seen a couple of short cuts that come from knowing the dataset and this is an example of why its important to do your checks!

Here is a great a guide to bad data which provides a nice series of steps in how to check things that might commonly go wrong.

fitzRoy::player_stats%>%
         filter(Season>2014)%>%
         select(-Player,-Date)%>%
         group_by(Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
         summarise_all(.funs = sum)%>%
         group_by(Match_id)%>%
         arrange(Match_id)%>%
         mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

## # A tibble: 1,462 x 76
## # Groups:   Match_id [731]
##    Season Round  Team   Status Opposition Venue Match_id    CP    UP    ED
##     <dbl> <chr>  <chr>  <chr>  <chr>      <chr>    <dbl> <int> <int> <int>
##  1   2015 Round~ Carlt~ Home   Richmond   MCG       5964   127   199   233
##  2   2015 Round~ Richm~ Away   Carlton    MCG       5964   121   227   273
##  3   2015 Round~ Gold ~ Away   Melbourne  MCG       5965   136   165   203
##  4   2015 Round~ Melbo~ Home   Gold Coast MCG       5965   150   198   246
##  5   2015 Round~ Essen~ Away   Sydney     ANZ ~     5966   150   171   226
##  6   2015 Round~ Sydney Home   Essendon   ANZ ~     5966   176   192   248
##  7   2015 Round~ Brisb~ Home   Collingwo~ Gabba     5967   152   211   246
##  8   2015 Round~ Colli~ Away   Brisbane   Gabba     5967   165   179   231
##  9   2015 Round~ West ~ Away   Western B~ Etih~     5968   137   236   278
## 10   2015 Round~ Weste~ Home   West Coast Etih~     5968   156   214   275
## # ... with 1,452 more rows, and 66 more variables: DE <dbl>, CM <int>,
## #   GA <int>, MI5 <int>, One.Percenters <int>, BO <int>, TOG <int>,
## #   K <int>, HB <int>, D <int>, M <int>, G <int>, B <int>, T <int>,
## #   HO <int>, GA1 <int>, I50 <int>, CL <int>, CG <int>, R50 <int>,
## #   FF <int>, FA <int>, AF <int>, SC <int>, CCL <int>, SCL <int>,
## #   SI <int>, MG <int>, TO <int>, ITC <int>, T5 <int>,
## #   Season_difference <dbl>, CP_difference <int>, UP_difference <int>,
## #   ED_difference <int>, DE_difference <dbl>, CM_difference <int>,
## #   GA_difference <int>, MI5_difference <int>,
## #   One.Percenters_difference <int>, BO_difference <int>,
## #   TOG_difference <int>, K_difference <int>, HB_difference <int>,
## #   D_difference <int>, M_difference <int>, G_difference <int>,
## #   B_difference <int>, T_difference <int>, HO_difference <int>,
## #   GA1_difference <int>, I50_difference <int>, CL_difference <int>,
## #   CG_difference <int>, R50_difference <int>, FF_difference <int>,
## #   FA_difference <int>, AF_difference <int>, SC_difference <int>,
## #   CCL_difference <int>, SCL_difference <int>, SI_difference <int>,
## #   MG_difference <int>, TO_difference <int>, ITC_difference <int>,
## #   T5_difference <int>

Step Two - Joining Datasets

Now we could just sum up the goals by each player and their behinds, but this would miss out on rushed behinds. So lets join our footywire dataset with the afltables match results

df<-fitzRoy::player_stats%>%
         filter(Season>2014)%>%
         select(-Player)%>%
         group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
         summarise_all(.funs = sum)%>%
         group_by(Match_id)%>%
         arrange(Match_id)%>%
         mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)
df3<-select(df2, Date, Round, Home.Team, Home.Points)
df4<-select(df2, Date, Round, Away.Team, Away.Points)
colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"


df5<-rbind(df4,df3)



df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))

df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))


df6<-inner_join(df,df5, by=c("Team","Date"))

dataset_columns <- c(1,2,4,6,7,8,44:80,81)
dataset<-df6%>%group_by(Match_id)%>%
  arrange(Match_id)%>%
  mutate(Margin=c(-diff(Points), diff(Points)))%>%
  mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
  select(dataset_columns)

Taking Submissions.

I asked twitter, what areas do you think are important to win, to win a game. I got a response and here we go!

So from our above script, all we need to be able to add in is goal accuracy as a predictor or goal accuracy differential?

df<-fitzRoy::player_stats%>%
         filter(Season>2014)%>%
         select(-Player)%>%
         group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
         summarise_all(.funs = sum)%>%
         group_by(Match_id)%>%
         arrange(Match_id)%>%
         mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)
df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds)
df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds)
df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds))
colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df3)[5]<-"Goals"
colnames(df3)[6]<-"Behinds"
df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds))
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"
colnames(df4)[5]<-"Goals"
colnames(df4)[6]<-"Behinds"


df5<-rbind(df4,df3)



df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))

df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

df6<-inner_join(df,df5, by=c("Team","Date"))


dataset_columns <- c(1,2,4,6,7,8,44:77,82:84)
dataset<-df6%>%group_by(Match_id)%>%
  arrange(Match_id)%>%
  mutate(Margin=c(-diff(Points), diff(Points)))%>%
  mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
  select(dataset_columns)

Building the Logistic Regression Model

library(aod)
library(ordinal)

## 
## Attaching package: 'ordinal'

## The following object is masked from 'package:dplyr':
## 
##     slice

library(lme4)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:tidyr':
## 
##     expand

## 
## Attaching package: 'lme4'

## The following objects are masked from 'package:ordinal':
## 
##     ranef, VarCorr

in.sample  <- subset(dataset, Season %in% c(2015:2017))
#in.sample  <- subset(mydata, year ==2008)
out.sample <- subset(dataset, Season == 2018)

in.sample$Win_loss <- factor(in.sample$Win_loss)

out.sample$Win_loss<-factor(out.sample$Win_loss)

To know which columsn we want to scale an easy way is to go names(dataset) which should print out the column names.

temp1<-scale(in.sample[,7:40])
in.sample[,7:40]<-temp1
#attributes(temp1)
temp1.center<-attr(temp1,"scaled:center")
temp1.scale<-attr(temp1,"scaled:scale")

m <- glm(Win_loss ~I50_difference+
           Accuracy+
           R50_difference+
           CCL_difference+
           SCL_difference+
           MI5_difference , data = in.sample, family =binomial)
          
summary(m)

## 
## Call:
## glm(formula = Win_loss ~ I50_difference + Accuracy + R50_difference + 
##     CCL_difference + SCL_difference + MI5_difference, family = binomial, 
##     data = in.sample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.43728  -0.28649  -0.00212   0.26964   2.63087  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -3.45340    0.58793  -5.874 4.26e-09 ***
## I50_difference  7.17314    0.51782  13.853  < 2e-16 ***
## Accuracy        6.43892    1.09304   5.891 3.84e-09 ***
## R50_difference  5.11442    0.38536  13.272  < 2e-16 ***
## CCL_difference  0.16421    0.11256   1.459    0.145    
## SCL_difference  0.04502    0.10581   0.425    0.671    
## MI5_difference  0.87950    0.16978   5.180 2.22e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  620.62  on 1233  degrees of freedom
## AIC: 634.62
## 
## Number of Fisher Scoring iterations: 7

newdata   <- out.sample[ , -ncol(out.sample)]


newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) 



pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")


newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
  select(Team, Opposition, Venue,  Margin, prob.win)%>%
  filter(Margin<0)%>%
  arrange(desc(prob.win))%>%
  top_n(10)

## Selecting by prob.win

##               Team      Opposition            Venue Margin  prob.win
## 1           Sydney North Melbourne              SCG     -2 0.9368923
## 2           Sydney        Adelaide              SCG    -10 0.8578722
## 3         Essendon         Carlton              MCG    -13 0.8239973
## 4       Gold Coast        St Kilda Metricon Stadium     -2 0.7796465
## 5         Essendon       Fremantle    Optus Stadium    -16 0.5778875
## 6         St Kilda      West Coast    Optus Stadium    -13 0.5625316
## 7  North Melbourne        Richmond   Etihad Stadium    -10 0.4065246
## 8    Port Adelaide        Hawthorn     UTAS Stadium     -3 0.4063489
## 9         Brisbane     Collingwood            Gabba     -7 0.3765427
## 10        Adelaide   Port Adelaide    Adelaide Oval     -5 0.3601077

The good thing about having a template sorted out, is that you can make quick changes as you think of other variables you want to test.

For example JT asked about kicks, inside 50s, marks in 50, tackles in 50 and meters gained.

Well lets look at his list of games

m <- glm(Win_loss ~I50_difference+
           K_difference+
          I50_difference+
           MI5_difference+
           T5_difference+
           MG_difference, data = in.sample, family =binomial)
          
summary(m)

## 
## Call:
## glm(formula = Win_loss ~ I50_difference + K_difference + I50_difference + 
##     MI5_difference + T5_difference + MG_difference, family = binomial, 
##     data = in.sample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.68051  -0.29449  -0.00427   0.29706   2.71729  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.05097    0.10101  -0.505 0.613808    
## I50_difference -0.50237    0.17403  -2.887 0.003892 ** 
## K_difference    0.47670    0.15356   3.104 0.001907 ** 
## MI5_difference  0.63284    0.17320   3.654 0.000258 ***
## T5_difference  -0.03405    0.11939  -0.285 0.775503    
## MG_difference   4.19630    0.31214  13.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  631.05  on 1234  degrees of freedom
## AIC: 643.05
## 
## Number of Fisher Scoring iterations: 7

newdata   <- out.sample[ , -ncol(out.sample)]


newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) 



pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")


newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
  select(Team, Opposition, Venue,  Margin, prob.win)%>%
  filter(Margin<0)%>%
  arrange(desc(prob.win))%>%
  top_n(10)

## Selecting by prob.win

##                Team      Opposition          Venue Margin  prob.win
## 1            Sydney        Adelaide            SCG    -10 0.9441431
## 2          Essendon         Carlton            MCG    -13 0.9373812
## 3          Brisbane      Gold Coast          Gabba     -5 0.8991898
## 4  Western Bulldogs          Sydney Etihad Stadium     -7 0.7757990
## 5       Collingwood         Geelong            MCG    -21 0.7610453
## 6     Port Adelaide         Geelong  Adelaide Oval    -34 0.6678467
## 7            Sydney North Melbourne            SCG     -2 0.5908486
## 8           Geelong      West Coast  Optus Stadium    -15 0.5313548
## 9          Hawthorn      West Coast Etihad Stadium    -15 0.4912265
## 10              GWS          Sydney            SCG    -16 0.3831652

Another example Troy Wheatley.

m <- glm(Win_loss ~I50_difference+
           MG_difference+
           CP_difference+
           CM_difference+
           CCL_difference+
           CL_difference+
           HO_difference+
           ITC_difference+
           D_difference+
           CG_difference+
           R50_difference+
           One.Percenters_difference, 
         data = in.sample, family =binomial)
          
summary(m)

## 
## Call:
## glm(formula = Win_loss ~ I50_difference + MG_difference + CP_difference + 
##     CM_difference + CCL_difference + CL_difference + HO_difference + 
##     ITC_difference + D_difference + CG_difference + R50_difference + 
##     One.Percenters_difference, family = binomial, data = in.sample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.09976  -0.24111  -0.00167   0.23880   3.13712  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -0.05870    0.10841  -0.541  0.58818    
## I50_difference             3.63170    0.68743   5.283 1.27e-07 ***
## MG_difference              3.06213    0.41218   7.429 1.09e-13 ***
## CP_difference             -0.13434    0.16628  -0.808  0.41915    
## CM_difference              0.18235    0.13033   1.399  0.16176    
## CCL_difference             0.40136    0.14838   2.705  0.00683 ** 
## CL_difference              0.06447    0.19133   0.337  0.73613    
## HO_difference              0.15043    0.11475   1.311  0.18988    
## ITC_difference             0.40376    0.18444   2.189  0.02859 *  
## D_difference              -0.12950    0.17347  -0.747  0.45534    
## CG_difference             -0.54681    0.12910  -4.235 2.28e-05 ***
## R50_difference             3.20051    0.48150   6.647 2.99e-11 ***
## One.Percenters_difference -0.07831    0.12351  -0.634  0.52606    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1718.92  on 1239  degrees of freedom
## Residual deviance:  556.62  on 1227  degrees of freedom
## AIC: 582.62
## 
## Number of Fisher Scoring iterations: 7

newdata   <- out.sample[ , -ncol(out.sample)]


newdata[,7:40]<-scale(newdata[,7:40],center=temp1.center,scale=temp1.scale) 



pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")


newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
  select(Team, Opposition, Venue,  Margin, prob.win)%>%
  filter(Margin<0)%>%
  arrange(desc(prob.win))%>%
  top_n(10)

## Selecting by prob.win

##               Team      Opposition            Venue Margin  prob.win
## 1         Essendon         Carlton              MCG    -13 0.9225119
## 2           Sydney        Adelaide              SCG    -10 0.8943902
## 3           Sydney North Melbourne              SCG     -2 0.8911554
## 4         Adelaide       Fremantle    Optus Stadium     -3 0.5566460
## 5      Collingwood         Geelong              MCG    -21 0.5081333
## 6              GWS          Sydney              SCG    -16 0.4832603
## 7       Gold Coast        St Kilda Metricon Stadium     -2 0.4386997
## 8  North Melbourne        Richmond   Etihad Stadium    -10 0.4137897
## 9         Hawthorn      West Coast   Etihad Stadium    -15 0.3802405
## 10        Brisbane   Port Adelaide    Adelaide Oval     -5 0.3625098

We have an idea from insightlane

df<-fitzRoy::player_stats%>%
         filter(Season>2014)%>%
         select(-Player)%>%
         group_by(Date,Season, Round, Team, Status, Opposition, Venue, Match_id)%>%
         summarise_all(.funs = sum)%>%
  mutate(disposaltoturnover=D/TO)%>%
         group_by(Match_id)%>%
         arrange(Match_id)%>%
         mutate_if(is.numeric, funs(difference=c(-diff(.), diff(.))))

df2<-fitzRoy::match_results
df2<-df2%>%filter(Season>2014)
df3<-select(df2, Date, Round, Home.Team, Home.Points,Home.Goals,Home.Behinds)
df4<-select(df2, Date, Round, Away.Team, Away.Points,Away.Goals,Away.Behinds)
df3$Accuracy<-(df3$Home.Goals/(df3$Home.Goals+df3$Home.Behinds))
colnames(df3)[3]<-"Team"
colnames(df3)[4]<-"Points"
colnames(df3)[5]<-"Goals"
colnames(df3)[6]<-"Behinds"
df4$Accuracy<-(df4$Away.Goals/(df4$Away.Goals+df4$Away.Behinds))
colnames(df4)[3]<-"Team"
colnames(df4)[4]<-"Points"
colnames(df4)[5]<-"Goals"
colnames(df4)[6]<-"Behinds"


df5<-rbind(df4,df3)



df5<-df5 %>%mutate(Team = str_replace(Team, "Brisbane Lions", "Brisbane"))

df5<-df5 %>%mutate(Team = str_replace(Team, "Footscray", "Western Bulldogs"))

df6<-inner_join(df,df5, by=c("Team","Date"))


dataset_columns <- c(1,2,4,6,7,8,45:79,81:86)
dataset<-df6%>%group_by(Match_id)%>%
  arrange(Match_id)%>%
  mutate(Margin=c(-diff(Points), diff(Points)))%>%
  mutate(Win_loss=if_else(Margin>0,1,0,NULL))%>%
  select(dataset_columns)

in.sample  <- subset(dataset, Season %in% c(2015:2017))
#in.sample  <- subset(mydata, year ==2008)
out.sample <- subset(dataset, Season == 2018)

in.sample$Win_loss <- factor(in.sample$Win_loss)

out.sample$Win_loss<-factor(out.sample$Win_loss)


temp1<-scale(in.sample[,7:41])
in.sample[,7:41]<-temp1
#attributes(temp1)
temp1.center<-attr(temp1,"scaled:center")
temp1.scale<-attr(temp1,"scaled:scale")


m <- glm(Win_loss ~K_difference+
           MG_difference+
           disposaltoturnover_difference,
         data = in.sample, family =binomial)
          
summary(m)

## 
## Call:
## glm(formula = Win_loss ~ K_difference + MG_difference + disposaltoturnover_difference, 
##     family = binomial, data = in.sample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63495  -0.30591  -0.00522   0.30866   2.67106  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.04937    0.09940  -0.497    0.619    
## K_difference                   0.70737    0.17504   4.041 5.32e-05 ***
## MG_difference                  4.05987    0.29071  13.965  < 2e-16 ***
## disposaltoturnover_difference -0.13159    0.18617  -0.707    0.480    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1716.15  on 1237  degrees of freedom
## Residual deviance:  651.09  on 1234  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 659.09
## 
## Number of Fisher Scoring iterations: 7

newdata   <- out.sample[ , -ncol(out.sample)]


newdata[,7:41]<-scale(newdata[,7:41],center=temp1.center,scale=temp1.scale) 



pre.dict    <- predict(m,newdata=newdata, type="response")
pre.dict.m  <- data.frame(matrix(unlist(pre.dict), nrow= nrow(newdata)))
colnames(pre.dict.m) <- c("prob.win")


newdata.pred  <- cbind.data.frame(newdata, pre.dict.m)
newdata.pred%>%
  select(Team, Opposition, Venue,  Margin, prob.win)%>%
  filter(Margin<0)%>%
  arrange(desc(prob.win))%>%
  top_n(10)

## Selecting by prob.win

##                Team      Opposition          Venue Margin  prob.win
## 1            Sydney        Adelaide            SCG    -10 0.9249691
## 2          Essendon         Carlton            MCG    -13 0.8865205
## 3  Western Bulldogs          Sydney Etihad Stadium     -7 0.8724064
## 4          Brisbane      Gold Coast          Gabba     -5 0.8617977
## 5     Port Adelaide         Geelong  Adelaide Oval    -34 0.8094733
## 6       Collingwood         Geelong            MCG    -21 0.7254149
## 7            Sydney North Melbourne            SCG     -2 0.6606187
## 8          Hawthorn      West Coast Etihad Stadium    -15 0.5825785
## 9               GWS          Sydney            SCG    -16 0.5081239
## 10         Brisbane     Collingwood          Gabba     -7 0.3885346

Analysis of AFL

But We Won Everywhere but the Scoreboard

Step One Build the dataset

Step Two - Joining Datasets

Taking Submissions.

Building the Logistic Regression Model