As its been pointed out to me on that it would be handy if within fitzRoy that it should contain past players data from footywire.
So here is roughly how to do that.
Step 1 - get all the packages you need
library(rvest)
## Loading required package: xml2
library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::pluck() masks rvest::pluck()
library(naniar)
naniar isn’t something I have used on the blog before but I find its pretty handy. What we are doing here is we are web-scraping and we have an issue in our scrape. Basically the row in which we pluck our height, weight and position from sometimes contains height and weight but not position.
In that case when we make our data tidy, when we go to find a ‘’position’’ for a player who doesn’t have one, we get the whole row. Hopefully this will be a bit more clear with the screenshots and alike below.
Step 2 - Pick a team to scrape (this case I’m going to do West Coast)
Keeping in mind we want to scrape the webpages lets just look at a few eagles players and see how their pages are structured. To do this I don’t think we need to really dig into the html but lets just look at the pages.
First player I decided to click on was John Annear we can see we can get his date of birth, his height and weight.
Second player I decided to look at was Corey Adamson here we can see we can get his date of birth, his height, his weight, his position and his draft position.
Thirdy player I decided to look at was Mark Lecras
Here we can see we get his date of birth, games played, height, weight, position and draft position.
Then after randomly clicking on a few other players I am reasonble confident that these 3 cover the variety of different bits of information provided for all past players.
So now lets get scraping
Step 3 - Scrape a single player
For this example lets do Mark Lecras, the reason is he has the most amount of information for past players on his page.
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
page<-read_html("https://www.footywire.com/afl/footy/pp-west-coast-eagles--mark-lecras")
player<- page%>%
html_nodes("#playerProfileName")%>%
html_text()
player
## [1] "Mark Lecras"
playing.for<- page%>%
html_nodes("#playerProfileTeamDiv a b")%>%
html_text() %>% as.tibble()
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
playing.for
## # A tibble: 1 x 1
## value
## <chr>
## 1 West Coast Eagles
games<-page%>%
html_nodes("#playerProfileData1")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Games: ).*(?=Born:)"))%>%as.tibble()
games
## # A tibble: 1 x 1
## value
## <chr>
## 1 "219 "
born<- page%>%
html_nodes("#playerProfileData1")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_remove(".*Born: ")%>%
str_squish() %>% as.tibble()
born
## # A tibble: 1 x 1
## value
## <chr>
## 1 August 30, 1986
weight<-page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Weight:).*(?=kg)"))%>%as.tibble()
weight
## # A tibble: 1 x 1
## value
## <chr>
## 1 " 82"
height<-page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Height:).*(?=cm)"))%>%as.tibble()
height
## # A tibble: 1 x 1
## value
## <chr>
## 1 " 184"
draft_position <- page%>%
html_nodes("#playerProfileDraftInfo")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Drafted: ).*(?=by)"))%>%as.tibble()
draft_position
## # A tibble: 1 x 1
## value
## <chr>
## 1 Round 3, Pick #37 2004 National Draft
club_drafted <- page%>%
html_nodes("#playerProfileDraftInfo a+ a")%>%
html_text()%>%str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_remove(".*by") %>% as.tibble()
club_drafted
## # A tibble: 1 x 1
## value
## <chr>
## 1 West Coast Eagles
position <- page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_remove(".*Position: ")%>%
str_squish() %>% as.tibble()
position
## # A tibble: 1 x 1
## value
## <chr>
## 1 Forward
player_information <- cbind.fill(player, playing.for, games,born, weight, height,draft_position, club_drafted, position)
player_information <- as.tibble(player_information)
player_information
## # A tibble: 1 x 9
## V1 value V3 V4 V5 V6 V7 V8 V9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Mark L… West Coa… "219 " August… " 82" " 18… Round 3, Pic… West Co… Forw…
Selecter gadget is doing a great job, but what we can notice here is that the html_nodes do not change for height, weight, position. So in our earlier case with John Annear what this means is that in his position column it will contain the whole row as we were not able to remove words post ‘position’ like we did for Mark Lecras.
This is where naniar will come in handy, for all those rows that do not have a valid position (duplicated text values we have already in other columns) in this example it would be John Annears Height and Weight appearing in his position column. We can just use naniar and specifically its replace_with_na
function to make those specific data entries na
Lets see how this would work below.
Step 4 - Scrape a whole retired team list.
url<-"https://www.footywire.com/afl/footy/ti-west-coast-eagles"
link<-read_html(url)%>%
html_nodes(".lnormtop a")%>%
html_attr("href")
url_players<-str_c("https://www.footywire.com/afl/footy/",link)
player_info <- function(x){
page<-read_html(x)
player<- page%>%
html_nodes("#playerProfileName")%>%
html_text()
player
playing.for<- page%>%
html_nodes("#playerProfileTeamDiv a b")%>%
html_text() %>% as.tibble()
playing.for
games<-page%>%
html_nodes("#playerProfileData1")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Games: ).*(?=Born:)"))%>%as.tibble()
games
born<-page%>%
html_nodes("#playerProfileData1")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_remove(".*Born: ")%>%
str_squish() %>% as.tibble()
born
weight<-page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Weight:).*(?=kg)"))%>%as.tibble()
weight
height<-page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Height:).*(?=cm)"))%>%as.tibble()
height
draft_position <- page%>%
html_nodes("#playerProfileDraftInfo")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_extract(pattern =("(?<=Drafted: ).*(?=by)"))%>%as.tibble()
draft_position
club_drafted <- page%>%
html_nodes("#playerProfileDraftInfo a+ a")%>%
html_text()%>%str_replace_all("[\r\n]" , "")%>%
str_squish()%>%
str_remove(".*by") %>% as.tibble()
club_drafted
position <- page%>%
html_nodes("#playerProfileData2")%>%
html_text()%>%
str_replace_all("[\r\n]" , "")%>%
str_remove(".*Position: ")%>%
str_squish() %>% as.tibble()
position
#combine, name, and make it a tibble
player_information <- cbind.fill(player, playing.for, games,born, weight, height,draft_position, club_drafted, position)
player_information <- as.tibble(player_information)
return(player_information)
}
footywire <- purrr::map_df(url_players, player_info)
footywire
## # A tibble: 238 x 9
## V1 value V3 V4 V5 V6 V7 V8 V9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Corey … West C… <NA> Februa… " 86" " 18… Round 2, … West C… Defender,…
## 2 Damien… West C… "54 " March … " 75" " 18… Round 4, … Collin… Midfield
## 3 John A… West C… <NA> June 1… " 80" " 17… <NA> <NA> Playing H…
## 4 David … West C… "3 " June 1… " 93" " 19… Round 3, … West C… Playing H…
## 5 Steven… West C… "79 " Januar… " 83" " 18… Round 1, … West C… Forward
## 6 Ashley… West C… <NA> April … " 86" " 18… Round 2, … West C… Midfield
## 7 Jason … Sydney… "193 " Novemb… " 10… " 20… Pick #1 1… West C… Ruck
## 8 Drew B… West C… "265 " Februa… " 89" " 18… Pick #1 1… West C… Defender,…
## 9 Adrian… West C… <NA> Decemb… " 85" " 17… <NA> <NA> Playing H…
## 10 Glen B… West C… <NA> June 1… " 95" " 19… <NA> <NA> Playing H…
## # … with 228 more rows
names(footywire) <- c("player", "club", "games","born","weight","height", "draft_position", "club_drafted", "position")
df_replace<-footywire%>%filter(!position %in% c("Midfield", "Defender", "Defender, Forward", "Defender, Midfield",
"Forward", "Forward, Ruck", "Midfield, Forward", "Ruck"))%>%
select(position)
footywire_eagles<-footywire%>%naniar::replace_with_na(replace=list(position=df_replace))
footywire_eagles
## # A tibble: 238 x 9
## player club games born weight height draft_position club_drafted
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Corey… West… <NA> Febr… " 86" " 185" Round 2, Pick… West Coast …
## 2 Damie… West… "54 " Marc… " 75" " 181" Round 4, Pick… Collingwood…
## 3 John … West… <NA> June… " 80" " 178" <NA> <NA>
## 4 David… West… "3 " June… " 93" " 194" Round 3, Pick… West Coast …
## 5 Steve… West… "79 " Janu… " 83" " 180" Round 1, Pick… West Coast …
## 6 Ashle… West… <NA> Apri… " 86" " 188" Round 2, Pick… West Coast …
## 7 Jason… Sydn… "193… Nove… " 104" " 201" Pick #1 1991 … West Coast …
## 8 Drew … West… "265… Febr… " 89" " 184" Pick #1 1992 … West Coast …
## 9 Adria… West… <NA> Dece… " 85" " 177" <NA> <NA>
## 10 Glen … West… <NA> June… " 95" " 195" <NA> <NA>
## # … with 228 more rows, and 1 more variable: position <chr>
Bingo there you have it, how to scrape a whole teams worth of data.