Out of every hundred people,
those who always know better:
fifty-two.
Unsure of every step:
almost all the rest.
- Wisława Szymborska, A Word on Statistics (translated from the Polish by Joanna Trzeciak)
In this section I’m going to discuss a few methods for computing different kinds of descriptive statistics in R. I won’t talk about data visualisation at all until the next section. In the prelude to data section I referred to the AFL data set that contains some information about every game played in the Australian Rules Football (AFL) league between 1987 and 2010, available in the afl.csv file. As usual, I have tidyverse loaded, so I’ll use readr::read_csv¹ to import the data:
afl <- read_csv("./data/afl.csv")
afl
## # A tibble: 4,296 x 12
## home_team away_team home_score away_score year round weekday
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 North Me… Brisbane 104 137 1987 1 Fri
## 2 Western … Essendon 62 121 1987 1 Sat
## 3 Carlton Hawthorn 104 149 1987 1 Sat
## 4 Collingw… Sydney 74 165 1987 1 Sat
## 5 Melbourne Fitzroy 128 89 1987 1 Sat
## 6 St Kilda Geelong 101 102 1987 1 Sat
## 7 West Coa… Richmond 133 119 1987 1 Sun
## 8 North Me… Melbourne 92 59 1987 2 Fri
## 9 Geelong Brisbane 131 150 1987 2 Sat
## 10 Collingw… Carlton 107 114 1987 2 Sat
## # … with 4,286 more rows, and 5 more variables: day <dbl>,
## # month <dbl>, game_type <chr>, venue <chr>, attendance <dbl>
This data frame has 4296 cases and 12 variables, so I definitely don’t want to be summarising anything manually!
When summarising a variable, there are a variety of different measures that one might want to calculate: means, medians, standard deviations, skewness, and so on. There are so many of these functions out there that I’d never be able to cover all of them briefly, even restricting ourselves to numeric variables. Here is a quick overview of the most common statistics:
- mean() calculates the arithmetic mean. If you want a trimmed mean, it has an optional trim argument.
- median() calculates the median; more generally, the quantile() function computes arbitrary quantiles, specified with the probs argument.
- min() and max() are the minimum and maximum; IQR() returns the interquartile range and range() returns the range.
- sd() calculates the standard deviation (and var() computes the variance).
- psych::skew() and psych::kurtosi() compute skewness and kurtosis respectively.

For example, to return the mean and standard deviation for the home_score variable I would use these commands:
mean(afl$home_score)
## [1] 101.5084
sd(afl$home_score)
## [1] 29.65969
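To illustrate a couple of the other tools in that list, here’s a minimal sketch computing a 10% trimmed mean and the quartiles of the same variable (I’ve omitted the output):
# trimmed mean: drop the highest and lowest 10% of scores before averaging
mean(afl$home_score, trim = 0.1)
# quartiles, specified via the probs argument
quantile(afl$home_score, probs = c(0.25, 0.5, 0.75))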
It’s worth noting that many of these functions have an na.rm argument (or similar) that specifies how R will handle missing data (i.e., NA values). Quite often, the default is na.rm = FALSE, meaning that R will not remove missing cases. That can be useful as a way of discovering that your data has missing values: if one of the values is missing then by definition the mean of all your values must also be missing!
age <- c(21, 19, 39, NA, 56)
mean(age)
## [1] NA
In practice, though, this is often more of an annoyance than an aid, so you’ll usually want to calculate the mean like this:
mean(age, na.rm = TRUE)
## [1] 33.75
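Relatedly, if you want to know which values are missing rather than simply skipping over them, is.na() will flag them:
is.na(age)
## [1] FALSE FALSE FALSE  TRUE FALSE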
For a pair of numeric variables, we often want to report the strength and direction of correlations. The cor function does this, and by default it reports the Pearson correlation:
cor(afl$home_score, afl$away_score)
## [1] -0.1413139
As an aside, I don’t like writing the name of the data frame twice within the same command. It’s a visual distraction and (in my opinion!) it makes the code harder to read. The magrittr package solves this for us. In addition to being the origin of the standard R pipe %>%, it contains the %$% operator that “exposes” the names of the variables inside a data frame to another function. So let’s load the package:
library(magrittr)
Now here’s how I would compute the same correlation using %$%:
afl %$% cor(home_score, away_score)
## [1] -0.1413139
In one sense the difference is purely aesthetic, but aesthetics really do matter when analysing data. In the piped version, I have the name of the data set on the left, and the thing that I’m calculating from it on the right. That feels intuitive and “normal” to me. The original command mixes everything together and I find myself struggling to read it properly.
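As a side note, if you’d rather not load magrittr at all, base R’s with() function achieves much the same effect:
# base R equivalent of the %$% version
with(afl, cor(home_score, away_score))
## [1] -0.1413139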
Regardless of which version you prefer, you’ll be pleased to know that the cor function has a method argument that lets you specify whether to calculate the pearson correlation (the default), the spearman rank correlation, or the kendall tau measure. For instance:
afl %$% cor(home_score, away_score, method = "spearman")
## [1] -0.1406248
A couple of other things to mention about correlations:

- The cor.test function allows you to conduct a hypothesis test for whether a correlation is statistically significant (see the sketch after this list).
- The cor function can take a matrix or data frame as input, but it will produce an error if any of the variables are non-numeric. I’ve used the lsr::correlate function to make my life a little easier in this respect: if I want a correlation matrix that only considers the numeric variables, lsr::correlate(afl) works.
- The cor function doesn’t handle situations where one or more variables are categorical. The psych package has a variety of tools that will solve this for you: it includes the tetrachoric, polychoric, biserial and polyserial functions to handle those situations. I won’t describe those here; I just want to mention them in case you need them later!
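To make the first two of those points concrete, here’s a minimal sketch: a significance test for a single correlation, and a correlation matrix computed after keeping only the numeric columns (a dplyr alternative to lsr::correlate, assuming dplyr 1.0 or later for where()). I’ve omitted the output:
# hypothesis test: reports the estimate, a p-value and a confidence interval
afl %$% cor.test(home_score, away_score)

# correlation matrix restricted to the numeric variables
afl %>%
  select(where(is.numeric)) %>%
  cor()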
For categorical variables, where you usually want a table of frequencies that tallies the number of times each value appears, the table function is your friend:²
afl %$% table(weekday)
## weekday
## Fri Mon Sat Sun Thu Tue Wed
## 514 98 2226 1421 25 6 6
Not surprisingly, most football games are played on the weekend. The table function is quite flexible, and allows you to compute cross-tabulations between two, three, or more variables simply by adding more variable names. As an example, here is the cross-tabulation of year by weekday:
afl %$% table(year, weekday)
## weekday
## year Fri Mon Sat Sun Thu Tue Wed
## 1987 18 7 120 34 0 0 0
## 1988 18 8 99 35 0 0 0
## 1989 15 2 103 38 0 2 0
## 1990 13 6 103 37 0 0 2
## 1991 14 6 101 49 2 0 0
## 1992 15 4 94 58 1 0 0
## 1993 19 5 83 50 0 0 0
## 1994 15 5 90 64 0 0 0
…and so on, down to 2010; I’ve truncated the output to save space.
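If you want to break things down even further, a third variable name is all it takes, though the output gets unwieldy quickly. Here’s a sketch using the game_type variable (output omitted):
# three-way cross-tabulation: one year-by-weekday table per game type
afl %$% table(year, weekday, game_type)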
Back in the dark pre-tidyverse era it was surprisingly difficult to compute descriptive statistics separately for each group. There is a function in base R called aggregate() that lets you do this, but it’s a little cumbersome. The way I usually do this now is with the dplyr functions group_by() and summarise(). Let’s suppose I want to compute the mean attendance at AFL games broken down by who was playing. Conceptually, what I want to do is “group by” two variables, namely home_team and away_team, and “then” for all unique combinations of those variables “summarise” the data by computing the mean attendance. That sounds like a job for %>% …
match_popularity <- afl %>%
group_by(home_team, away_team) %>%
summarise(attend = mean(attendance))
## `summarise()` has grouped output by 'home_team'. You can override using the `.groups` argument.
The call to group_by() creates a “grouped data frame”, which doesn’t look any different from the original except that R now knows what grouping you want! When this grouped data is piped to summarise(), R knows that it is supposed to provide results broken down by the relevant grouping variables. So this is what the command above returns:
match_popularity
## # A tibble: 270 x 3
## # Groups: home_team [17]
## home_team away_team attend
## <chr> <chr> <dbl>
## 1 Adelaide Brisbane 38638.
## 2 Adelaide Carlton 42256.
## 3 Adelaide Collingwood 42205.
## 4 Adelaide Essendon 42570.
## 5 Adelaide Fitzroy 36204
## 6 Adelaide Fremantle 40030.
## 7 Adelaide Geelong 42588.
## 8 Adelaide Hawthorn 41932.
## 9 Adelaide Melbourne 40050.
## 10 Adelaide North Melbourne 42553.
## # … with 260 more rows
In the first row, we see that the average attendance at a match between Adelaide and Brisbane over this period was about 39,000 people.
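Note that summarise() isn’t limited to one statistic at a time. Here’s a sketch (output omitted, and assuming dplyr 1.0 or later for the .groups argument mentioned in the message above) that computes both the mean attendance and the number of games for each pairing:
afl %>%
  group_by(home_team, away_team) %>%
  summarise(
    attend = mean(attendance),  # average crowd for this pairing
    games = n(),                # number of matches played
    .groups = "drop"            # remove the grouping from the result
  )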
The output from our previous command was organised alphabetically, which is less than ideal. It is usually more helpful to sort the data set in a more meaningful way using dplyr::arrange. To sort the match_popularity table in order of increasing attendance, we do this:
match_popularity %>%
arrange(attend)
## # A tibble: 270 x 3
## # Groups: home_team [17]
## home_team away_team attend
## <chr> <chr> <dbl>
## 1 Fitzroy Fremantle 6322
## 2 Fitzroy Brisbane 6787.
## 3 Fitzroy West Coast 7139
## 4 Fitzroy Adelaide 8491
## 5 Brisbane Fitzroy 9426.
## 6 Fitzroy Sydney 9681.
## 7 Fitzroy North Melbourne 10352
## 8 Sydney Fitzroy 10710.
## 9 Fitzroy Melbourne 10890
## 10 Fitzroy Geelong 10951.
## # … with 260 more rows
Clearly, Fitzroy was not a popular team, which is no doubt part of the reason they were merged with Brisbane in 1996. To sort in descending order we can do this:
match_popularity %>%
arrange(-attend)
## # A tibble: 270 x 3
## # Groups: home_team [17]
## home_team away_team attend
## <chr> <chr> <dbl>
## 1 Collingwood Essendon 73243.
## 2 Collingwood Carlton 68955.
## 3 Essendon Collingwood 66756
## 4 Carlton Collingwood 64162.
## 5 Melbourne Collingwood 60705.
## 6 Essendon Carlton 58752.
## 7 Carlton Essendon 55901.
## 8 Collingwood Geelong 55818.
## 9 Richmond Collingwood 55218.
## 10 Richmond Essendon 54450.
## # … with 260 more rows
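As an aside, dplyr also provides the desc() helper, which reads a little more clearly than negating the variable. This version is equivalent:
match_popularity %>%
  arrange(desc(attend))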
Sigh. Collingwood. It’s always Collingwood. Moving on.
How bad was the situation for Fitzroy? Suppose I just want to look at match_popularity for Fitzroy home games. The dplyr::filter command lets you do this:
match_popularity %>%
filter(home_team == "Fitzroy")
## # A tibble: 15 x 3
## # Groups: home_team [1]
## home_team away_team attend
## <chr> <chr> <dbl>
## 1 Fitzroy Adelaide 8491
## 2 Fitzroy Brisbane 6787.
## 3 Fitzroy Carlton 17811
## 4 Fitzroy Collingwood 21677.
## 5 Fitzroy Essendon 16550.
## 6 Fitzroy Fremantle 6322
## 7 Fitzroy Geelong 10951.
## 8 Fitzroy Hawthorn 13399.
## 9 Fitzroy Melbourne 10890
## 10 Fitzroy North Melbourne 10352
## 11 Fitzroy Richmond 11187.
## 12 Fitzroy St Kilda 13565
## 13 Fitzroy Sydney 9681.
## 14 Fitzroy West Coast 7139
## 15 Fitzroy Western Bulldogs 13287.
The filter function can handle more complex logical expressions too. To capture all matches where Fitzroy were playing, regardless of whether they were home or away, a command like this will work:
match_popularity %>%
filter(home_team == "Fitzroy" | away_team == "Fitzroy")
However, because this will return a table with 30 rows, R will truncate the output by default. That’s mildly annoying in this situation. I’d like to see all the results, preferably sorted in a sensible fashion. To sort the output we pipe the results to arrange, exactly as we did previously, and to ensure that all the rows are printed we pipe that to an explicit call to print that specifies the number of rows to show. Here’s the output:
match_popularity %>%
filter(home_team == "Fitzroy" | away_team == "Fitzroy") %>%
arrange(attend) %>%
print(n = 30)
## # A tibble: 30 x 3
## # Groups: home_team [16]
## home_team away_team attend
## <chr> <chr> <dbl>
## 1 Fitzroy Fremantle 6322
## 2 Fitzroy Brisbane 6787.
## 3 Fitzroy West Coast 7139
## 4 Fitzroy Adelaide 8491
## 5 Brisbane Fitzroy 9426.
## 6 Fitzroy Sydney 9681.
## 7 Fitzroy North Melbourne 10352
## 8 Sydney Fitzroy 10710.
## 9 Fitzroy Melbourne 10890
## 10 Fitzroy Geelong 10951.
## 11 Fitzroy Richmond 11187.
## 12 Western Bulldogs Fitzroy 12935.
## 13 Fitzroy Western Bulldogs 13287.
## 14 Fitzroy Hawthorn 13399.
## 15 Fitzroy St Kilda 13565
## 16 North Melbourne Fitzroy 15861.
## 17 Hawthorn Fitzroy 15871.
## 18 Fitzroy Essendon 16550.
## 19 Fitzroy Carlton 17811
## 20 Geelong Fitzroy 18256.
## 21 St Kilda Fitzroy 18263.
## 22 Carlton Fitzroy 19742.
## 23 Melbourne Fitzroy 20073
## 24 West Coast Fitzroy 20738.
## 25 Fremantle Fitzroy 21324.
## 26 Fitzroy Collingwood 21677.
## 27 Essendon Fitzroy 22906.
## 28 Collingwood Fitzroy 22911.
## 29 Richmond Fitzroy 23007.
## 30 Adelaide Fitzroy 36204
Poor Fitzroy 😭
Many data analysis situations present you with a lot of variables to work with, and you might need to select a few variables to look at. It’s not such a big deal for the AFL data, as there are only 12 variables in the raw data, but even there the printed tibble doesn’t show us all of them. Let’s suppose we have a simple problem: we want to extract the fixture for round 1 of the 1994 season. That is, the only things we want are the names of the teams playing, the venue and the weekday:
afl %>%
filter(year == 1994 & round == 1) %>%
select(home_team, away_team, venue, weekday)
## # A tibble: 7 x 4
## home_team away_team venue weekday
## <chr> <chr> <chr> <chr>
## 1 Western Bulldogs Richmond Western Oval Sat
## 2 Essendon West Coast MCG Sat
## 3 Collingwood Fitzroy Victoria Park Sat
## 4 St Kilda Hawthorn Waverley Park Sat
## 5 Brisbane Sydney Gabba Sun
## 6 Adelaide Carlton AAMI Stadium Sun
## 7 Melbourne Geelong MCG Sun
Well that was easy!
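As a side note, select() also understands a handful of helper functions that match columns by name. For instance, this sketch grabs every column whose name ends in “_team” (output omitted):
afl %>%
  select(ends_with("_team"))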
Very often the raw data you are given doesn’t contain the actual quantity of interest, and you might need to compute it from the other variables. The mutate function allows you to create new variables within a data frame:
afl %>%
mutate(margin = abs(home_score - away_score)) %>%
select(home_score, away_score, margin)
## # A tibble: 4,296 x 3
## home_score away_score margin
## <dbl> <dbl> <dbl>
## 1 104 137 33
## 2 62 121 59
## 3 104 149 45
## 4 74 165 91
## 5 128 89 39
## 6 101 102 1
## 7 133 119 14
## 8 92 59 33
## 9 131 150 19
## 10 107 114 7
## # … with 4,286 more rows
where the select command is just for visual clarity in the output! As always, remember that if you want to store the new variable you need to assign the result to a variable.
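For example, to keep the margin variable around for later use I would overwrite afl like this:
afl <- afl %>%
  mutate(margin = abs(home_score - away_score))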
¹ A quick explanation if you’re unfamiliar with the :: operator. It’s a tool you can use to refer to a function inside a package regardless of whether that package is loaded. So readr::read_csv means “use the read_csv function from the readr package”. In this instance readr is loaded via tidyverse, so my actual command doesn’t use ::. However, what you see a lot of people doing in textbooks or online is using this notation as a shorthand way of telling people where the function is located. The further we go into these notes, the more likely I am to start doing that. It’s a pretty convenient way of talking about these things!
² Note that in this output the weekdays are printed alphabetically rather than in a sensible order. That’s because I imported the data straight from a CSV file and I haven’t taken the time to convert weekday to a factor (see the data types section). You can reorder the levels of a factor however you like. I wrote my own function lsr::permuteLevels that does this; there’s also the forcats::fct_relevel function and the simpler relevel function in base R that help with this. I’ll talk more about this later.
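For what it’s worth, here’s a minimal sketch of the forcats approach, which moves the named levels to the front of the ordering, in the order given:
afl$weekday <- forcats::fct_relevel(
  afl$weekday, "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
)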