describing-data.utf8

Out of every hundred people,
those who always know better:
fifty-two.

Unsure of every step:
almost all the rest.

- Wislawa Szymborska, A Word on Statistics (translated from the Polish by Joanna Trzeciak)

In this section I’m going to discuss a few methods for computing different kinds of descriptive statistics in R. I won’t talk about data visualisation at all until the next section. In the prelude to data section I referred to the AFL data set that contains some information about every game played in the Australian Rules Football (AFL) league between 1987 and 2010, available in the afl.csv file. As usual, I have tidyverse loaded, so I’ll use readr::read_csv¹ to import the data:

afl <- read_csv("./data/afl.csv")
afl

## # A tibble: 4,296 x 12
##    home_team away_team home_score away_score  year round weekday
##    <chr>     <chr>          <dbl>      <dbl> <dbl> <dbl> <chr>  
##  1 North Me… Brisbane         104        137  1987     1 Fri    
##  2 Western … Essendon          62        121  1987     1 Sat    
##  3 Carlton   Hawthorn         104        149  1987     1 Sat    
##  4 Collingw… Sydney            74        165  1987     1 Sat    
##  5 Melbourne Fitzroy          128         89  1987     1 Sat    
##  6 St Kilda  Geelong          101        102  1987     1 Sat    
##  7 West Coa… Richmond         133        119  1987     1 Sun    
##  8 North Me… Melbourne         92         59  1987     2 Fri    
##  9 Geelong   Brisbane         131        150  1987     2 Sat    
## 10 Collingw… Carlton          107        114  1987     2 Sat    
## # … with 4,286 more rows, and 5 more variables: day <dbl>,
## #   month <dbl>, game_type <chr>, venue <chr>, attendance <dbl>

This data frame has 4296 cases and 12 variables, so I definitely don’t want to be summarising anything manually!

14.1 Simple summaries

When summarising a variable, there are a variety of different measures that one might want to calculate. Means, medians, standard deviations, skewness, etc. There are so many of these functions out there that I’d never be able to capture all of them briefly, even restricting ourselves to numeric variables. Here is a quick overview of the most common statistics:

mean() calculates the arithmetic mean. If you want a trimmed mean it has an optional trim argument
median() calculates the median; more generally, the quantile() function computes arbitrary quantiles, specified with the probs argument
min() and max() are the minimum and maximum; IQR() returns the interquartile range and range() returns the range
sd() calculates the standard deviation (and var() computes the variance)
psych::skew() and psych::kurtosi() compute skewness and kurtosis respectively

For example, to return the mean and standard deviations for the home_score variable I would use these commands:

mean(afl$home_score)

## [1] 101.5084

sd(afl$home_score)

## [1] 29.65969

It’s worth noting that many of these functions have an na.rm argument (or similar) that specifies how R will handle missing data (i.e., NA values). Quite often, the default is to set na.rm = FALSE meaning that R will not remove missing cases. That can be useful as a way of discovering that your data has missing values: if one of the values is missing then by definition the mean of all your values must also be missing!

age <- c(21, 19, 39, NA, 56)
mean(age)

## [1] NA

In practice though it’s often more an annoyance than an aid, so you’ll often want to calculate the mean like this:

mean(age, na.rm = TRUE)

## [1] 33.75

14.2 Correlations

For a pair of numeric variables, we often want to report the strength and direction of correlations. The cor function does this, and by default it reports the Pearson correlation:

cor(afl$home_score, afl$away_score)

## [1] -0.1413139

As an aside, I don’t like writing the name of the data frame twice within the same command. It’s a visual distraction and (in my opinion!) it makes the code harder to read. The magrittr package solves this for us. In addition to being the origin of the standard R pipe %>%, it contains the %$% operator that “exposes” the names of the variables inside a data frame to another function. So lets load the package:

library(magrittr)

Now here’s how I would compute the same correlation using %$%:

afl %$% cor(home_score, away_score)

## [1] -0.1413139

In one sense the difference is purely aesthetic, but aesthetics really do matter when analysing data. In the piped version, I have the name of the data set on the left, and the thing that I’m calculating from it on the right. That feels intuitive and “normal” to me. The original command mixes everything together and I find myself struggling to read it properly.

Regardless of which version you prefer, you’ll be pleased to know that the cor function has a method argument that lets you specify whether to calculate the pearson correlation (the default), the spearman rank-sum correlation, or the kendall tau measure. For instance:

afl %$% cor(home_score, away_score, method = "spearman")

## [1] -0.1406248

A couple of other things to mention about correlations.

The cor.test function allows you to conduct a hypothesis test for whether a correlation is statistically significant.
The cor function can take a matrix or data frame as input, but it will produce an error if any of the variables are non-numeric. I’ve used the lsr::correlate function to make my life a little easier in this respect, so if I want a correlation matrix that only considers the numeric variables lsr::correlate(afl) works
The cor function doesn’t handle situations where one or more variables are categorical. The psych package has a variety of tools that will solve this for you. It includes functions tetrachoric, polychoric, biserial and polyserial to handle those situations. I won’t describe those here - I just want to mention them in case you need them later!

14.3 Frequency tables

For categorical variables, where you usually want a table of frequencies that tallies the number of times each value appears, the table function is your friend:²

afl %$% table(weekday)

## weekday
##  Fri  Mon  Sat  Sun  Thu  Tue  Wed 
##  514   98 2226 1421   25    6    6

Not surprisingly, most football games are played on the weekend. The table function is quite flexible, and allows you to compute cross-tabulations between two, three, or more variables simply by adding more variable names. As an example, here is the cross tabulation of year by weekday:

afl %$% table(year, weekday)

##       weekday
## year   Fri Mon Sat Sun Thu Tue Wed
##   1987  18   7 120  34   0   0   0
##   1988  18   8  99  35   0   0   0
##   1989  15   2 103  38   0   2   0
##   1990  13   6 103  37   0   0   2
##   1991  14   6 101  49   2   0   0
##   1992  15   4  94  58   1   0   0
##   1993  19   5  83  50   0   0   0
##   1994  15   5  90  64   0   0   0
BLAH BLAH BLAH

14.4 Summarising by group

Back in the dark pre-tidyverse era it was surprisingly difficult to compute descriptive statistics separately for each group. There is a function in base R called aggregate() that lets you do this, but it’s a little cumbersome. The way I usually do this now is with the dplyr functions group_by() and summarise(). Let’s suppose I want to compute the mean attendance at AFL games broken down by who was playing. Conceptually, what I want to do is “group by” two variables namely home_team and away_team, and “then” for all unique combinations of those variables “summarise” the data by computing the mean attendance. That sounds like a job for %>%…

match_popularity <- afl %>%
  group_by(home_team, away_team) %>%
  summarise(attend = mean(attendance))

## `summarise()` has grouped output by 'home_team'. You can override using the `.groups` argument.

The call to group_by() creates a “grouped data frame” which doesn’t look any different to the original other than that R now knows what grouping you want! When this grouped data is piped to summarise(), R knows that it is supposed to provide results broken down by the relevant grouping variables. So this is what the command above returns:

match_popularity

## # A tibble: 270 x 3
## # Groups:   home_team [17]
##    home_team away_team       attend
##    <chr>     <chr>            <dbl>
##  1 Adelaide  Brisbane        38638.
##  2 Adelaide  Carlton         42256.
##  3 Adelaide  Collingwood     42205.
##  4 Adelaide  Essendon        42570.
##  5 Adelaide  Fitzroy         36204 
##  6 Adelaide  Fremantle       40030.
##  7 Adelaide  Geelong         42588.
##  8 Adelaide  Hawthorn        41932.
##  9 Adelaide  Melbourne       40050.
## 10 Adelaide  North Melbourne 42553.
## # … with 260 more rows

In the first row, we see that the average attendance at a match between Adelaide and Brisbane over this period was about 39,000 people.

14.5 Arranging data

The output from our previous command organised the output alphabetically, which is less than ideal. It is usually more helpful to sort the data set in a more meaningful way using dplyr::arrange. To sort the match_popularity table in order of increasing attendance, we do this:

match_popularity %>%
  arrange(attend)

## # A tibble: 270 x 3
## # Groups:   home_team [17]
##    home_team away_team       attend
##    <chr>     <chr>            <dbl>
##  1 Fitzroy   Fremantle        6322 
##  2 Fitzroy   Brisbane         6787.
##  3 Fitzroy   West Coast       7139 
##  4 Fitzroy   Adelaide         8491 
##  5 Brisbane  Fitzroy          9426.
##  6 Fitzroy   Sydney           9681.
##  7 Fitzroy   North Melbourne 10352 
##  8 Sydney    Fitzroy         10710.
##  9 Fitzroy   Melbourne       10890 
## 10 Fitzroy   Geelong         10951.
## # … with 260 more rows

Clearly, Fitzroy was not a popular team, which would be the reason they were merged with Brisbane in 1996. To sort in descending order we can do this:

match_popularity %>%
  arrange(-attend)

## # A tibble: 270 x 3
## # Groups:   home_team [17]
##    home_team   away_team   attend
##    <chr>       <chr>        <dbl>
##  1 Collingwood Essendon    73243.
##  2 Collingwood Carlton     68955.
##  3 Essendon    Collingwood 66756 
##  4 Carlton     Collingwood 64162.
##  5 Melbourne   Collingwood 60705.
##  6 Essendon    Carlton     58752.
##  7 Carlton     Essendon    55901.
##  8 Collingwood Geelong     55818.
##  9 Richmond    Collingwood 55218.
## 10 Richmond    Essendon    54450.
## # … with 260 more rows

Sigh. Collingwood. It’s always Collingwood. Moving on.

14.6 Filtering cases

How bad was the situation for Fitzroy? Suppose I just want to look at match_popularity for Fitzroy home games. The dplyr::filter command lets you do this:

match_popularity %>%
  filter(home_team == "Fitzroy")

## # A tibble: 15 x 3
## # Groups:   home_team [1]
##    home_team away_team        attend
##    <chr>     <chr>             <dbl>
##  1 Fitzroy   Adelaide          8491 
##  2 Fitzroy   Brisbane          6787.
##  3 Fitzroy   Carlton          17811 
##  4 Fitzroy   Collingwood      21677.
##  5 Fitzroy   Essendon         16550.
##  6 Fitzroy   Fremantle         6322 
##  7 Fitzroy   Geelong          10951.
##  8 Fitzroy   Hawthorn         13399.
##  9 Fitzroy   Melbourne        10890 
## 10 Fitzroy   North Melbourne  10352 
## 11 Fitzroy   Richmond         11187.
## 12 Fitzroy   St Kilda         13565 
## 13 Fitzroy   Sydney            9681.
## 14 Fitzroy   West Coast        7139 
## 15 Fitzroy   Western Bulldogs 13287.

The filter function can handle more complex logical expressions too. To capture all matches where Fitzroy were playing, regardless of whether they were home or away, a command like this will work:

match_popularity %>%
  filter(home_team=="Fitzroy" | away_team =="Fitzroy")

However, because this will return a table with 30 rows, R will truncate the output by default. That’s mildly annoying in this situation. I’d like to see all the results, preferably sorted in a sensible fashion. To sort the outout we pipe the results to arrange, exactly as we did previously, and to ensure that all the rows are printed we can pipe that to an explicit call to print that specifies the number of rows to show. Here’s the output:

match_popularity %>%
  filter(home_team=="Fitzroy" | away_team =="Fitzroy") %>%
  arrange(attend) %>%
  print(n = 30)

## # A tibble: 30 x 3
## # Groups:   home_team [16]
##    home_team        away_team        attend
##    <chr>            <chr>             <dbl>
##  1 Fitzroy          Fremantle         6322 
##  2 Fitzroy          Brisbane          6787.
##  3 Fitzroy          West Coast        7139 
##  4 Fitzroy          Adelaide          8491 
##  5 Brisbane         Fitzroy           9426.
##  6 Fitzroy          Sydney            9681.
##  7 Fitzroy          North Melbourne  10352 
##  8 Sydney           Fitzroy          10710.
##  9 Fitzroy          Melbourne        10890 
## 10 Fitzroy          Geelong          10951.
## 11 Fitzroy          Richmond         11187.
## 12 Western Bulldogs Fitzroy          12935.
## 13 Fitzroy          Western Bulldogs 13287.
## 14 Fitzroy          Hawthorn         13399.
## 15 Fitzroy          St Kilda         13565 
## 16 North Melbourne  Fitzroy          15861.
## 17 Hawthorn         Fitzroy          15871.
## 18 Fitzroy          Essendon         16550.
## 19 Fitzroy          Carlton          17811 
## 20 Geelong          Fitzroy          18256.
## 21 St Kilda         Fitzroy          18263.
## 22 Carlton          Fitzroy          19742.
## 23 Melbourne        Fitzroy          20073 
## 24 West Coast       Fitzroy          20738.
## 25 Fremantle        Fitzroy          21324.
## 26 Fitzroy          Collingwood      21677.
## 27 Essendon         Fitzroy          22906.
## 28 Collingwood      Fitzroy          22911.
## 29 Richmond         Fitzroy          23007.
## 30 Adelaide         Fitzroy          36204

Poor Fitzroy 😭

14.7 Selecting variables

Many data analysis situations present you with a lot of variables to work with, and you might need to select a few variables to look at. It’s not such a big deal for the AFL data as there are only 12 variables in the raw data, but even there when we print out the tibble it doesn’t show us all the cases. Let’s suppose we have simple problem where what we want to do is extract the fixture for round 1 of the 1994 season. That is, the only thing we want is the name of the teams playing, the venue and the weekday:

afl %>%
  filter(year == 1994 & round == 1) %>%
  select(home_team, away_team, venue, weekday)

## # A tibble: 7 x 4
##   home_team        away_team  venue         weekday
##   <chr>            <chr>      <chr>         <chr>  
## 1 Western Bulldogs Richmond   Western Oval  Sat    
## 2 Essendon         West Coast MCG           Sat    
## 3 Collingwood      Fitzroy    Victoria Park Sat    
## 4 St Kilda         Hawthorn   Waverley Park Sat    
## 5 Brisbane         Sydney     Gabba         Sun    
## 6 Adelaide         Carlton    AAMI Stadium  Sun    
## 7 Melbourne        Geelong    MCG           Sun

Well that was easy!

14.8 New variables with mutate

Very often the raw data you are given doesn’t contain the actual quantity of iteres, and you might need to compute it from the other variables. The mutate function allows you to create new variables within a data frame:

afl %>%
  mutate(margin = abs(home_score - away_score)) %>%
  select(home_score, away_score, margin)

## # A tibble: 4,296 x 3
##    home_score away_score margin
##         <dbl>      <dbl>  <dbl>
##  1        104        137     33
##  2         62        121     59
##  3        104        149     45
##  4         74        165     91
##  5        128         89     39
##  6        101        102      1
##  7        133        119     14
##  8         92         59     33
##  9        131        150     19
## 10        107        114      7
## # … with 4,286 more rows

where the select command is just for visual clarity in the output! As always, remember that if you want to store the new variable you need to assign the result to a variable.

A quick explanation if you’re unfamiliar with the :: operator. It’s a tool you can use to referring to a function inside a package regardless of whether that package is loaded. So readr::read_csv means “use the read_csv function in the readr package”. In this instance readr is loaded via tidyverse so my actual command doesn’t use ::. However, what you see a lot of people doing in textbooks or online is use this notation as a shorthand way to tell people where the function is located. The further we go into these notes the more likely I am to start doing that. It’s a pretty convenient way of talking about these things!↩︎
Note that in this output the weekdays are printed alphabetically rather than in a sensible order. That’s because I imported the data straight from a CSV file and I haven’t taken the time to convery weekday to a factor (see the data types section). You can reorder the levels of a factor however you like. I wrote my own function lsr::permuteLevels that does this; there’s also the forcats::fct_relevel function and the simpler relevel function in base R that helps with this. I’ll talk more about this later.↩︎