Data manipulation isn’t a particularly well-defined topic. I take it to cover a few things:

  • Importing and exporting data
  • Cleaning up errors in data
  • Reorganising data

Earlier sections of the notes covered variations of this:

  • Reading from and writing to CSV files and .RData files
  • Creating data frames, tibbles and factors from other input
  • Using filter and select to extract subsets of a data frame
  • Using mutate to create or alter columns in a data frame
  • Using arrange to sort a data frame

16.1 An illustrative problem

In the data visualisation section we used a data set on inductive reasoning, stored in the frames_ex2.csv file. Here it is again:

frames <- read_csv("./data/frames_ex2.csv")
frames
## # A tibble: 4,725 x 8
##       id gender   age condition sample_size n_obs test_item
##    <dbl> <chr>  <dbl> <chr>     <chr>       <dbl>     <dbl>
##  1     1 male      36 category  small           2         1
##  2     1 male      36 category  small           2         2
##  3     1 male      36 category  small           2         3
##  4     1 male      36 category  small           2         4
##  5     1 male      36 category  small           2         5
##  6     1 male      36 category  small           2         6
##  7     1 male      36 category  small           2         7
##  8     1 male      36 category  medium          6         1
##  9     1 male      36 category  medium          6         2
## 10     1 male      36 category  medium          6         3
## # … with 4,715 more rows, and 1 more variable: response <dbl>

In this file the data are stored in long form: there is one row for every trial in the experiment, so the data from each participant are spread across several rows. I stored the data in that format because it was handy for drawing pictures, but as it happens this isn’t how the data set was stored when I extracted it from the OSF repository in which I’d deposited it only a few months earlier! Here’s what it looked like when I found it:

wide_frames <- read_csv("./data/frames_ex2_wide.csv")
wide_frames
## # A tibble: 225 x 25
##       ID Gender   Age Sampling `SS2-R1` `SS2-R2` `SS2-R3`
##    <dbl> <chr>  <dbl> <chr>       <dbl>    <dbl>    <dbl>
##  1     1 male      36 category        8        7        6
##  2     2 male      46 category        7        5        5
##  3     3 female    33 property        8        7        7
##  4     4 female    71 property        6        7        6
##  5     5 female    23 property        9        9        9
##  6     6 female    31 category        7        7        7
##  7     7 male      23 property        6        7        5
##  8     8 female    31 property        4        4        2
##  9     9 female    37 category        9        9        7
## 10    10 female    46 category        8        6        4
## # … with 215 more rows, and 18 more variables: SS2-R4 <dbl>,
## #   SS2-R5 <dbl>, SS2-R6 <dbl>, SS2-R7 <dbl>, SS6-R1 <dbl>,
## #   SS6-R2 <dbl>, SS6-R3 <dbl>, SS6-R4 <dbl>, SS6-R5 <dbl>,
## #   SS6-R6 <dbl>, SS6-R7 <dbl>, SS12-R1 <dbl>, SS12-R2 <dbl>,
## #   SS12-R3 <dbl>, SS12-R4 <dbl>, SS12-R5 <dbl>, SS12-R6 <dbl>,
## #   SS12-R7 <dbl>

In this file the data are stored in wide form: the data frame contains only one row per person, and each judgment that person made is stored as a separate column. Both forms are useful. The long form was handy for drawing pictures, while the wide form is how the data arrived in the first place. So how do we switch between them, “reshaping” the data into the form we need?

Once upon a time this was a difficult task, so much so that I wrote my own (not very good) functions to solve the problem in a way that didn’t expose my students to some of the more frustrating aspects of data manipulation. Thankfully, this has changed. There is a little package distributed with the tidyverse called tidyr, and it is remarkably effective at solving a range of reshaping problems in a (fairly) intuitive way.

16.2 Reshaping a data frame

To convert wide_frames to long form we can use the tidyr gather() function. We give it the name of a “key” column that will record which judgment each row refers to, the name of a “value” column that will hold the responses themselves, and then list the columns to be gathered:

long_frames <- wide_frames %>% 
  gather(key = "query", value="response",
         "SS2-R1", "SS2-R2","SS2-R3","SS2-R4","SS2-R5","SS2-R6","SS2-R7",
         "SS6-R1","SS6-R2","SS6-R3","SS6-R4","SS6-R5","SS6-R6","SS6-R7", 
         "SS12-R1","SS12-R2","SS12-R3","SS12-R4","SS12-R5","SS12-R6","SS12-R7")
long_frames
## # A tibble: 4,725 x 6
##       ID Gender   Age Sampling query  response
##    <dbl> <chr>  <dbl> <chr>    <chr>     <dbl>
##  1     1 male      36 category SS2-R1        8
##  2     2 male      46 category SS2-R1        7
##  3     3 female    33 property SS2-R1        8
##  4     4 female    71 property SS2-R1        6
##  5     5 female    23 property SS2-R1        9
##  6     6 female    31 category SS2-R1        7
##  7     7 male      23 property SS2-R1        6
##  8     8 female    31 property SS2-R1        4
##  9     9 female    37 category SS2-R1        9
## 10    10 female    46 category SS2-R1        8
## # … with 4,715 more rows

To go back the other way we can use spread(), again specifying the key and value columns:

long_frames %>% spread(key = "query", value = "response")
## # A tibble: 225 x 25
##       ID Gender   Age Sampling `SS12-R1` `SS12-R2` `SS12-R3`
##    <dbl> <chr>  <dbl> <chr>        <dbl>     <dbl>     <dbl>
##  1     1 male      36 category         8         7         6
##  2     2 male      46 category         9         9         8
##  3     3 female    33 property         8         9         9
##  4     4 female    71 property         9         8         3
##  5     5 female    23 property         9         9         9
##  6     6 female    31 category         9         9         8
##  7     7 male      23 property         8         7         6
##  8     8 female    31 property         8         8         7
##  9     9 female    37 category         9         8         1
## 10    10 female    46 category         9         9         8
## # … with 215 more rows, and 18 more variables: SS12-R4 <dbl>,
## #   SS12-R5 <dbl>, SS12-R6 <dbl>, SS12-R7 <dbl>, SS2-R1 <dbl>,
## #   SS2-R2 <dbl>, SS2-R3 <dbl>, SS2-R4 <dbl>, SS2-R5 <dbl>,
## #   SS2-R6 <dbl>, SS2-R7 <dbl>, SS6-R1 <dbl>, SS6-R2 <dbl>,
## #   SS6-R3 <dbl>, SS6-R4 <dbl>, SS6-R5 <dbl>, SS6-R6 <dbl>,
## #   SS6-R7 <dbl>
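
A side note: in more recent versions of tidyr (1.0.0 and later), gather() and spread() have been superseded by pivot_longer() and pivot_wider(). The code above still works, but if you prefer the newer functions the same reshaping might look like this sketch (the row order comes out a little different, but the content is the same):

long_frames2 <- wide_frames %>%
  pivot_longer(
    cols = -c(ID, Gender, Age, Sampling),  # reshape everything except the person-level columns
    names_to = "query",
    values_to = "response"
  )
wide_frames2 <- long_frames2 %>%
  pivot_wider(names_from = "query", values_from = "response")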

16.3 Splitting a variable

The long_frames data frame we created is close to what we need, but the query variable isn’t quite right. A value of "SS2-R1" corresponds to a trial on which the sample size was 2 and the test item was 1, so this should really be two variables: one specifying the sample_size and the other specifying the test_item. The tidyr separate() function handles this, splitting one column into several at a separator string:

long_frames %>% 
  separate(
    col = query, 
    into = c("sample_size","test_item"), 
    sep = "-R"
  )
## # A tibble: 4,725 x 7
##       ID Gender   Age Sampling sample_size test_item response
##    <dbl> <chr>  <dbl> <chr>    <chr>       <chr>        <dbl>
##  1     1 male      36 category SS2         1                8
##  2     2 male      46 category SS2         1                7
##  3     3 female    33 property SS2         1                8
##  4     4 female    71 property SS2         1                6
##  5     5 female    23 property SS2         1                9
##  6     6 female    31 category SS2         1                7
##  7     7 male      23 property SS2         1                6
##  8     8 female    31 property SS2         1                4
##  9     9 female    37 category SS2         1                9
## 10    10 female    46 category SS2         1                8
## # … with 4,715 more rows

Getting closer. Note that there is a unite() function that reverses this operation, so if you ever need to collapse multiple columns into a single variable that’s the way to do it (there’s a brief sketch after the next code block). It’s also worth noting that the two new variables aren’t quite in the format we want. For instance, the test_item variable should be numeric rather than character, so let’s mutate that and store the result:

long_frames <- long_frames %>% 
  separate(
    col = query, 
    into = c("sample_size","test_item"), 
    sep = "-R"
  ) %>%
  mutate(test_item = as.numeric(test_item))
long_frames
## # A tibble: 4,725 x 7
##       ID Gender   Age Sampling sample_size test_item response
##    <dbl> <chr>  <dbl> <chr>    <chr>           <dbl>    <dbl>
##  1     1 male      36 category SS2                 1        8
##  2     2 male      46 category SS2                 1        7
##  3     3 female    33 property SS2                 1        8
##  4     4 female    71 property SS2                 1        6
##  5     5 female    23 property SS2                 1        9
##  6     6 female    31 category SS2                 1        7
##  7     7 male      23 property SS2                 1        6
##  8     8 female    31 property SS2                 1        4
##  9     9 female    37 category SS2                 1        9
## 10    10 female    46 category SS2                 1        8
## # … with 4,715 more rows
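
As mentioned above, here is a brief sketch of unite() undoing that separate() step: it pastes the sample_size and test_item columns back together using "-R" as the separator, recreating the original query column:

long_frames %>%
  unite(col = "query", sample_size, test_item, sep = "-R")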

16.4 Recoding variables

16.4.1 Using recode()

What should we do with the sample_size variable? The values "SS2", "SS6" and "SS12" aren’t very meaningful on their own, so one option is to use dplyr::recode() to map each old value onto a new, more readable one:

long_frames <- long_frames %>%
  mutate(sample_size = dplyr::recode(
    sample_size, 
    "SS2" = "small", 
    "SS6" = "medium", 
    "SS12" = "large")
  )
long_frames
## # A tibble: 4,725 x 7
##       ID Gender   Age Sampling sample_size test_item response
##    <dbl> <chr>  <dbl> <chr>    <chr>           <dbl>    <dbl>
##  1     1 male      36 category small               1        8
##  2     2 male      46 category small               1        7
##  3     3 female    33 property small               1        8
##  4     4 female    71 property small               1        6
##  5     5 female    23 property small               1        9
##  6     6 female    31 category small               1        7
##  7     7 male      23 property small               1        6
##  8     8 female    31 property small               1        4
##  9     9 female    37 category small               1        9
## 10    10 female    46 category small               1        8
## # … with 4,715 more rows
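
One thing worth knowing about recode(): when the input is a character vector, any values you don’t explicitly list are passed through unchanged, and you can supply a .default argument if you’d rather catch them all. A tiny sketch with a made-up value to illustrate:

dplyr::recode(c("SS2", "SS6", "SS99"), "SS2" = "small", "SS6" = "medium")
# "SS99" has no match listed, so it should come back unchanged:
# "small" "medium" "SS99"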

16.4.2 Using case_when()

An alternative is case_when(), which works through a sequence of logical conditions and returns the value attached to the first condition that is true. Here I’ll use it to add an n_obs variable giving the numeric sample size:

long_frames <- long_frames %>%
  mutate(n_obs = case_when(
    sample_size == "small" ~ 2,
    sample_size == "medium" ~ 6,
    sample_size == "large" ~ 12
  ))
long_frames
## # A tibble: 4,725 x 8
##       ID Gender   Age Sampling sample_size test_item response
##    <dbl> <chr>  <dbl> <chr>    <chr>           <dbl>    <dbl>
##  1     1 male      36 category small               1        8
##  2     2 male      46 category small               1        7
##  3     3 female    33 property small               1        8
##  4     4 female    71 property small               1        6
##  5     5 female    23 property small               1        9
##  6     6 female    31 category small               1        7
##  7     7 male      23 property small               1        6
##  8     8 female    31 property small               1        4
##  9     9 female    37 category small               1        9
## 10    10 female    46 category small               1        8
## # … with 4,715 more rows, and 1 more variable: n_obs <dbl>

Now we have a data frame that is almost identical to the one we imported in the previous section. Apart from some cosmetic changes to variable names and order, the job is complete!
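
If we wanted to polish it off, a sketch of those cosmetic steps might use rename() and select() (the new names are simply chosen to match the frames data shown earlier):

long_frames %>%
  rename(id = ID, gender = Gender, age = Age, condition = Sampling) %>%
  select(id, gender, age, condition, sample_size, n_obs, test_item, response)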

16.5 Processing dates

Empirical data sets often include date fields. Sometimes these dates are nicely formatted and easy to work with. Most of the time they are not. Even if they were, dates are inherently painful: a year isn’t a perfect multiple of days, so a solar calendar needs leap years and other corrections. The year doesn’t divide evenly into 12 months, so months have a variable number of days and doing “addition” with months is really weird. The week days don’t distribute themselves cleanly across months or years, so those are a headache too 🤕. It’s not purely an issue of nomenclature. We organise our lives around seasons, weekdays/weekends, day/night and various other cycles that don’t always play nicely with one another. The weirdness of dates reflects the structure of the world and our lives, and data analysts can’t pretend that this complexity isn’t there. To make things more annoying there are cultural and linguistic variations even within the English-speaking world, which adds another source of ambiguity. Does 9/11 refer to September 11 or November 9? It depends on where you live 🤷. Because of all this, date specification is hard. The Date class in R provides a relatively decent way to handle this kind of data, and the lubridate package (another tidyverse one!) provides very helpful tools to make it easier. It doesn’t load automatically, so let’s load it now:

library(lubridate)

16.5.1 Getting started

Date information is often specified in terms of “year”, “month” and “day”, though not always in that order. lubridate provides a collection of functions – ymd(), mdy(), dmy() and so on – whose names specify the order of the components; they are fairly smart and able to parse input in many different formats. As long as you know the order of the information, these functions work really well:

ymd(20101215)
ymd("2010/12/15")
ymd("2010 December 15")
## [1] "2010-12-15"
## [1] "2010-12-15"
## [1] "2010-12-15"

Despite the variations in formatting, the function works out that these all correspond to the same date and returns a variable of class Date. This gives you quite a bit of flexibility:

ymd(20101215) == mdy("12/15/10")
## [1] TRUE

Nice! The functions are smart enough to handle strings and numbers as input, and if the year is ambiguous (as with the “10” above) they assume a 21st century date. That said, they do rely on you knowing the correct order. The same numeric input, parsed with a different assumed ordering…

dmy(20101215)
## [1] "1215-10-20"

… might indeed correspond to 20th October 1215. This might be plausible if the data set somehow pertained to the Magna Carta, but probably not for most settings of interest to psychologists!
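
While we’re here, it’s worth knowing that once a value has class Date, lubridate also provides accessor functions for pulling out its components. A quick sketch:

d <- ymd("2010/12/15")
year(d)                  # should give 2010
month(d, label = TRUE)   # should give Dec
wday(d, label = TRUE)    # should give Wed (15 December 2010 was a Wednesday)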

16.5.2 A real world example

To give you a sense of how annoying working with dates can be, the following is a real example. I asked my partner – who in real life is a statistics consultant among other things – to send me a data set she’d been sent with poorly formatted date information. This is what she sent me:

oddball <- readLines("./data/oddballdate2.csv")
oddball
##  [1] "4/4/16 9:00"   "4/4/16 14:13"  "6/4/16 5:55"  
##  [4] "6/4/16 11:27"  "8/4/16 15:04"  "12/4/16 12:29"
##  [7] "14/4/16 12:14" "14/4/16 19:16" "18/4/16 16:18"
## [10] "20/4/16 11:10" "28/4/16 16:10" "28/4/16"      
## [13] "1/5/16 17:11"  "2/5/16 14:53"  "3/5/16 10:32" 
## [16] "3/5/16 12:46"  "3/5/16 13:08"  "3/5/16 15:51" 
## [19] "4/5/16 11:04"  "4/5/16 11:13"  "4/5/16 11:16" 
## [22] "4/5/16 12:54"  "4/5/16 14:40"  "4/5/16 20:16" 
## [25] "6/5/16 15:13"  "10/5/16 11:17" "11/5/16 12:41"
## [28] "12/5/16"       "12/5/16"       "5/13/16"      
## [31] "16/5/16 10:11" "16/5/16 22:47" "17/5/16 15:20"
## [34] "18/5/16"       "18/5/16 10:32" "19/5/16 2:27" 
## [37] "19/5/16 10:59" "19/5/16"

To keep this example minimal, I’ve used readLines() to import the data as a single character vector. As you can see this is pretty inconsistent, and my partner’s face when I turned to her in horror can only be described with an emoji: 😈

A brief look suggests that most of the entries include a time stamp as well as a date, so the dmy_hm() function should be able to parse them. Others are date only, so dmy() should work for those. If I try each function on the whole vector, this is what happens:

dmy(oddball)
## Warning: 33 failed to parse.
##  [1] NA           NA           NA           NA          
##  [5] NA           NA           NA           NA          
##  [9] NA           NA           NA           "2016-04-28"
## [13] NA           NA           NA           NA          
## [17] NA           NA           NA           NA          
## [21] NA           NA           NA           NA          
## [25] NA           NA           NA           "2016-05-12"
## [29] "2016-05-12" NA           NA           NA          
## [33] NA           "2016-05-18" NA           NA          
## [37] NA           "2016-05-19"
dmy_hm(oddball)
## Warning: 6 failed to parse.
##  [1] "2016-04-04 09:00:00 UTC" "2016-04-04 14:13:00 UTC"
##  [3] "2016-04-06 05:55:00 UTC" "2016-04-06 11:27:00 UTC"
##  [5] "2016-04-08 15:04:00 UTC" "2016-04-12 12:29:00 UTC"
##  [7] "2016-04-14 12:14:00 UTC" "2016-04-14 19:16:00 UTC"
##  [9] "2016-04-18 16:18:00 UTC" "2016-04-20 11:10:00 UTC"
## [11] "2016-04-28 16:10:00 UTC" NA                       
## [13] "2016-05-01 17:11:00 UTC" "2016-05-02 14:53:00 UTC"
## [15] "2016-05-03 10:32:00 UTC" "2016-05-03 12:46:00 UTC"
## [17] "2016-05-03 13:08:00 UTC" "2016-05-03 15:51:00 UTC"
## [19] "2016-05-04 11:04:00 UTC" "2016-05-04 11:13:00 UTC"
## [21] "2016-05-04 11:16:00 UTC" "2016-05-04 12:54:00 UTC"
## [23] "2016-05-04 14:40:00 UTC" "2016-05-04 20:16:00 UTC"
## [25] "2016-05-06 15:13:00 UTC" "2016-05-10 11:17:00 UTC"
## [27] "2016-05-11 12:41:00 UTC" NA                       
## [29] NA                        NA                       
## [31] "2016-05-16 10:11:00 UTC" "2016-05-16 22:47:00 UTC"
## [33] "2016-05-17 15:20:00 UTC" NA                       
## [35] "2016-05-18 10:32:00 UTC" "2016-05-19 02:27:00 UTC"
## [37] "2016-05-19 10:59:00 UTC" NA

Each function converts the dates it can, inserts NA for the dates it can’t, and loudly complains that something is amiss. That’s pretty reasonable. So one strategy is to parse the data both ways and then use case_when() to take the dmy_hm() result whenever it exists, falling back on the dmy() result otherwise:

d1 <- oddball %>% dmy_hm() %>% date()
d2 <- oddball %>% dmy() 
clean_dates <- case_when(
  is.na(d1) ~ d2, 
  TRUE ~ d1
  )
clean_dates
##  [1] "2016-04-04" "2016-04-04" "2016-04-06" "2016-04-06"
##  [5] "2016-04-08" "2016-04-12" "2016-04-14" "2016-04-14"
##  [9] "2016-04-18" "2016-04-20" "2016-04-28" "2016-04-28"
## [13] "2016-05-01" "2016-05-02" "2016-05-03" "2016-05-03"
## [17] "2016-05-03" "2016-05-03" "2016-05-04" "2016-05-04"
## [21] "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-04"
## [25] "2016-05-06" "2016-05-10" "2016-05-11" "2016-05-12"
## [29] "2016-05-12" NA           "2016-05-16" "2016-05-16"
## [33] "2016-05-17" "2016-05-18" "2016-05-18" "2016-05-19"
## [37] "2016-05-19" "2016-05-19"

Well that almost worked. What’s the one NA value?

oddball[is.na(clean_dates)]
## [1] "5/13/16"

Sigh. Looking at all the other dates, every single one is in April or May… but this last one is ostensibly on the 5th day of the 13th month? Obviously we have one entry formatted in mdy() form. I will leave it as an exercise to the reader to work out how to fix that one 😀
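
(If you want to check your answer against one possible approach, here is a sketch: parse the stragglers with mdy() and fill in only the slots that are still missing.)

d3 <- mdy(oddball[is.na(clean_dates)])
clean_dates[is.na(clean_dates)] <- d3
# clean_dates[30] should now be "2016-05-13"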

16.6 Combining data frames

Another common data manipulation problem arises when we need to combine multiple data frames. Two versions of this problem are worth talking about: binding and joining.

16.6.1 Binding

Binding data frames is simple. Suppose we have multiple data frames that have the same variables, and we want to concatenate them. For instance, here are three small data frames:

nightgarden_top <- read_csv("./data/nightgarden_top.csv")
nightgarden_middle <- read_csv("./data/nightgarden_middle.csv")
nightgarden_bottom <- read_csv("./data/nightgarden_bottom.csv")

Each of these contains a subset of the rows we need:

nightgarden_middle
## # A tibble: 2 x 3
##    line speaker     utterance
##   <dbl> <chr>       <chr>    
## 1     6 tombliboo   oo       
## 2     7 makka-pakka pip

We can use dplyr::bind_rows() to concatenate these data frames vertically:

bind_rows(nightgarden_top, nightgarden_middle, nightgarden_bottom)
## # A tibble: 10 x 3
##     line speaker     utterance
##    <dbl> <chr>       <chr>    
##  1     1 upsy-daisy  pip      
##  2     2 upsy-daisy  pip      
##  3     3 upsy-daisy  onk      
##  4     4 upsy-daisy  onk      
##  5     5 tombliboo   ee       
##  6     6 tombliboo   oo       
##  7     7 makka-pakka pip      
##  8     8 makka-pakka pip      
##  9     9 makka-pakka onk      
## 10    10 makka-pakka onk

There is an analogous function dplyr::bind_cols() that concatenates data frames horizontally.
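
As a small illustration (a sketch using made-up tibbles rather than the nightgarden data), bind_cols() simply pastes data frames side by side, so it’s up to you to make sure the rows are already in the same order:

speakers   <- tibble(speaker = c("upsy-daisy", "tombliboo"))
lines_said <- tibble(utterance = c("pip", "ee"))
bind_cols(speakers, lines_said)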

16.6.2 Joining

More complicated situations arise when each data frame potentially contains information about the same cases and variables, but not necessarily all the same ones. A common situation in real research is when participants complete two different tasks (e.g. a questionnaire and an experimental task) and the data for each task are stored separately. Ideally we might hope to have data from every participant for both tasks, but that doesn’t always happen. As a toy example, suppose we collected some demographics for major characters from Terry Pratchett’s Discworld novels:

demographics <- read_csv("./data/discworld_demographics.csv")
demographics
## # A tibble: 4 x 3
##   id        gender residence   
##   <chr>     <chr>  <chr>       
## 1 rincewind male   ankh-morpork
## 2 vimes     male   ankh-morpork
## 3 granny    female lancre      
## 4 death     other  other

Only some of the characters completed this questionnaire however. We might also have obtained responses from Discworld characters to some experimental task, and again we have partial data:

responses <- read_csv("./data/discworld_responses.csv")
responses
## # A tibble: 3 x 4
##   id     answer1 answer2 answer3
##   <chr>  <chr>   <chr>   <chr>  
## 1 vimes  yes     no      no     
## 2 granny no      no      no     
## 3 brutha maybe   maybe   yes

How should we join these two data frames together? In general there is no right or wrong answer to this question, merely different possibilities that will be of use in different situations:

  • An inner join contains only those rows that appear in both data frames
  • A full join contains every row that appears in at least one data frame
  • A left join contains every row that appears in the first data frame
  • A right join contains every row that appears in the second data frame

In order to work out which rows are matched across the two data frames, the default is to use any variable name that appears in both data sets (though this can be customised with the by argument; there’s a brief example below). For the Discworld data, the only two characters that appear in both the demographics data set and the responses data are Vimes and Granny, so when we construct an inner join, we end up with a data frame that includes only those two characters:

inner_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 2 x 6
##   id     gender residence    answer1 answer2 answer3
##   <chr>  <chr>  <chr>        <chr>   <chr>   <chr>  
## 1 vimes  male   ankh-morpork yes     no      no     
## 2 granny female lancre       no      no      no
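
If you want to be explicit about the matching variable, you can spell it out with the by argument. Here "id" is the only shared column, so this sketch should give the same result without the "Joining, by" message:

inner_join(demographics, responses, by = "id")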

By contrast, if we construct a full join, every character is included in the result, with NA values inserted for all the missing cases:

full_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 5 x 6
##   id        gender residence    answer1 answer2 answer3
##   <chr>     <chr>  <chr>        <chr>   <chr>   <chr>  
## 1 rincewind male   ankh-morpork <NA>    <NA>    <NA>   
## 2 vimes     male   ankh-morpork yes     no      no     
## 3 granny    female lancre       no      no      no     
## 4 death     other  other        <NA>    <NA>    <NA>   
## 5 brutha    <NA>   <NA>         maybe   maybe   yes

Suppose we want to retain data only for those characters that completed the demographics survey. A left join is what we need here:

left_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 4 x 6
##   id        gender residence    answer1 answer2 answer3
##   <chr>     <chr>  <chr>        <chr>   <chr>   <chr>  
## 1 rincewind male   ankh-morpork <NA>    <NA>    <NA>   
## 2 vimes     male   ankh-morpork yes     no      no     
## 3 granny    female lancre       no      no      no     
## 4 death     other  other        <NA>    <NA>    <NA>

By analogy, if we wanted to retain data only for those characters that provided responses, we would use the right_join() function, as sketched below. It’s also worth noting that there are also semi_join() and anti_join() functions that can handle other kinds of situations, but I won’t go into those here.
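
Here is that right join sketch (output omitted): it should keep the rows for Vimes, Granny and Brutha, with NA values filled in for Brutha’s demographics.

right_join(demographics, responses)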

16.7 More resources