Data manipulation isn’t a particularly well defined topic. I take it to cover a few topics:
Earlier sections of the notes covered variations of this:
In the data visualisation we used a data set looking at inductive reasoning stored in the frames_ex2.csv file. Here it is again:
frames <- read_csv("./data/frames_ex2.csv")
frames
## # A tibble: 4,725 x 8
## id gender age condition sample_size n_obs test_item
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 male 36 category small 2 1
## 2 1 male 36 category small 2 2
## 3 1 male 36 category small 2 3
## 4 1 male 36 category small 2 4
## 5 1 male 36 category small 2 5
## 6 1 male 36 category small 2 6
## 7 1 male 36 category small 2 7
## 8 1 male 36 category medium 6 1
## 9 1 male 36 category medium 6 2
## 10 1 male 36 category medium 6 3
## # … with 4,715 more rows, and 1 more variable: response <dbl>
In this file the data are stored in long form, in which there is one row for every trial in the experiment and thus the data from each participant are spread across several rows. I stored the data in that format because it was handy for drawing pictures, but as it happen this isn’t how the data set was stored when I extracted it from the OSF repository in which I’d deposited it only a few months earlier! Here’s what it looked like when I found it:
wide_frames <- read_csv("./data/frames_ex2_wide.csv")
wide_frames
## # A tibble: 225 x 25
## ID Gender Age Sampling `SS2-R1` `SS2-R2` `SS2-R3`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 male 36 category 8 7 6
## 2 2 male 46 category 7 5 5
## 3 3 female 33 property 8 7 7
## 4 4 female 71 property 6 7 6
## 5 5 female 23 property 9 9 9
## 6 6 female 31 category 7 7 7
## 7 7 male 23 property 6 7 5
## 8 8 female 31 property 4 4 2
## 9 9 female 37 category 9 9 7
## 10 10 female 46 category 8 6 4
## # … with 215 more rows, and 18 more variables: SS2-R4 <dbl>,
## # SS2-R5 <dbl>, SS2-R6 <dbl>, SS2-R7 <dbl>, SS6-R1 <dbl>,
## # SS6-R2 <dbl>, SS6-R3 <dbl>, SS6-R4 <dbl>, SS6-R5 <dbl>,
## # SS6-R6 <dbl>, SS6-R7 <dbl>, SS12-R1 <dbl>, SS12-R2 <dbl>,
## # SS12-R3 <dbl>, SS12-R4 <dbl>, SS12-R5 <dbl>, SS12-R6 <dbl>,
## # SS12-R7 <dbl>
In this file, the data are stored in wide form. The data frame contains only one row per person, and each judgment they make is stored as a separate variable. Wide and long form data are useful for a number of purposes. How do we switch between them, “reshaping” the data to the form we need?
Once upon a time this used to be a difficult task, so much so that I wrote my own (not very good) functions that would solve this problem in a way that didn’t expose my students to some of the more frustrating aspects of data manipulation. Thankfully, this has changed. There is a little package distributed with the tidyverse called tidyr and it is remarkably effective at solving a range of reshaping problems in a (fairly) intuitive way.
long_frames <- wide_frames %>%
gather(key = "query", value="response",
"SS2-R1", "SS2-R2","SS2-R3","SS2-R4","SS2-R5","SS2-R6","SS2-R7",
"SS6-R1","SS6-R2","SS6-R3","SS6-R4","SS6-R5","SS6-R6","SS6-R7",
"SS12-R1","SS12-R2","SS12-R3","SS12-R4","SS12-R5","SS12-R6","SS12-R7")
long_frames
## # A tibble: 4,725 x 6
## ID Gender Age Sampling query response
## <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 1 male 36 category SS2-R1 8
## 2 2 male 46 category SS2-R1 7
## 3 3 female 33 property SS2-R1 8
## 4 4 female 71 property SS2-R1 6
## 5 5 female 23 property SS2-R1 9
## 6 6 female 31 category SS2-R1 7
## 7 7 male 23 property SS2-R1 6
## 8 8 female 31 property SS2-R1 4
## 9 9 female 37 category SS2-R1 9
## 10 10 female 46 category SS2-R1 8
## # … with 4,715 more rows
To go back the other way, we could do this:
long_frames %>% spread(key = "query", value = "response")
## # A tibble: 225 x 25
## ID Gender Age Sampling `SS12-R1` `SS12-R2` `SS12-R3`
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 male 36 category 8 7 6
## 2 2 male 46 category 9 9 8
## 3 3 female 33 property 8 9 9
## 4 4 female 71 property 9 8 3
## 5 5 female 23 property 9 9 9
## 6 6 female 31 category 9 9 8
## 7 7 male 23 property 8 7 6
## 8 8 female 31 property 8 8 7
## 9 9 female 37 category 9 8 1
## 10 10 female 46 category 9 9 8
## # … with 215 more rows, and 18 more variables: SS12-R4 <dbl>,
## # SS12-R5 <dbl>, SS12-R6 <dbl>, SS12-R7 <dbl>, SS2-R1 <dbl>,
## # SS2-R2 <dbl>, SS2-R3 <dbl>, SS2-R4 <dbl>, SS2-R5 <dbl>,
## # SS2-R6 <dbl>, SS2-R7 <dbl>, SS6-R1 <dbl>, SS6-R2 <dbl>,
## # SS6-R3 <dbl>, SS6-R4 <dbl>, SS6-R5 <dbl>, SS6-R6 <dbl>,
## # SS6-R7 <dbl>
The long_frames
data frame we created is close to what we need, but the query
variable isn’t quite right. A value of "SS2-R1"
corresponds to a trial on which the sample size was 2 and the test item was 1. This should properly be two variables, one specifying the sample_size
and the other specifying the value of the test_item
.
long_frames %>%
separate(
col = query,
into = c("sample_size","test_item"),
sep = "-R"
)
## # A tibble: 4,725 x 7
## ID Gender Age Sampling sample_size test_item response
## <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 male 36 category SS2 1 8
## 2 2 male 46 category SS2 1 7
## 3 3 female 33 property SS2 1 8
## 4 4 female 71 property SS2 1 6
## 5 5 female 23 property SS2 1 9
## 6 6 female 31 category SS2 1 7
## 7 7 male 23 property SS2 1 6
## 8 8 female 31 property SS2 1 4
## 9 9 female 37 category SS2 1 9
## 10 10 female 46 category SS2 1 8
## # … with 4,715 more rows
Getting closer. Note that there is a unite()
function that reverses this operation, so if you ever need to collapse multiple columns into a single variable that would be the way to do it. It’s also worth noting that the two new variables aren’t quite in the format we want them to be. For instance, the test_item
variable should be numeric rather than character, so lets mutate
that and store the results:
long_frames <- long_frames %>%
separate(
col = query,
into = c("sample_size","test_item"),
sep = "-R"
) %>%
mutate(test_item = as.numeric(test_item))
long_frames
## # A tibble: 4,725 x 7
## ID Gender Age Sampling sample_size test_item response
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 male 36 category SS2 1 8
## 2 2 male 46 category SS2 1 7
## 3 3 female 33 property SS2 1 8
## 4 4 female 71 property SS2 1 6
## 5 5 female 23 property SS2 1 9
## 6 6 female 31 category SS2 1 7
## 7 7 male 23 property SS2 1 6
## 8 8 female 31 property SS2 1 4
## 9 9 female 37 category SS2 1 9
## 10 10 female 46 category SS2 1 8
## # … with 4,715 more rows
recode()
What should we do with the sample_size
variable?
long_frames <- long_frames %>%
mutate(sample_size = dplyr::recode(
sample_size,
"SS2" = "small",
"SS6" = "medium",
"SS12" = "large")
)
long_frames
## # A tibble: 4,725 x 7
## ID Gender Age Sampling sample_size test_item response
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 male 36 category small 1 8
## 2 2 male 46 category small 1 7
## 3 3 female 33 property small 1 8
## 4 4 female 71 property small 1 6
## 5 5 female 23 property small 1 9
## 6 6 female 31 category small 1 7
## 7 7 male 23 property small 1 6
## 8 8 female 31 property small 1 4
## 9 9 female 37 category small 1 9
## 10 10 female 46 category small 1 8
## # … with 4,715 more rows
case_when()
An alternative method is to use case_when()
.
long_frames <- long_frames %>%
mutate(n_obs = case_when(
sample_size == "small" ~ 2,
sample_size == "medium" ~ 6,
sample_size == "large" ~ 12
))
long_frames
## # A tibble: 4,725 x 8
## ID Gender Age Sampling sample_size test_item response
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 male 36 category small 1 8
## 2 2 male 46 category small 1 7
## 3 3 female 33 property small 1 8
## 4 4 female 71 property small 1 6
## 5 5 female 23 property small 1 9
## 6 6 female 31 category small 1 7
## 7 7 male 23 property small 1 6
## 8 8 female 31 property small 1 4
## 9 9 female 37 category small 1 9
## 10 10 female 46 category small 1 8
## # … with 4,715 more rows, and 1 more variable: n_obs <dbl>
Now we have a data frame that is almost identical to the one we imported in the previous section. Apart from some cosmetic changes to variable names and order, the job is complete!
Empirical data sets often include date fields. Sometimes these dates are nicely formatted and are easy to work with. Most of the time they are not. Even if they were, dates are inherently painful: a year isn’t a perfect multiple of days, so a solar calendar needs leap years and other corrections. The year doesn’t divide easily into 12 months so months have a variable number of days and doing “addition” with months is really weird. The week days don’t distribute themelves cleanly across months or years, so those are a headache too 🤕. It’s not purely an issue of nomenclature. We organise our lives around seasons, weekdays/weekends, day/night and various other cycles that don’t always play nicely with one another. The weirdness of dates reflects the structure of the world and our lives, and data analysts can’t pretend that this complexity isn’t there. To make things more annoying there are cultural and lingustic variations even within the English speaking world, so that is an additional source of ambiguity. Does 9/11 refer to September 11 or November 9? It depends on where you live. 🤷. Because of all this, date specification is hard. The Date
class in R provides a relatively decent way to handle this kind of data, and the lubridate package (another tidyverse one!) provides a very helpful tool to make it easier. It doesn’t load automatically so lets do that now:
library(lubridate)
Date information is often specified in terms of “year”, “month” and “day”, though not always in that order. lubridate provides a collection of functions ymd
, mdy
, etc that are fairly smart and able to parse information in many different formats. As long as you know the order of the information, these functions work really well:
ymd(20101215)
ymd("2010/12/15")
ymd("2010 December 15")
## [1] "2010-12-15"
## [1] "2010-12-15"
## [1] "2010-12-15"
Despite the variations in formatting the function is able to work out that these all correspond to the same date and returns an variable of class Date
. This gives you quite a bit of flexibility:
ymd(20101215) == mdy("12/15/10")
## [1] TRUE
Nice! The functions are smart enough to handle strings and numbers as input, and if the year is indeterminate it assumes a 21st century date. That said, they do rely on you knowing the correct order. The same numeric input, formatted differently…
dmy(20101215)
## [1] "1215-10-20"
… might indeed correspond to 20th October 1215. This might be plausible if the data set somehow pertained to the Magna Carta, but probably not for most settings of interest to psychologists!
To give you a sense of how annoying working with dates can be, the following is a real example. I asked my partner – who in real life is a statistics consultant among other things – to send me a data set she’d been sent with poorly formatted date information. This is what she sent me:
oddball <- readLines("./data/oddballdate2.csv")
oddball
## [1] "4/4/16 9:00" "4/4/16 14:13" "6/4/16 5:55"
## [4] "6/4/16 11:27" "8/4/16 15:04" "12/4/16 12:29"
## [7] "14/4/16 12:14" "14/4/16 19:16" "18/4/16 16:18"
## [10] "20/4/16 11:10" "28/4/16 16:10" "28/4/16"
## [13] "1/5/16 17:11" "2/5/16 14:53" "3/5/16 10:32"
## [16] "3/5/16 12:46" "3/5/16 13:08" "3/5/16 15:51"
## [19] "4/5/16 11:04" "4/5/16 11:13" "4/5/16 11:16"
## [22] "4/5/16 12:54" "4/5/16 14:40" "4/5/16 20:16"
## [25] "6/5/16 15:13" "10/5/16 11:17" "11/5/16 12:41"
## [28] "12/5/16" "12/5/16" "5/13/16"
## [31] "16/5/16 10:11" "16/5/16 22:47" "17/5/16 15:20"
## [34] "18/5/16" "18/5/16 10:32" "19/5/16 2:27"
## [37] "19/5/16 10:59" "19/5/16"
To keep this example minimal, I’ve used readLines()
to import the data as a single character vector. As you can see this pretty inconsistent, and my partner’s face when I turned to her in horror can only be described with emoji: 😈
A brief look at this suggests that most of the data have a time stamp as well as date information, which suggests that the dmy_hm
function should be able to parse them. Others are date only, suggesting that dmy
should work. If I try these individually this is what happens:
dmy(oddball)
## Warning: 33 failed to parse.
## [1] NA NA NA NA
## [5] NA NA NA NA
## [9] NA NA NA "2016-04-28"
## [13] NA NA NA NA
## [17] NA NA NA NA
## [21] NA NA NA NA
## [25] NA NA NA "2016-05-12"
## [29] "2016-05-12" NA NA NA
## [33] NA "2016-05-18" NA NA
## [37] NA "2016-05-19"
dmy_hm(oddball)
## Warning: 6 failed to parse.
## [1] "2016-04-04 09:00:00 UTC" "2016-04-04 14:13:00 UTC"
## [3] "2016-04-06 05:55:00 UTC" "2016-04-06 11:27:00 UTC"
## [5] "2016-04-08 15:04:00 UTC" "2016-04-12 12:29:00 UTC"
## [7] "2016-04-14 12:14:00 UTC" "2016-04-14 19:16:00 UTC"
## [9] "2016-04-18 16:18:00 UTC" "2016-04-20 11:10:00 UTC"
## [11] "2016-04-28 16:10:00 UTC" NA
## [13] "2016-05-01 17:11:00 UTC" "2016-05-02 14:53:00 UTC"
## [15] "2016-05-03 10:32:00 UTC" "2016-05-03 12:46:00 UTC"
## [17] "2016-05-03 13:08:00 UTC" "2016-05-03 15:51:00 UTC"
## [19] "2016-05-04 11:04:00 UTC" "2016-05-04 11:13:00 UTC"
## [21] "2016-05-04 11:16:00 UTC" "2016-05-04 12:54:00 UTC"
## [23] "2016-05-04 14:40:00 UTC" "2016-05-04 20:16:00 UTC"
## [25] "2016-05-06 15:13:00 UTC" "2016-05-10 11:17:00 UTC"
## [27] "2016-05-11 12:41:00 UTC" NA
## [29] NA NA
## [31] "2016-05-16 10:11:00 UTC" "2016-05-16 22:47:00 UTC"
## [33] "2016-05-17 15:20:00 UTC" NA
## [35] "2016-05-18 10:32:00 UTC" "2016-05-19 02:27:00 UTC"
## [37] "2016-05-19 10:59:00 UTC" NA
Each function converts the dates it can, inserts NA
for the dates it can’t, and loudly complain that something is amiss. That’s pretty reasonable.
d1 <- oddball %>% dmy_hm() %>% date()
d2 <- oddball %>% dmy()
clean_dates <- case_when(
is.na(d1) ~ d2,
TRUE ~ d1
)
clean_dates
## [1] "2016-04-04" "2016-04-04" "2016-04-06" "2016-04-06"
## [5] "2016-04-08" "2016-04-12" "2016-04-14" "2016-04-14"
## [9] "2016-04-18" "2016-04-20" "2016-04-28" "2016-04-28"
## [13] "2016-05-01" "2016-05-02" "2016-05-03" "2016-05-03"
## [17] "2016-05-03" "2016-05-03" "2016-05-04" "2016-05-04"
## [21] "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-04"
## [25] "2016-05-06" "2016-05-10" "2016-05-11" "2016-05-12"
## [29] "2016-05-12" NA "2016-05-16" "2016-05-16"
## [33] "2016-05-17" "2016-05-18" "2016-05-18" "2016-05-19"
## [37] "2016-05-19" "2016-05-19"
Well that almost worked. What’s the one NA
value?
oddball[is.na(clean_dates)]
## [1] "5/13/16"
Sigh. Looking at all the other dates, every single one is in April or May… but this last one is ostensibly on the 5th day of the 13th month? Obviously we have one entry formatted in a mdy()
form. I will leave it as an exercise to the reader to work out how to fix that one 😀
Another common data manipulation problem arises when we need to combine multiple data frames. Two versions of this problem are worth talking about, namely binding and joining
Binding data frames is simple. Suppose we have multiple data frames that have the same variables, and we want to concatenate them. For instance, here are three small data frames:
nightgarden_top <- read_csv("./data/nightgarden_top.csv")
nightgarden_middle <- read_csv("./data/nightgarden_middle.csv")
nightgarden_bottom <- read_csv("./data/nightgarden_bottom.csv")
Each of these contains a subset of the rows we need:
nightgarden_middle
## # A tibble: 2 x 3
## line speaker utterance
## <dbl> <chr> <chr>
## 1 6 tombliboo oo
## 2 7 makka-pakka pip
We can use dplyr::bind_rows()
to concatenate these data frames vertically:
bind_rows(nightgarden_top, nightgarden_middle, nightgarden_bottom)
## # A tibble: 10 x 3
## line speaker utterance
## <dbl> <chr> <chr>
## 1 1 upsy-daisy pip
## 2 2 upsy-daisy pip
## 3 3 upsy-daisy onk
## 4 4 upsy-daisy onk
## 5 5 tombliboo ee
## 6 6 tombliboo oo
## 7 7 makka-pakka pip
## 8 8 makka-pakka pip
## 9 9 makka-pakka onk
## 10 10 makka-pakka onk
There is an analogous function dplyr::bind_columns()
that concatenates horizontally.
More complicated situations arise when each data frame potentially contains information about the same cases and variables, but not necessarily all the same ones. A common situation in which this arises in real research is when participants complete two different tasks (e.g. a questionnaire and an experimental task) and the data are stored separately from each tasks. Ideally we might hope to have data from every participant for do both versions, but that does not always occur. As a toy example, suppose we collected some demographics
for major characters from Terry Pratchett’s Discworld novels:
demographics <- read_csv("./data/discworld_demographics.csv")
demographics
## # A tibble: 4 x 3
## id gender residence
## <chr> <chr> <chr>
## 1 rincewind male ankh-morpork
## 2 vimes male ankh-morpork
## 3 granny female lancre
## 4 death other other
Only some of the characters completed this questionnaire however. We might also have obtained responses
from Discworld characters to some experimental task, and again we have partial data:
responses <- read_csv("./data/discworld_responses.csv")
responses
## # A tibble: 3 x 4
## id answer1 answer2 answer3
## <chr> <chr> <chr> <chr>
## 1 vimes yes no no
## 2 granny no no no
## 3 brutha maybe maybe yes
How should we join these two data frames together? In general there is no right or wrong answer to this question, merely different possibilities that will be of use in different situations:
In order to work out which rows are matched in the different data frames, the default is to use any variable name that appears in both data sets (though this can be customised). For the Discworld data, the only two characters that appear in both the demographics
data set and the reponses
data are Vimes and Granny, so when we construct an inner join, we end up with a data frame that includes only those two characters:
inner_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 2 x 6
## id gender residence answer1 answer2 answer3
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 vimes male ankh-morpork yes no no
## 2 granny female lancre no no no
By contrast, if we construct a full join, every character is included in the result, with NA
values inserted for all the missing cases:
full_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 5 x 6
## id gender residence answer1 answer2 answer3
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 rincewind male ankh-morpork <NA> <NA> <NA>
## 2 vimes male ankh-morpork yes no no
## 3 granny female lancre no no no
## 4 death other other <NA> <NA> <NA>
## 5 brutha <NA> <NA> maybe maybe yes
Suppose we want to retain data only for those characters that completed the demographics
survey. A left join is what we need here:
left_join(demographics, responses)
## Joining, by = "id"
## # A tibble: 4 x 6
## id gender residence answer1 answer2 answer3
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 rincewind male ankh-morpork <NA> <NA> <NA>
## 2 vimes male ankh-morpork yes no no
## 3 granny female lancre no no no
## 4 death other other <NA> <NA> <NA>
By analogy, if we wanted to retain data only for those characters that provided responses
, we would use the right_join()
function. It’s also worth noting that there also semi_join()
and anti_join()
functions that can handle other kinds of situations, but I won’t go into those here.