When I introduced variables earlier, I showed you how we can use them to store a single number. In this section, we’ll extend this idea and look at how to store multiple numbers within the one variable. In R the name for a variable that can store multiple values is a vector. So let’s create one.
Let’s return to the example we were working with in the previous section on variables. We’re designing a survey, and we want to keep track of the responses that a participant has given. This time, let’s imagine that we’ve finished running the survey and we’re examining the data. Suppose we’ve administered the Depression, Anxiety and Stress Scale (DASS) and as a consequence every participant has scores for on the depression, anxiety and stress scales provided by the DASS. One thing we might want to do is create a single variable called scale_name
that identifies the three scales. The simplest way to do this in R is to use the combine function, c
.1 To do so, all we have to do is type the values we want to store in a comma separated list, like this
scale_name <- c("depression","anxpiety","stress")
scale_name
## [1] "depression" "anxpiety" "stress"
To use the correct terminology here, we have a single variable here called scale_name
: this variable is a vector that has three elements. Because the vector contains text, it is a character vector. You can use the length
function to check the length, and the class
function to check what kind of vector it is:
length(scale_name)
## [1] 3
class(scale_name)
## [1] "character"
As you might expect, we can define numeric or logical variables in the same way. For instance, we could define the raw scores on the three DASS scales like so:
raw_score <- c(12, 3, 8)
raw_score
## [1] 12 3 8
We’ll talk about logical vectors in a moment.
If I want to extract the first element from the vector, all I have to do is refer to the relevant numerical index, using square brackets to do so. For example, to get the first element of scale_name
I would type this
scale_name[1]
## [1] "depression"
The second element of the vector is
scale_name[2]
## [1] "anxpiety"
You get the idea.2
There are a few ways to extract multiple elements of a vector. The first way is to specify a vector that contains the indices of the variables that you want to keep. To extract the first two scale names:
scale_name[c(1,2)]
## [1] "depression" "anxpiety"
Alternatively, R provides a convenient shorthand notation in which 1:2
is a vector containing the nubmers from 1 to 2, and similarly 1:10
is a vector containing the numbers from 1 to 10. So this is also the same:
scale_name[1:2]
## [1] "depression" "anxpiety"
Notice that order matters here. So if I do this
scale_name[c(2,1)]
## [1] "anxpiety" "depression"
I get the same numbers, but in the reverse order.
Finally, when working with vectors, R allows us to use negative numbers to indicate which elements to remove. So this is yet another way of doing the same thing:
scale_name[-3]
## [1] "depression" "anxpiety"
Notice that done of this has changed the original variable. The scale_name
itself has remained completely untouched.
scale_name
## [1] "depression" "anxpiety" "stress"
Sometimes you’ll want to change the values stored in a vector. Imagine my surprise when a student points out that "anxpiety"
is not in fact a real thing. I should probably fix that 😳. One possibility would be to assign the whole vector again from the beginning, using c
. But that’s a lot of typing. Also, it’s a little wasteful: why should R have to redefine the names for all three scales, when only the second one is wrong? Fortunately, we can tell R to change only the second element, using this trick:
scale_name[2] <- "anxiety"
scale_name
## [1] "depression" "anxiety" "stress"
That’s better.
Another way to edit variables in is to use the edit
function. I won’t go into that here, but if you’re curious, try typing a command like this:
edit(scale_name)
One very handy thing in R is that it lets you assign meaningul names to the different elements in a vector. For example, the raw_scores
vector that we introduced earlier contains the actual data from a study but when you print it out on its own
raw_score
## [1] 12 3 8
its not obvious what each of the scores corresponds to. There are several different ways of making this a little more meaningful (and we’ll talk about them later) but for now I want to show one simple trick. Ideally, what we’d like to do is have R remember that the first element of the raw_score
is the “depression” score, the second is “anxiety” and the third is “stress”. We can do that like this:
names(raw_score) <- scale_name
This is a bit of an unusual looking assignment statement. Usually, whenever we use <-
the thing on the left hand side is the variable itself (i.e., raw_score
) but this time around the left hand side refers to the names. To see what this command has done, let’s get R to print out the raw_score
variable now:
raw_score
## depression anxiety stress
## 12 3 8
That’s a little nicer. Element names don’t just look nice, they’re functional too. You can refer to the elements of a vector using their names, like so:
raw_score["anxiety"]
## anxiety
## 3
One really nice thing about vectors is that a lot of R functions and operators will work on the whole vector at once. For instance, suppose I want to normalise the raw scores from the DASS. Each scale of the DASS is constructed from 14 questions that are rated on a 0-3 scale, so the minimum possible score is 0 and the maximum is 42. Suppose I wanted to rescale the raw scores to lie on a scale from 0 to 1. I can create the scaled_score
variable like this:
scaled_score <- raw_score / 42
scaled_score
## depression anxiety stress
## 0.28571429 0.07142857 0.19047619
In other words, when you divide a vector by a single number, all elements in the vector get divided. The same is true for addition, subtraction, multiplicattion and taking powers. So that’s neat.
Suppose it later turned out that I’d made a mistake. I hadn’t in fact administered the complete DASS, only the first page. As noted in the DASS website, it’s possible to fix this mistake (sort of). First, I have to recognise that my scores are actually out of 21 not 42, so the calculation I should have done is this:
scaled_score <- raw_score / 21
scaled_score
## depression anxiety stress
## 0.5714286 0.1428571 0.3809524
Then, it turns out that page 1 of the full DASS is almost the same as the short form of the DASS, but there’s a correction factor you have to apply. The depression score needs to be multiplied by 1.04645, the anxiety score by 1.02284, and stress by 0.98617
correction_factor <- c(1.04645, 1.02284, 0.98617)
corrected_score <- scaled_score * correction_factor
corrected_score
## depression anxiety stress
## 0.5979714 0.1461200 0.3756838
What this has done is multiply the first element of scaled_score
by the first element of correction_factor
, multiply the second element of scaled_score
by the second element of correction_factor
, and so on.
I’ll talk more about calculations involving vectors later, because they come up a lot. In particular R has a thing called the recycling rule that is worth knowing about.3 But that’s enough detail for now.
I mentioned earlier that we can define vectors of logical values in the same way that we can store vectors of numbers and vectors of text, again using the c
function to combine multiple values. Logical vectors can be useful as data in their own right, but the thing that they’re expecially useful for is extracting elements of another vector, which is referred to as logical indexing.
Here’s a simple example. Suppose I decide that the stress scale is not very useful for my study, and I only want to keep the first two elements, depression and anxiety. One way to do this is to define a logical vector that indicates which values to keep
:
keep <- c(TRUE, TRUE, FALSE)
keep
## [1] TRUE TRUE FALSE
In this instance the keep
vector indicates that it is TRUE
that I want to retain the first two elements, and FALSE
that I want to keep the third. So if I type this
corrected_score[keep]
## depression anxiety
## 0.5979714 0.1461200
R prints out the corrected scores for the two variables only. As usual, note that this hasn’t changed the original variable. If I print out the original vector…
corrected_score
## depression anxiety stress
## 0.5979714 0.1461200 0.3756838
… all three values are still there. If I do want to create a new variable, I need to explicitly assign the results of my previous command to a variable.
Let’s suppose that I want to call the new variable short_score
, indicating that I’ve only retained some of the scales. Here’s how I do that:
short_score <- corrected_score[keep]
short_score
## depression anxiety
## 0.5979714 0.1461200
At this point, I hope you can see why logical indexing is such a useful thing. It’s a very basic, yet very powerful way to manipulate data. For intance, I might want to extract the scores of the adult participants in a study, which would probably involve a command like scores[age > 18]
. The operation age > 18
would return a vector of TRUE
and FALSE
values, and so the the full command scores[age > 18]
would return only the scores
for participants with age > 18
. It does take practice to become completely comfortable using logical indexing, so it’s a good idea to play around with these sorts of commands. Practice makes perfect, and it’s only by practicing logical indexing that you’ll perfect the art of yelling frustrated insults at your computer.
c
to create a numeric vector called age
that lists the ages of four people (e.g., 19, 34, 7 and 67)[]
to print out the age
of the second person.[]
to print out the age
of the second person and third personsc
to create a character vector called gender
that lists the gender of those four peopleadult
that indicates whether each participant was 18 or older. Instead of using c
, try using a logical operator like >
or >=
to automatically create adult
from age
gender
of all the adult
participants.The solutions for these exercises are here
Notice that I didn’t specify any argument names here. The c
function is one of those cases where we don’t use names. We just type all the numbers, and R just dumps them all in a single variable.↩︎
Note that the square brackets here are used to index the elements of the vector, and that this is the same notation that we see in the R output. That’s not accidental: when R prints [1] "depression"
to the screen what it’s saying is that "depression"
is the first element of the output. When the output is long enough, you’ll often see other numbers at the start of each line of the output.↩︎
The recycling rule: if two vectors are of unequal length, the values of shorter one will be “recycled”. To get a feel for how this works, try setting x <- c(1,1,1,1,1)
and y <- c(2,7)
and then getting R to evaluate x + y
↩︎