25.1 Twitter client

Researchers interested in social networks often scrape data from sites such as Twitter in order to obtain data. This is relatively easy to do in R, using a package like twitteR, which provides an interface with the Twitter web API.

25.1.1 Setting up twitteR

It’s fairly easy to get set up (e.g. this post):

  1. Install the twitteR package
  2. Get a twitter account
    • I have @lsrbook for this
    • you do need to add a mobile number (for Australian numbers, drop the leading 0)
  3. Go to https://apps.twitter.com (sign in with your account)
  4. Click on “create new app”
  5. Enter the information it requests: you need a name, a description, a website. For my client I set
  6. Agree to terms and conditions

At this point the app is created on the Twitter side. You’ll need to get some information to allow R access to the app:

  1. Click on “key & access token” tab and get the following:
    • Consumer Key (API Key)
    • Consumer Secret (API Secret)
  2. Go to the “create my access token” button:
    • Access Token
    • Access Token Secret

This is all R needs so go back to R and enter the following:

consumer_key <- "XXXXX"
consumer_secret <- "XXXXX"
access_token <- "XXXXX"
access_secret <- "XXXXX"

where the "XXXXX" values are of course the keys you just downloaded. Within R the command to authorise the twitteR package access to the app is:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

Now we’re ready to go!

25.1.2 Using the Twitter client

Okay, so I guess people like to tweet about #thedrum so I decided to search for the last 10000 tweets containing that term. It’s easy to do:

library(twitteR)

drumtweet10k <- searchTwitter(
  searchString = "thedrum", 
  n=10000
)

The raw data are saved in the dt10k.Rdata. The format of the data is a little complicated, so I did a tiny bit of text processing and tidying, and then saved the results to dt10k-sml.Rdata. Let’s take a look:

load("./data/dt10k-sml.Rdata")
library(lsr)
who()
##    -- Name --   -- Class --   -- Size --
##    dt10k_stp    character     116860    
##    dt10k_txt    character     10000     
##    freq         table         16529     
##    stopchars    character     25        
##    stopwords    character     120

In the full data set the twitter client has downloaded a lot of information about each tweet, but in this simpler versionm dt10k_txt variable contains only the raw text of each tweet. Here’s the first few tweets:

dt10k_txt[1:5]
## [1] "RT @deniseshrivell: Joyce and the nationals deserve to be called out &amp; held to account on their complete lack of policy in the public inter…"
## [2] "Omnicom promotes Mark Halliday to CEO of perfomance and announces departure of Lee Smith https://t.co/ePom6lnAeX"                                
## [3] "10 questions with… Sarah Ronald, founding director of Nile https://t.co/ZZQlPMe6fr"                                                              
## [4] "Reeha Alder Shah, managing partner for AnalogFolk: BAME professionals should 'embrace their differences' https://t.co/QXKGMqnGvI"                
## [5] "Xiaomi and Microsoft sign deal to collaborate on AI, cloud and hardware https://t.co/vkelwHc3a2"

The dt10k_stp vector concatenates all the tweets, splits them so that each word is a separate element, after removing punctuation characters, converting everthing to lower case, and removing certain stopwords that are very high frequency but rarely interesting:

dt10k_stp[1:50]
##  [1] "rt"                 "@deniseshrivell"   
##  [3] "joyce"              "nationals"         
##  [5] "deserve"            "called"            
##  [7] "out"                "held"              
##  [9] "account"            "complete"          
## [11] "lack"               "policy"            
## [13] "public"             "inter"             
## [15] "omnicom"            "promotes"          
## [17] "mark"               "halliday"          
## [19] "ceo"                "perfomance"        
## [21] "announces"          "departure"         
## [23] "lee"                "smith"             
## [25] "httpstcoepom6lnaex" "10"                
## [27] "questions"          "sarah"             
## [29] "ronald"             "founding"          
## [31] "director"           "nile"              
## [33] "httpstcozzqlpme6fr" "reeha"             
## [35] "alder"              "shah"              
## [37] "managing"           "partner"           
## [39] "analogfolk"         "bame"              
## [41] "professionals"      "embrace"           
## [43] "differences"        "httpstcoqxkgmqngvi"
## [45] "xiaomi"             "microsoft"         
## [47] "sign"               "deal"              
## [49] "collaborate"        "ai"

The freq variable is a frequency table for the words in dt10k_stp, sorted in order of decreasing frequency. Here are the top 20 words:

names(freq[1:20])
##  [1] "rt"               "@thedrum"         "#thedrum"        
##  [4] "via"              "@abcthedrum"      "@stopadanicairns"
##  [7] "new"              "#auspol"          "more"            
## [10] "people"           "advertising"      "barnaby"         
## [13] "#qldpol"          "marketing"        "ad"              
## [16] "up"               "out"              "dont"            
## [19] "joyce"            "time"

Just to illustrate that there is psychological value in this approach, here’s the standard “rank-frequency plot”, showing the signature (approximately) power-law behaviour. There are a few extremely common words, and then a very long tail of low frequency words. Variations on this pattern are ubiquitous in natural language:

plot(
  x = freq[1:100], 
  xlab="Word", 
  ylab="Frequency"
)

That said, I do wonder how much of this data set is spam. There seem to be a lot of tweets about blockchain in there, which does make me suspicious. I may have to revisit this example in the future!