
streamR: Access to Twitter Streaming API via R

This package includes a series of functions that give R users access to Twitter's Streaming API, as well as a tool that parses the captured tweets and transforms them into R data frames, which can then be used in subsequent analyses. streamR requires authentication via OAuth and the ROAuth package.

The current CRAN release is 0.2.1. To install the latest version (0.4.0) from GitHub, type:

library(devtools)
devtools::install_github("pablobarbera/streamR/streamR")

See the package documentation and the vignette for more details.

Installation and authentication

streamR can be installed directly from CRAN, but the latest version will always be on GitHub. The code below shows how to install from both sources.

install.packages("streamR")  # from CRAN
devtools::install_github("pablobarbera/streamR/streamR") # from GitHub

streamR requires authentication via OAuth. The same OAuth token can be used for both twitteR and streamR. After creating an application on Twitter's developer site and obtaining the consumer key and consumer secret, it is easy to create your own OAuth credentials using the ROAuth package, which can be saved to disk for future sessions:

library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "xxxxxyyyyyzzzzzz"
consumerSecret <- "xxxxxxyyyyyzzzzzzz111111222222"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, 
    requestURL = requestURL, accessURL = accessURL, authURL = authURL)
# handshake() will print a URL to authorize the application and ask for the PIN
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "my_oauth.Rdata")

Alternatively, you can create your access token as a list, and streamR will do the handshake automatically:

my_oauth <- list(consumer_key = "CONSUMER_KEY",
  consumer_secret = "CONSUMER_SECRET",
  access_token = "ACCESS_TOKEN",
  access_token_secret = "ACCESS_TOKEN_SECRET")
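
This list can then be passed directly as the oauth argument of the functions described below; a minimal sketch, where the file name and keyword are just placeholders:

# the list is passed as oauth = ...; streamR performs the handshake internally
filterStream("tweets.json", track = "example", timeout = 60, oauth = my_oauth)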

filterStream

filterStream is probably the most useful function. It opens a connection to the Streaming API that will return all tweets containing one or more of the keywords given in the track argument. We can use this function, for instance, to capture public statuses that mention Obama or Biden:

library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
load("my_oauth.Rdata")
filterStream("tweets.json", track = c("Obama", "Biden"), timeout = 120, 
  oauth = my_oauth)
## Loading required package: ROAuth
## Loading required package: digest
## Capturing tweets...
## Connection to Twitter stream was closed after 120 seconds with up to 350 tweets downloaded.
tweets.df <- parseTweets("tweets.json", simplify = TRUE)
## 350 tweets have been parsed.

Note that here I'm connecting to the stream for just two minutes, but ideally the connection would stay open continuously, with some method to handle exceptions and reconnect after errors (see the sketch after the output below). I'm also using OAuth authentication (see the previous section), and storing the tweets in a data frame using the parseTweets function. As I expected, Obama was mentioned more often than Biden at the moment I created this post:

c( length(grep("obama", tweets.df$text, ignore.case = TRUE)),
   length(grep("biden", tweets.df$text, ignore.case = TRUE)) )
## [1] 347  2
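
Here is a minimal sketch of such a collection loop, reusing the same my_oauth credentials; the one-hour timeout and 60-second pause are illustrative choices, not part of the package:

# keep the stream open: reconnect whenever filterStream() returns or errors
while (TRUE) {
  tryCatch(
    filterStream("tweets.json", track = c("Obama", "Biden"),
      timeout = 3600, oauth = my_oauth),
    error = function(e) message("stream error: ", e$message))
  Sys.sleep(60)  # illustrative pause before reconnecting
}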

Tweets can also be filtered by two additional parameters: follow, which can be used to include only tweets published by a subset of Twitter users, and locations, which will return geo-located tweets sent within bounding boxes defined by a set of coordinates. Using these two options involves some additional complications (see the sketch below): for example, the Twitter users need to be specified as a vector of user IDs and not just screen names, and the locations filter is incremental to any keyword in the track argument. For more information, I would suggest checking Twitter's documentation for each parameter.
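
As an illustration, a call using the follow parameter might look like the sketch below; the file name and user IDs are placeholders:

# follow takes numeric user IDs given as strings (placeholder IDs shown here)
filterStream("tweets_follow.json", follow = c("1234567", "7654321"),
  timeout = 60, oauth = my_oauth)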

Here's a quick example of how one would capture and visualize tweets sent from the United States:

filterStream("tweetsUS.json", locations = c(-125, 25, -66, 50), timeout = 300, 
    oauth = my_oauth)
tweets.df <- parseTweets("tweetsUS.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "white", 
    color = "grey20", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) + 
    theme(axis.line = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), 
        axis.title = element_blank(), panel.background = element_blank(), panel.border = element_blank(), 
        panel.grid.major = element_blank(), plot.background = element_blank(), 
        plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) + geom_point(data = points, 
    aes(x = x, y = y), size = 1, alpha = 1/5, color = "darkblue")

[Figure: map of geo-located tweets sent from the United States]

sampleStream

The function sampleStream allows the user to capture a small random sample (around 1%) of all tweets that are being sent at each moment. This can be useful for different purposes, such as estimating variations in “global sentiment” or describing the average Twitter user. A quick analysis of the public statuses captured with this method shows, for example, that the average (active) Twitter user follows around 500 other accounts, that a very small proportion of tweets are geo-located, and that Spanish is the second most common language in which Twitter users set up their interface.

sampleStream("tweetsSample.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("tweetsSample.json", verbose = FALSE)
mean(as.numeric(tweets.df$friends_count))
## [1] 543.5
table(is.na(tweets.df$lat))
## 
## FALSE  TRUE 
##   228 13503
round(sort(table(tweets.df$lang), decreasing = T)[1:5]/sum(table(tweets.df$lang)), 2)
## 
##   en   es   ja   pt   ar 
## 0.57 0.16 0.09 0.07 0.03

userStream

Finally, I have also included the function userStream, which allows the user to capture the tweets they would see in their timeline on twitter.com. As was the case with filterStream, this function allows the user to subset tweets by keyword and location, and to exclude replies between users who are not followed. An example is shown below. Perhaps not surprisingly, many of the accounts I follow use Twitter in Spanish.

userStream("mytweets.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("mytweets.json", verbose = FALSE)
round(sort(table(tweets.df$lang), decreasing = T)[1:3]/sum(table(tweets.df$lang)), 2)
## 
##   en   es   ca 
## 0.62 0.30 0.08

More

In these examples I have used parseTweets to read the captured tweets from the text file where they were saved on disk and store them in a data frame in memory. The tweets can also be stored directly in memory by leaving the file.name argument empty, but my personal preference is to save the raw text, usually in separate files, one for each hour or day. Having the files means I can run UNIX commands to quickly compute the number of tweets in each period, since each tweet is saved on a separate line:

system("wc -l 'tweetsSample.json'", intern = TRUE)
## [1] "   15086 tweetsSample.json"

Concluding...

I hope this package is useful for R users who want to at least play around with this type of data. Future releases of the package will include additional functions to analyze captured tweets, and will improve the existing ones so that they handle errors better. My plan is to keep the GitHub version up to date, fixing any bugs that appear, and to release only major versions to CRAN.

You can contact me at pablo.barbera[at]nyu.edu or via Twitter (@p_barbera) with any questions or suggestions you might have, or to report any bugs in the code.
