smappR Package: Tools for analysis of Twitter data

This package provides a series of functions that allow lab members to access the Mongo database of tweets on the lab computer and easily compute summary statistics. It also includes a function that implements the ideology estimation method developed in Barberá (2013), as well as methods to capture lists of friends and followers using multiple OAuth tokens.
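
As a quick illustration, the sketch below strings together a few of the functions documented in the recipes that follow; the collection name, OAuth folder, and screen name are placeholders, and it assumes an open connection to the lab Mongo DB (see section C).

# a minimal sketch using functions documented below; "DATABASE.COLLECTION" and
# the oauth_folder path are placeholders for your own settings
library(smappR)
set <- "DATABASE.COLLECTION"                         # Mongo collection of tweets
count.tweets(set, string="shutdown")                 # summary statistics
friends <- getFriends(screen_name="p_barbera",       # friends list, using the
    oauth_folder="~/Dropbox/credentials")            # OAuth tokens in this folder
results <- estimate.ideology("p_barbera", friends)   # ideology estimate (Barberá 2013)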

Click here for a summary of tools developed at the SMaPP lab.

Code examples

This document is intended as a "cookbook" that provides solutions to common tasks and problems in analyzing social media data collected by the lab.

The first three steps are:

1 - Install the R packages necessary to do the analysis

2 - Create your own Twitter OAuth token (only necessary if you plan to collect your own data)

3 - Connect to the lab Mongo DB server

The rest of the document offers "recipes" for the following situations:

1 - How can I count the number of tweets that match a set of conditions in a collection?

2 - How can I extract a data set of tweets that mention a keyword?

3 - How can I find the most retweeted tweets in a collection?

4 - How can I visualize tweet volume in a collection?

5 - How can I count words in tweets and prepare a word cloud?

6 - How can I download all the tweets, followers or friends of any Twitter user?

7 - How can I download profile information about multiple users and parse their location into geographic coordinates?

8 - How can I estimate the ideology of a Twitter user?

9 - How can I visualize a network of retweets?

10 - How can I start my own collection of tweets?

11 - How can I save tweets I collected on my own to the lab machine?

12 - How can I collect public Facebook data?

A. Installing R packages

The following block of code will install all packages that you need in order to run the code examples below. Most of them are available on CRAN. Rfacebook and streamR, the packages to capture Facebook and Twitter data, are also available on CRAN, but we will install the most recent version (on GitHub). Finally, we will also install our lab's R package.

doInstall <- TRUE  # Change to FALSE if you don't want packages installed.
toInstall <- c("ROAuth", "igraph", "ggplot2", "wordcloud", "devtools", "tm",
    "R2WinBUGS", "rmongodb", "scales")
if(doInstall){
    install.packages(toInstall, repos = "http://cran.r-project.org")
    library(devtools)
    # R packages to get Twitter and Facebook data (development versions on GitHub);
    # current versions of devtools take the "username/repo" form of install_github()
    install_github("pablobarbera/streamR", subdir="streamR")
    install_github("pablobarbera/Rfacebook", subdir="Rfacebook")
    # smappR package
    install_github("SMAPPNYU/smappR")
}

B. Creating your own Twitter OAuth token

We will use this token to connect to the Twitter API and download data. Follow these steps:

  • Go to apps.twitter.com/ and sign in
  • Click on "Create New App"
  • Fill in the name, description, and website (it can be anything, even google.com)
  • Leave "Callback URL" empty
  • Agree to the user conditions and enter the captcha.
  • Copy the consumer key and consumer secret and paste them below.
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "XXXXXXXXXXXX"
consumerSecret <- "YYYYYYYYYYYYYYYYYYY"
my_oauth <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, 
    requestURL=requestURL, accessURL=accessURL, authURL=authURL)
  • Run this line and go to the URL that appears on screen
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
  • Copy and paste the PIN number (6 digits) into the R console
  • Change the working directory to the folder where you will save your tokens
setwd("~/Dropbox/credentials/")
  • Now you can save the OAuth token for use in future R sessions
save(my_oauth, file="my_oauth")

C. Connecting to the SMaPP lab server

Before running any of the following sections, we need to load the smappR library and connect to the Mongo DB on the lab server. (Replace all the words in capital letters with the values provided by the lab.)

library(smappR)
mongo <- mongo.create("MONGO_HOST:PORT", db="DATABASE")
mongo.authenticate(mongo, username='USERNAME', password='PASSWORD', db="DATABASE")
set <- "DATABASE.COLLECTION"

1. How can I count the number of tweets that match a set of conditions in a collection?

# how many tweets have we captured so far?
count.tweets(set)

# how many tweets mention the shutdown?
count.tweets(set, string="shutdown")

# how many tweets mention the shutdown OR obamacare?
count.tweets(set, string=c("shutdown", "obamacare"))

# how many tweets (not including retweets) mention obamacare?
count.tweets(set, string="obamacare", retweets=FALSE)

# how many tweets were published between December 10 and December 20?
count.tweets(set, from="2013-12-10 00:00:00", to="2013-12-20 23:59:59")

# how many tweets use a hashtag?
count.tweets(set, hashtags=TRUE)

# how many tweets were sent by barack obama?
count.tweets(set, screen_name="barackobama")

# for more info:
?count.tweets

2. How can I extract a data set of tweets that mention a keyword?

Assuming that we're still connected to the Mongo DB (see section C above):

# how can I extract all tweets that mention "drone" in 2013?
tweets <- extract.tweets(set, string="drone", from="2013-01-01 00:00:00", to="2013-12-31 23:59:59")

# how can I see the first few tweets that were extracted?
df <- tweetsToDF(tweets) ## transforming to data frame
head(df)

# how can I see who has tweeted more about drones?
head(sort(table(df$user.screen_name), dec=TRUE))

# how can I save them to .csv?
write.csv(df, "drone_tweets.csv", row.names=F)

# how can I save to Stata?
library(foreign)
write.dta(df, "drone_tweets.dta")

# what if I also want the retweet count and the coordinates of each tweet?
tweets <- extract.tweets(set, string="drone",  fields = c("created_at", "user.screen_name", 
    "text", "retweeted_status.retweet_count", "geo.coordinates"))

# what fields can I include here?
?extract.tweets

# how can I extract a random sample of 1,000 tweets from the collection?
tweets <- extract.tweets(set, size=1000)

# how can I extract a random sample of 1,000 tweets sent on February 20?
tweets <- extract.tweets(set, size=1000, from="2014-02-20 00:00:00", to="2014-02-20 23:59:59")

# how can I extract a random sample of 100 tweets that mention obamacare?
tweets <- extract.tweets(set, size=100, string="obamacare")

# how can I extract a random sample of 100 tweets that mention obamacare and
# are NOT retweets?
tweets <- extract.tweets(set, size=100, string="obamacare", retweets=FALSE)

# how can I extract a random sample of 0.5% of the entire collection?
tweets <- extract.tweets(set, size=0.005)

# how can I extract tweets that were sent by barack obama?
tweets <- extract.tweets(set, screen_name="barackobama")

# more info:
?extract.tweets

3. How can I find the most retweeted tweets in a collection?

Assuming that we're still connected to the Mongo DB (see section C above):

## how can I extract all tweets with more than 5000 retweets?
rts <- extract.retweets(set, min=5000)

## how can I find the 10 most retweeted?
summary(rts, n=10)

## how can I find the most retweeted tweet that mentions drones?
rts <- extract.retweets(set, min=10, string="drone")
summary(rts, n=1)

## how can I find the most retweeted tweet on February 20?
rts <- extract.retweets(set, min=5000, from="2014-02-20 00:00:00", to="2014-02-20 23:59:59")
summary(rts, n=1)

# BONUS:
## how can I find the most used hashtags in a collection?
ht <- extract.hashtags(set)
summary(ht, n=10)

## what hashtags are used in tweets that mention obamacare?
ht <- extract.hashtags(set, string="obamacare")
summary(ht, n=10)

## what were the 10 most common hashtags used on February 20?
ht <- extract.hashtags(set, from="2014-02-20 00:00:00", to="2014-02-20 23:59:59")
summary(ht, n=10)

## for more info:
?extract.retweets
?extract.hashtags

4. How can I visualize tweet volume in a collection?

Assuming that we're still connected to the Mongo DB (see section C above):

# how can I download the most recent tweets in a collection?
tweets <- extract.recent.tweets(set, fields=c('lang', 'created_at'), limit=1000000)

# and visualize volume by hour
plot(tweets, breaks="hours")

# or per day
plot(tweets, breaks="days")

# how can I extract all tweets that mention "drone"?
tweets <- extract.tweets(set, string="drone")

# how can I look at their distribution over time?
plot(tweets, breaks="days")

# how can I look at the distribution of tweets over a certain day?
tweets <- extract.tweets(set, from="2014-02-20 00:00:00", to="2014-02-20 23:59:59")
plot(tweets, breaks="hours")
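
If you prefer to build the plot yourself with ggplot2, the sketch below bins tweets by day; it assumes that formatTwDate() (see ?formatTwDate) parses the raw created_at strings into R date-times, so check its help page before relying on it.

# a ggplot2 alternative to the built-in plot method;
# assumes formatTwDate() converts created_at strings to date-times
library(ggplot2)
df <- tweetsToDF(tweets)
df$date <- as.Date(formatTwDate(df$created_at))
ggplot(df, aes(x=date)) + geom_bar() +
    scale_x_date() + ylab("tweets per day")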

5. How can I prepare a word cloud?

# how can I prepare a word cloud of recent tweets?
tweets <- extract.recent.tweets(set, limit=5000)
wordFreq <- word.frequencies(tweets$text) ## word counts

library(wordcloud)
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)

# how can I prepare a word cloud of tweets from a given day?
tweets <- extract.tweets(set, from="2014-02-20 00:00:00", to="2014-02-20 23:59:59")
df <- tweetsToDF(tweets)

wordFreq <- word.frequencies(df$text) ## word counts
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)

# another example, for tweets mentioning drones
tweets <- extract.tweets(set, string="drone")
wordFreq <- word.frequencies(tweets$text, stopwords=c("amp", "cant", "drones"))
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)

# word cloud for recent tweets of any user
getTimeline(screen_name = "p_barbera",
    filename = "pablo_tweets.json", # where tweets will be stored
    n=500, ## number of tweets to download (max is 3,200)
    oauth_folder = "~/Dropbox/credentials" )

library(streamR)
tweets <- parseTweets("pablo_tweets.json") ## reading tweets in R

wordFreq <- word.frequencies(tweets$text)

wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)

wordFreq <- word.frequencies(tweets$text, 
        stopwords=c("que", "@p_barbera"))
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)

# for more info, again:
?getTimeline
?formatTwDate
?extract.recent.tweets

6. How can I download all the tweets, followers or friends of any Twitter user?

# downloading recent tweets sent by a user
getTimeline(screen_name = "p_barbera",
    filename = "pablo_tweets.json", # where tweets will be stored
    n=500, ## number of tweets to download (max is 3,200)
    oauth_folder = "~/Dropbox/credentials" )

# downloading list of followers of a given user
getFollowers(screen_name = "p_barbera", oauth_folder="~/Dropbox/credentials")

# downloading list of friends of a given user
getFriends(screen_name = "p_barbera", oauth_folder="~/Dropbox/credentials")

# BONUS: count the most common hashtags sent by a user
tweets <- parseTweets("pablo_tweets.json")
ht <- extract.hashtags(text=tweets$text)
summary(ht, n=10)

7. How can I download profile information about multiple users and parse their location into geographic coordinates?

# downloading information about multiple users
users.data <- getUsers(screen_names=c("p_barbera", "j_a_tucker", "smapp_nyu"),
    oauth_folder = "~/Dropbox/credentials")

# parsing locations into coordinates
locations <- list()
for (u in 1:length(users.data)){
    locations[[u]] <- getGeo(users.data[[u]][['location']])
}
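
If getGeo() returns a latitude/longitude pair for each user (an assumption worth checking against ?getGeo), the list can be collapsed into a single data frame:

# combine the geocoded locations into one data frame
# (assumes each element of 'locations' is a numeric lat/long vector)
geo <- data.frame(screen_name = c("p_barbera", "j_a_tucker", "smapp_nyu"),
    do.call(rbind, locations))
geo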

8. How can I estimate the ideology of a Twitter user?

# step 1: downloading list of friends for a user
user <- "p_barbera"
friends <- getFriends(screen_name=user, oauth_folder="~/Dropbox/credentials/")

# step 2: estimate ideology
results <- estimate.ideology(user, friends)

# display trace plot to check convergence
traceplot.ideology(results)

# comparing with other ideology estimates
ideology.plot(results)

# downloading list of followers of a given user
followers <- getFollowers(screen_name=user, oauth_folder="~/Dropbox/credentials/")

9. How can I visualize a network of retweets?

# 1) download recent retweets, with author of retweet and author of the
# original tweet
tweets <- extract.recent.tweets(set, limit=1000, 
    fields=c("user.id_str", "retweeted_status.user.id_str"),
    retweets=TRUE)

# 2) convert to data frame
df <- tweetsToDF(tweets)
names(df) <- c("Source", "Target")

# 3) preparing data for network visualization
library(igraph)
g <- graph.edgelist(as.matrix(df), directed=TRUE) ## network object
set.seed(12345)
l <-  layout.fruchterman.reingold(g, niter=2000, coolexp=20) ## layout
d <- data.frame(l); names(d) <- c("x", "y") ## data frame for plot
d$id <- V(g)$name ## adding names
d$degree <- degree(g)
d <- merge(d, data, all.x=TRUE) ## adding party ('data' is assumed to be a data
                                ## frame with an "id" column of user id strings
                                ## and a "party" column for each user)

edgelist <- get.edgelist(g, names=FALSE) ## edgelist
edges <- data.frame(d[edgelist[,1],c("x", "y")], d[edgelist[,2],c("x", "y")])
names(edges) <- c("x1", "y1", "x2", "y2") ## coordinates of each edge
edges$party <- NA

# 4) creating plot with ggplot2

library(ggplot2)
p <- ggplot(d, aes(x=x, y=y, fill=party, size=degree))
pq <- p + geom_segment(
        aes(x=x1, y=y1, xend=x2, yend=y2), 
        data=edges, color="grey20", size=0.25, alpha=1/10) +
        geom_point(aes(size=degree), color="grey20", shape=21) +
        scale_size_continuous(range=c(2,7)) +
        scale_fill_manual(values=c("blue", "green", "red")) + 
        theme(
            panel.background = element_blank(),
            plot.background = element_blank(),
            axis.line = element_blank(), axis.text = element_blank(), 
            axis.ticks = element_blank(), 
            axis.title = element_blank(), panel.border = element_blank(), 
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank()
            )
pq


# 5) exporting data to file to work with Gephi
write.csv(df, file="data/edgelist_congress.csv", row.names=F)

10. How can I start my own collection of tweets?

# loading library and OAuth token
library(streamR)
load("~/Dropbox/credentials/my_oauth")

# How can I collect all tweets that mention certain keywords?
filterStream(file.name="obama_tweets.json", track=c("obama", "romney"), oauth=my_oauth,
    tweets=50)

# How can I then see the tweets I collected?
tweets <- parseTweets("obama_tweets.json")

# a few quick analyses of the data
table(tweets$lang) ## distribution by language
sum(!is.na(tweets$lat)) ## how many are geolocated
summary(tweets$retweet_count) ## how many RTs they have
tweets$text[which.max(tweets$retweet_count)] ## most RTed tweet

# How can I collect a random sample of tweets?
sampleStream(file.name="random_tweets.json", oauth=my_oauth,
    tweets=100)

# same quick analysis
tweets <- parseTweets("random_tweets.json")
table(tweets$lang) ## distribution by language
sum(!is.na(tweets$lat)) ## how many are geolocated
summary(tweets$retweet_count) ## how many RTs they have
tweets$text[which.max(tweets$retweet_count)] ## most RTed tweet

# How can I collect geolocated tweets?
filterStream(file.name="geo_tweets.json", oauth=my_oauth,
    locations=c(-180,-90,180,90), tweets=100)

11. How can I save tweets I collected on my own to the lab machine?

# Assume we want to save the geolocated tweets we just captured
tweetsToMongo(file.name="geo_tweets.json", ns="tweets.geotweets",
    host="HOST", username="USERNAME", password="PASSWORD")

# this will dump the tweets onto the lab machine, adding the extra fields
# necessary to get random samples, quick sampling by date, etc.
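
To confirm that the upload worked, you can connect to the new collection and count the tweets with the same functions used in sections C and 1; the host and credentials below are placeholders.

# verify the upload by counting tweets in the new collection
# (HOST, USERNAME, PASSWORD are placeholders, as above)
mongo <- mongo.create("HOST", db="tweets")
mongo.authenticate(mongo, username='USERNAME', password='PASSWORD', db="tweets")
count.tweets("tweets.geotweets")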

12. How can I collect public Facebook data?

# loading library
library(Rfacebook)

# adding your OAuth token
# 1) go to https://developers.facebook.com/tools/explorer/
# 2) click on "Get Access Token"
# 3) copy and paste below:
token <- "XXXXXX"

###############################################################################
## PART 1: COLLECTING PUBLIC FACEBOOK POSTS
###############################################################################

# How can I collect public Facebook posts that mention a given keyword?
posts <- searchFacebook("obama", token, n=100)

# How can I see the content of these posts?
posts[1,]

# Which of these posts got the most likes?
posts[which.max(posts$likes_count),]

# How can I get more information about the users who posted about Obama?
users <- getUsers(posts$from_id, token)

# What information can I get about these users?
users[1,]

## NOTE: this information is available for ALL users, not only those who
## post stuff publicly
getUsers("pablobarbera", token)

# From what country are users posting about Obama?
countries <- substr(users$locale, 4, 5)
table(countries)

# What is the gender of users posting about Obama?
table(users$gender, exclude=NULL) ## NAs correspond to 'pages' (not users)

# In what language are users posting about Obama?
languages <- substr(users$locale, 1, 2)
table(languages)

# How can I collect old public Facebook posts?
posts <- searchFacebook("shutdown", token, n=1000, since="17 october 2013 02:00:00",
    until="17 october 2013 03:00:00")  ## usually pretty slow
# (note that all times are in GMT, so this would be 9pm to 10pm DC time)
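
Since timestamps are in GMT, you may want to convert them to local time before plotting or aggregating; the sketch below assumes created_time follows the Graph API's "2013-10-17T02:00:00+0000" format, so adjust the format string if your version of Rfacebook returns something different.

# convert post timestamps from GMT to US Eastern time
# (assumes the "%Y-%m-%dT%H:%M:%S+0000" timestamp format)
created <- as.POSIXct(posts$created_time, format="%Y-%m-%dT%H:%M:%S+0000", tz="GMT")
posts$created_et <- format(created, tz="America/New_York", usetz=TRUE)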

# Which of these posts got the most likes?
posts[which.max(posts$likes_count),]

# How can I generate a word cloud of the content of these posts?
library(smappR)
library(wordcloud)
wordFreq <- word.frequencies(posts$message)
wordcloud(words=names(wordFreq), freq=wordFreq, max.words=50, 
    random.order=F, colors="black", scale=c(2.5,.5), rot.per=0)


# more details:
?searchFacebook
?getUsers

###############################################################################
## PART 2: SCRAPING INFORMATION FROM FACEBOOK PAGES
###############################################################################

# How can I get a list of posts from a Facebook page?
page <- getPage("barackobama", token, n=100) ## ~100 most recent posts

# What information is available for each of these posts?
page[1,]

# Which post got the most likes?
page[which.max(page$likes_count),]

# Which post got the most comments?
page[which.max(page$comments_count),]

# Which post was shared the most?
page[which.max(page$shares_count),]

# How can I get a list of users who liked a specific post?
post <- getPost(page$id[1], token, n=2000) # first post, get 2000 likes/comments

# How can I see the gender, country, and language of these users?
users <- getUsers(post$likes$from_id, token)
countries <- substr(users$locale, 4, 5)
table(countries)
table(users$gender, exclude=NULL)
languages <- substr(users$locale, 1, 2)
table(languages)

# What are the most common first names?
head(sort(table(users$first_name), dec=TRUE), n=10)

# Compare these results to data from Sarah Palin's Facebook page...
page <- getPage("sarahpalin", token, n=100)
page[which.max(page$comments_count),] # most "commented" post
post <- getPost(page$id[1], token, n=2000, comments=FALSE, likes=TRUE)
users <- getUsers(post$likes$from_id, token) # user info for 2,000 users who
                                             # liked most recent post
countries <- substr(users$locale, 4, 5)
table(countries)
table(users$gender, exclude=NULL)
languages <- substr(users$locale, 1, 2)
table(languages)
