• This repository has been archived on 23/Feb/2022
  • Language
    Python
  • License
    Apache License 2.0
  • Created almost 10 years ago
  • Updated over 8 years ago

Repository Details

Simplest way to get Tweets into BigQuery. Uses Google Cloud & App Engine, as well as Python and D3.

Twitter for BigQuery

This sample code helps you stream Twitter data into BigQuery and run simple visualizations. It also generates queries that you can run directly in the BigQuery interface or extend for your own applications.

Additionally, you can use other public or private datasets in BigQuery to do additional joins and develop other insights/correlations.

Requirements

Setup & Configuration

To work with Google Cloud and BigQuery, follow the instructions below to create a new project and service account, and to get your PEM file.

  • Go to http://console.developers.google.com

  • Click on "Create Project"

  • Open the project dashboard by clicking on the new project

  • Open "APIs & auth->Credentials"

  • Click on "Create new Client ID", "Service account" and "Create Client ID"

  • Note your Service Account email (Under "EMAIL ADDRESS")

  • Generate and store your P12 key (Or save from auto-download)

  • Convert the P12 key to a PEM file with the following:

    cat key.p12 | openssl pkcs12 -nodes -nocerts -passin pass:notasecret | openssl rsa > key.pem

  • Copy config.template to config

  • Fill out the following fields:

  • Run setup.py to generate the appropriate yaml and config files in the image_gnip and image_twitter directories

Loading Twitter data into BigQuery from your local machine

As a pre-requisite for setting up BigQuery, you need to first set up a billing account. To do so:

  • Go to https://console.developers.google.com/billing and add a credit card
  • Back in your project view, click on the gear icon in the top-right and then "Project billing settings"
  • Ensure your project is associated with a billing account

The enclosed sample includes a simple load.py file to stream Tweets directly into BigQuery.

  • Go to http://console.developers.google.com
  • Go to your project
  • On the left-hand side, click on "Big Data->BigQuery" to open the BigQuery console
  • Click on the down arrow by the project, select "Create new dataset" and enter "twitter"
  • Run python load.py to begin loading data from your local machine
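
As a rough picture of what streaming a Tweet involves, each Tweet can be wrapped as one row of a BigQuery tabledata.insertAll request. The sketch below is not the code from load.py: build_insert_all_body is a hypothetical helper, and reusing the Tweet's id_str as the insertId is an assumption, made so that BigQuery's best-effort de-duplication can drop rows that are sent twice.

```python
import json


def build_insert_all_body(tweets):
    """Build a request body for BigQuery's tabledata.insertAll method.

    Hypothetical helper, not taken from load.py. Each Tweet dict becomes
    one row. The Tweet's id_str (an assumption about the input) is reused
    as the insertId so that retried rows can be de-duplicated.
    """
    rows = []
    for tweet in tweets:
        rows.append({"insertId": tweet["id_str"], "json": tweet})
    return {"kind": "bigquery#tableDataInsertAllRequest", "rows": rows}


if __name__ == "__main__":
    sample = [{"id_str": "1", "text": "hello"}, {"id_str": "2", "text": "world"}]
    print(json.dumps(build_insert_all_body(sample), indent=2))
```

Sending the body to BigQuery (authentication, project, dataset and table names) is left out of the sketch.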

When developing on top of the Twitter platform, you must abide by the Developer Agreement & Policy.

Most notably, you must respect the section entitled "Maintain the Integrity of Twitter's Products", including removing all relevant Content with regard to unfavorites, deletes and other user actions.
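
One way to keep track of deletes is to separate delete notices from Tweets as they arrive. The sketch below assumes the public streaming API message format, in which a delete notice carries the Tweet ID under delete.status; split_stream_messages is a hypothetical helper and is not part of this sample.

```python
def split_stream_messages(messages):
    """Separate Tweets from delete notices in a list of stream messages.

    Assumes the public streaming API format, where a delete notice looks
    like {"delete": {"status": {"id_str": "...", "user_id_str": "..."}}}.
    Returns a (tweets, deleted_ids) tuple.
    """
    tweets, deleted_ids = [], []
    for message in messages:
        if "delete" in message:
            deleted_ids.append(message["delete"]["status"]["id_str"])
        elif "id_str" in message:
            tweets.append(message)
        # Other notices (limit, warning, and so on) are ignored in this sketch.
    return tweets, deleted_ids
```

The collected IDs would then need to be removed from the stored data. How that removal is done is outside the scope of this sketch.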

Loading Twitter data into BigQuery from Google Compute Engine

To help simplify your setup, this project is designed to use:

The Dockerfile describes the required libraries and packaging for the container. The steps below create your own container and deploy it to Google Compute Engine.

# start docker locally
boot2docker start
$(boot2docker shellinit)

# build and run docker image locally
docker build -t gcr.io/twitter_for_bigquery/image .
docker run -i -t gcr.io/twitter_for_bigquery/image

# push to Google Cloud container registry
gcloud preview docker push gcr.io/twitter_for_bigquery/image

# create an instance with the docker container
gcloud compute instances create examplecontainervm01 \
    --image container-vm \
    --metadata-from-file google-container-manifest=./container.yaml \
    --zone us-central1-b \
    --machine-type n1-highcpu-2
    
# log into the new instance
gcloud compute instances list
gcloud compute --project "twitter-for-bigquery" ssh --zone "us-central1-b" "examplecontainervm01" 

# pull the container and run it in docker 
sudo docker pull gcr.io/twitter_for_bigquery/image
sudo docker run -d gcr.io/twitter_for_bigquery/image

# view logs to confirm it's running
sudo -s
sudo docker ps
sudo docker logs --follow=true 5d

More notes for Docker + Google Cloud:

Running the app

Running locally

From the command line, you can use dev_appserver.py to run your local server. You'll need to specify your service account and private key file on the command line, as follows:

dev_appserver.py . --appidentity_email_address="[email protected]" --appidentity_private_key_path=/PATH/TO/key.pem

Once this is complete, open your browser to http://localhost:8080.

Deploying on Google App Engine

To run in Google App Engine, do the following:

  • Update app.yaml with the project name pointing to your project.
  • Open the GAE Launcher.
  • Click on "File->New Application".
  • Specify the application ID (twitter-for-bigquery) and application directory (path where twitter-for-bigquery project exists).
  • Click "Save".
  • Select the Application in the list and click on "Edit->Application Settings".

  • In the "Extra Flags" section, add the command line flags, as above:

    --appidentity_email_address="[email protected]" --appidentity_private_key_path=/PATH_TO/key.pem

To confirm the deploy worked, you can do the following to view the logs:

Querying and loading large sets of tweets onto BigQuery

If you need large amounts of past tweets loaded into BigQuery, you will need to use Gnip's Historical PowerTrack. The best way to load large amounts of tweets is:

  • Use the Gnip Python Historical Utilities library to run an async PowerTrack job and download the data.
  • Run the included batch.py file to process each gzip file and load it into BigQuery.

When running the above processing, choose an environment that is optimized for network performance, as you may be downloading multiple GB of files onto your server and then onto BigQuery.
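
The processing step can be sketched as reading each downloaded gzip file line by line and grouping the records into batches before loading. This is not the code from batch.py: read_tweets and in_batches are hypothetical helpers, and the sketch assumes each file holds one JSON record per line.

```python
import gzip
import json


def read_tweets(path):
    """Yield one parsed record per non-empty line of a gzip file.

    Assumes newline-delimited JSON, which is an assumption about the
    downloaded files rather than something this README states.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def in_batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Reading lazily keeps memory use flat even when a file is several GB, and batching reduces the number of requests made to BigQuery.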

The schema

The load.py file takes tweets and loads them one by one into BigQuery. Some basic scrubbing of the data is done to simplify the dataset. (For more information, view the Utils.scrub() function.) Additionally, JSON files are provided in /schema as samples of the data formats received from Gnip/Twitter and stored in BigQuery.
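
To give a feel for what scrubbing can look like, the sketch below drops null values and empty containers from a nested Tweet. It is illustrative only: drop_empty_fields is a hypothetical function, and the rules actually applied by Utils.scrub() may differ.

```python
def _is_empty(value):
    """Return True for None and for empty dicts or lists."""
    return value is None or value == {} or value == []


def drop_empty_fields(value):
    """Recursively remove None values and empty containers.

    Hypothetical stand-in for a scrubbing step. Zero, False and empty
    strings are kept, because they may be meaningful field values.
    """
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = drop_empty_fields(item)
            if not _is_empty(item):
                cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        cleaned_items = [drop_empty_fields(item) for item in value]
        return [item for item in cleaned_items if not _is_empty(item)]
    return value
```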

Sample queries

To help you get started, below are some sample queries.

Text search

Querying for tweets that contain a specific word or phrase.

SELECT text FROM [twitter.tweets] WHERE text CONTAINS ' something ' LIMIT 10

Hashtag search

Searching for specific hashtags.

SELECT entities.hashtags.text, HOUR(TIMESTAMP(created_at)) AS create_hour, count(*) as count FROM [twitter.tweets] WHERE LOWER(entities.hashtags.text) in ('john', 'paul', 'george', 'ringo') GROUP by create_hour, entities.hashtags.text ORDER BY entities.hashtags.text ASC, create_hour ASC
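
A query like the one above could also be generated from a list of hashtags. Because the WHERE clause compares against LOWER(entities.hashtags.text), the values in the list must be lowercase to match, so the sketch lowercases them. build_hashtag_query is a hypothetical helper and is not part of this sample.

```python
import re


def build_hashtag_query(hashtags, table="twitter.tweets"):
    """Return a legacy SQL query counting the given hashtags per hour.

    Hypothetical helper. Tags are lowercased, and anything other than
    letters, digits and underscores is rejected, which keeps quotes and
    other SQL syntax out of the generated query.
    """
    cleaned = []
    for tag in hashtags:
        tag = tag.lstrip("#").lower()
        if not re.fullmatch(r"\w+", tag):
            raise ValueError("not a valid hashtag: %r" % tag)
        cleaned.append("'%s'" % tag)
    if not cleaned:
        raise ValueError("at least one hashtag is required")
    return (
        "SELECT entities.hashtags.text, HOUR(TIMESTAMP(created_at)) AS create_hour, "
        "count(*) as count FROM [%s] "
        "WHERE LOWER(entities.hashtags.text) in (%s) "
        "GROUP by create_hour, entities.hashtags.text "
        "ORDER BY entities.hashtags.text ASC, create_hour ASC"
        % (table, ", ".join(cleaned))
    )
```

For example, build_hashtag_query(["#John", "Paul"]) produces a query whose filter reads in ('john', 'paul').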

Tweet source

Listing the most popular Twitter applications.

SELECT source, count(*) as count FROM [twitter.tweets] GROUP by source ORDER BY count DESC LIMIT 1000

Media/URLs shared

Finding the most popular content shared on Twitter.

SELECT text, entities.urls.url FROM [twitter.tweets] WHERE entities.urls.url IS NOT NULL LIMIT 10

User activity

Users that tweet the most.

SELECT user.screen_name, count(*) as count FROM [twitter.tweets] GROUP BY user.screen_name ORDER BY count DESC LIMIT 10

To learn more about querying, go to https://cloud.google.com/bigquery/query-reference

Going further

Using BigQuery allows you to combine Twitter data with other public sources of information. Here are some ideas to inspire your next project:

  • Perform and store sentiment analysis on tweet text for worldwide sentiment
  • Cross reference Twitter data to other public data sets

You can also visit http://demo.redash.io/ to perform queries and visualizations against publicly available data sources.

FAQ

When deploying to AppEngine, I'm getting the error "This application does not exist (app_id=u'twitter-for-bigquery')"

You will want to create your own app_id in app.yaml. If that does not work, then per this thread (http://stackoverflow.com/questions/10407955/google-app-engine-this-application-does-not-exist), try the following:

    rm .appcfg_oauth2_tokens

My TaskQueue entries die unexpectedly/only run for 10 minutes/get a DeadlineExceededError.

The default Google AppEngine TaskQueue (named 'default') has a limit of 10 minutes for any task. To run a task for longer, you need to set up a custom task queue and a backend server. The instructions are above, but the basics include:

  • Ensure the queues.xml file (which defines a new queue named 'backfill') is uploaded to AppEngine.
  • Ensure a background app is created using the appcfg.py update app.yaml backfill.yaml command to start both the main app and the background app.

I am getting 'Process terminated due to exceeding quotas.' errors in my log console/'This application is temporarily over its serving quota. Please try again later.' when accessing my backend server.

Google AppEngine has usage quotas to regulate billing and usage. You can read about the quotas for various products here:

https://cloud.google.com/appengine/docs/quotas#When_a_Resource_is_Depleted

To increase quota limits, you can go into Compute->App Engine->Settings and edit your daily budget to allow for increased usage.

https://console.developers.google.com/project/YOUR_PROJECT_NAME/appengine/settings

Additional reading

The following documents serve as additional information on streaming data from Twitter and working with BigQuery.

Credits

The following developers and bloggers have aided greatly in the development of this source. I'm appreciative of contributions and knowledge sharing.

TODO

  • One Pager
  • FAQ
  • Easier to deploy full stack
    • environment settings
    • container deploy script
  • Figure out location, specifically don't use Utils.scrub()
  • Admin save/config page + deploy of service?
