• Stars
    star
    152
  • Rank 239,460 (Top 5 %)
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created almost 6 years ago
  • Updated about 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

The code used in the "Don't @ Me: Hunting Twitter Bots at Scale" Black Hat presentation

Twitter Bot Analysis

This project contains the files used in the Black Hat "Don't @ Me: Hunting Twitter Bots at Scale" presentation to discover and aggregate a large sample of Twitter profile and tweet information.

Setup

Account and Tweet Collection

Note: We're working on creating Docker containers to automate this setup - Stay tuned!

The data gathering is split into two scripts: collect.py which collects accounts, and collect_tweets.py which collects tweets from previously gathered accounts.

The first step is to install the requirements:

pip install -r requirements.txt

The next step is to create a Twitter application (detailed instructions coming soon!). After creating the application and generating an access token and secret, you can set the following environment variables:

export TWEEPY_CONSUMER_KEY="your_consumer_key"
export TWEEPY_CONSUMER_SECRET="your_consumer_secret"
export TWEEPY_ACCESS_TOKEN="your_access_token"
export TWEEPY_ACCESS_TOKEN_SECRET="your_access_token_secret"

Once these environment variables are set, you can set up the database. We encourage setting up an instance of MySQL for storing data. Once this is created, you can set the connection string as an environment variable:

export DB_PATH="path_to_mysql"

Finally, you'll want to set up a Redis instance for caching. After this is done, you're ready to start gathering data!

Feature Extraction

The code associated with feature extraction utilizes Spark and is written in Scala; therefore, you'll need to install both. Also, if you'd like to run this code outside of an IDE (e.g., IntelliJ or Eclipse), you'll also need to install sbt.

After that initial setup, you should be able to assemble a jar by merely running assembly from the sbt interactive shell. The output will be twitter-bots.jar.

Usage

Tweet and Account Collection

The first iteration of the tool is designed to make it as easy as possible to control what types of account/tweet enumeration you want to do.

At a high level, there are two types of account enumeration:

  • enum - This process takes --min-id and --max-id flags and looks up accounts in bulk using Twitter's users/lookup API endpoint. If you want a random sampling of accounts between the min and max ID's, you can set the --enum-percentage flag.

  • stream - This process takes a --stream-query parameter if you want to filter the Twitter stream by a keyword. If this isn't provided, the sample endpoint will be used.

This is the usage for the account collection script:

$ python collect.py -h
usage: collect.py [-h] [--max-id MAX_ID] [--min-id MIN_ID] [--db DB]
                  [--enum-percentage ENUM_PERCENTAGE] [--no-stream]
                  [--no-enum] [--stream-query STREAM_QUERY]
                  [--account-filename ACCOUNT_FILENAME] [--stdout]

Enumerate public Twitter profiles and tweets

optional arguments:
  -h, --help            show this help message and exit
  --max-id MAX_ID       Max Twitter ID to use for enumeration
  --min-id MIN_ID       Minimum ID to use for enumeration
  --enum-percentage ENUM_PERCENTAGE, -p ENUM_PERCENTAGE
                        The percentage of 32bit account space to enumerate
                        (0-100).
  --no-stream           Disable the streaming
  --no-enum             Disable the account id enumeration
  --stream-query STREAM_QUERY, -q STREAM_QUERY
                        The query to use when streaming results
  --account-filename ACCOUNT_FILENAME, -af ACCOUNT_FILENAME
                        The filename to store compressed account JSON data
  --stdout              Print JSON to stdout instead of a file

After gathering accounts, you can start gathering tweets. Here's the usage for collect_tweets.py:

$ python collect_tweets.py -h
usage: collect_tweets.py [-h] [--min-tweets MIN_TWEETS]
                         [--tweet-filename TWEET_FILENAME] [--no-lookup]
                         [--stdout]

Enumerate public Twitter tweets from discovered accounts

optional arguments:
  -h, --help            show this help message and exit
  --min-tweets MIN_TWEETS, -mt MIN_TWEETS
                        The minimum number of tweets needed before fetching
                        the tweets
  --tweet-filename TWEET_FILENAME, -tf TWEET_FILENAME
                        The filename to store compressed tweet JSON data
  --no-lookup           Disable looking up users found in tweets
  --stdout              Print JSON to stdout instead of a file

Note on Running the Scripts

The collect.py and collect_tweets.py scripts can be run at the same time. It's recommended to start by running collect.py, since this starts populating the database with accounts. Then, collect_tweets.py can fetch accounts from the database, and grab the tweets.

A good solution for running both the scripts together is through something like screen.

Output Formats

The database created is mainly used as intermediary storage. It holds metadata about accounts, as well as whether or not tweets have been fetched for it.

The real value is in the ndjson files created. The collect.py creates a file, accounts.json.gz that contains the compressed JSON account information with one record on each line. Likewise, the collect_tweets.py script creates a file, tweets.json.gz that contains the compressed JSON tweet information.

Feature Extraction

This is the usage for the feature extraction script:

Performs feature extraction on twitter account and tweet data and saves data to local file system or S3.

Usage: spark-submit --class ExtractionApp twitter-bots.jar --account-data <account_data_path> --tweet-data <tweet_data_path>
       --destination <output_path> --extraction-type [unknown|training]

  -a, --account-data  <arg>      Twitter account data
  -d, --destination  <arg>       Destination for extracted data
  -e, --extraction-type  <arg>   Extracting Cresci data (training) or outputs
                                 from other scripts (unknown)
  -t, --tweet-data  <arg>        Tweet data
      --help                     Show help message

The files that are expected when --extraction-type unknown are JSON for both the account and tweet data. When --extraction-type training, the file formats expected are csv and parquet respectively.

Crawling Social Networks

We've included a script crawl_networks.py that makes it easy to crawl the social network for an account and generate a GEXF file that can imported into a tool like Gephi for visualization.

Basic usage is as simple as python crawl_network.py jw_sec, where jw_sec is the account you want to crawl.

$ python crawl_network.py -h
usage: crawl_network.py [-h] [--graph-file GRAPH_FILE] [--raw RAW]
                        [--degree DEGREE] [--max-connections MAX_CONNECTIONS]
                        user

Crawl a Twitter user's social network

positional arguments:
  user                  User screen name to crawl

optional arguments:
  -h, --help            show this help message and exit
  --graph-file GRAPH_FILE, -g GRAPH_FILE
                        Filename for the output GEXF graph
  --raw RAW, -r RAW     Filename for the raw JSON data
  --degree DEGREE, -d DEGREE
                        Max degree of crawl
  --max-connections MAX_CONNECTIONS, -c MAX_CONNECTIONS
                        Max number of connections per account to crawl

Note: Social networks get large very quickly, and can take quite a while to crawl. We recommend setting the degree and max-connections flags to reasonable values (such as a 1-degree crawl with 1000 max connections per account) depending on your goals and available time.

Helpful Hints

The outputs of the tweet and account collection are very large files if run for an extended period of time; so, make sure that you have a cluster or single machine with the appropriate amount of available memory when running the feature extraction code.

More Repositories

1

cloudmapper

CloudMapper helps you analyze your Amazon Web Services (AWS) environments.
JavaScript
5,845
star
2

webauthn

WebAuthn (FIDO2) server library written in Go
Go
1,023
star
3

parliament

AWS IAM linting library
Python
986
star
4

cloudtracker

CloudTracker helps you find over-privileged IAM users and roles by comparing CloudTrail logs with current IAM policies.
Python
871
star
5

py_webauthn

Pythonic WebAuthn
Python
801
star
6

webauthn.io

The source code for webauthn.io, a demonstration of WebAuthn.
Python
619
star
7

EFIgy

A small client application that uses the Duo Labs EFIgy API to inform you about the state of your Mac EFI firmware
Python
508
star
8

dlint

Dlint is a tool for encouraging best coding practices and helping ensure we're writing secure Python code.
Python
331
star
9

markdown-to-confluence

Syncs Markdown files to Confluence
Python
294
star
10

isthislegit

Dashboard to collect, analyze, and respond to reported phishing emails.
Python
284
star
11

idapython

A collection of IDAPython modules made with 💚 by Duo Labs
Python
274
star
12

chrome-extension-boilerplate

Boilerplate code for a Chrome extension using TypeScript, React, and Webpack.
TypeScript
208
star
13

secret-bridge

Monitors Github for leaked secrets
Python
184
star
14

apple-t2-xpc

Tools to explore the XPC interface of Apple's T2 chip
Python
152
star
15

cloudtrail-partitioner

Python
147
star
16

phish-collect

Python script to hunt phishing kits
Python
136
star
17

phinn

A toolkit to generate an offline Chrome extension to detect phishing attacks using a bespoke convolutional neural network.
JavaScript
129
star
18

xray

X-Ray allows you to scan your Android device for security vulnerabilities that put your device at risk.
Java
122
star
19

android-webauthn-authenticator

A WebAuthn Authenticator for Android leveraging hardware-backed key storage and biometric user verification.
Java
110
star
20

appsec-education

Presentations, training modules, and other education materials from Duo Security's Application Security team.
JavaScript
67
star
21

mysslstrip

CVE-2015-3152 PoC
Python
43
star
22

EFIgy-GUI

A Mac app that uses the Duo Labs EFIgy API to inform you about the state of your EFI firmware.
Objective-C
39
star
23

lookalike-domains

generate lookalike domains using a few simple techniques (homoglyphs, alt TLDs, prefix/suffix)
Python
30
star
24

journal

The boilerplate for a new Journal site
21
star
25

srtgen

Automatic '.srt' subtitle generator
Python
21
star
26

apk2java

Automatically decompile APK's using Docker
Dockerfile
21
star
27

neustar2mmdb

Tool to convert Neustar's GeoPoint data to Maxmind's GeoIP database format for ease of use.
Python
19
star
28

markflow

Make your Markdown sparkle!
Python
18
star
29

narrow

Low-effort reachability analysis for third-party code vulnerabilities.
Python
18
star
30

tutorials

Python
15
star
31

duo-blog-going-passwordless-with-py-webauthn

Python
14
star
32

datasci-ctf

A capture-the-flag exercise based on data analysis challenges
Jupyter Notebook
14
star
33

chain-of-fools

A set of tools that allow researchers to experiment with certificate chain validation issues
Python
13
star
34

sharedsignals

Python tools for using OpenID's Shared Signals Framework (including CAEP)
12
star
35

journal-cli

The command-line client for Journal
Jupyter Notebook
12
star
36

unmasking_data_leaks

The code from the talk "Unmasking Data Leaks: A Guide to Finding, Fixing, and Prevention" given at BSides SATX 2019.
Python
7
star
37

journal-theme

The Hugo theme that powers Journal
HTML
7
star
38

golang-workshop

Source files for a Golang Workshop
Go
5
star
39

journal-docs

The documentation for Journal
2
star
40

dlint-plugin-example

An example plugin for dlint
Python
2
star
41

vimes

A local DNS proxy based on CoreDNS.
Python
2
star
42

twitterbots-wallpapers

Wallpapers created from the crawlers in our "Don't @ Me" technical research paper
1
star
43

holidayhack-2019

Scripts and artifacts used to solve the 2019 SANS Holiday Hack Challenge
Python
1
star