Badger Sett
A sett or set is a badger's den which usually consists of a network of tunnels and numerous entrances. Setts incorporate larger chambers used for sleeping or rearing young.
This script is designed to raise young Privacy Badgers by teaching them
about the trackers on popular sites. Every day, crawler.py
visits thousands of the top sites from the Tranco List with the latest version of Privacy Badger, and saves its findings in results.json
.
See the following EFF.org blog post for more information: Giving Privacy Badger a Jump Start.
Setup
-
Prerequisites: have docker installed. Make sure your user is part of the
docker
group so that you can build and run docker images withoutsudo
. You can add yourself to the group with$ sudo usermod -aG docker $USER
-
Clone the repository
$ git clone https://github.com/efforg/badger-sett
-
Run a scan
$ ./runscan.sh
This will run a scan with the latest version of Privacy Badger's master branch and won't commit the results.
To run the script with a different branch of privacy badger, set the
PB_BRANCH
variable. e.g.$ PB_BRANCH=my-feature-branch ./runscan.sh
You can also pass arguments to
crawler.py
, the python script that does the actual crawl. Any arguments passed torunscan.sh
will be forwarded tocrawler.py
. To control the number of sites that the crawler visits, use the--num-sites
argument (the default is 2000). For example:$ ./runscan.sh --num-sites 10
To exclude any sites with a given top level domain from the scan, pass in the
--exclude
argument followed by the TLD suffix you want to exclude. For example, if you wanted to exclude all sites with a .gov TLD:$ ./runscan.sh --exclude .gov
To exclude multiple TLDs from a scan, pass in each TLD separated by a comma, with no space between. For example, if you wanted to exclude all sites with .org and .net TLDs:
$ ./runscan.sh --exclude .org,.net
You can load another extension to run in parallel to Privacy Badger during a scan. Use the
--load-extension
flag and pass along the filepath for the.crx
or.xpi
file that you want to load. For example:$ ./runscan.sh --load-extension parallel-extensions/ublock.crx
-
Monitor the scan
To have the scan print verbose output about which sites it's visiting, use the
--log-stdout
argument.If you don't use that argument, all output will still be logged to
docker-out/log.txt
, beginning after the script outputs "Running scan in Docker..."
Automatic crawling
To set up the script to run periodically and automatically update the repository with its results:
-
Create a new ssh key with
ssh-keygen
. Give it a name unique to the repository.$ ssh-keygen Generating public/private rsa key pair. Enter file in which to save the key (/home/USER/.ssh/id_rsa): /home/USER/.ssh/id_rsa_badger_sett
-
Add the new key as a deploy key with R/W access to the repo on Github. https://developer.github.com/v3/guides/managing-deploy-keys/
-
Add a SSH host alias for Github that uses the new key pair. Create or open
~/.ssh/config
and add the following:Host github-badger-sett HostName github.com User git IdentityFile /home/USER/.ssh/id_rsa_badger_sett
-
Configure git to connect to the remote over SSH. Edit
.git/config
:[remote "origin"] url = ssh://git@github-badger-sett:/efforg/badger-sett
This will have
git
connect to the remote using the new SSH keys by default. -
Create a cron job to call
runscan.sh
once a day. Set the environment variableRUN_BY_CRON=1
to turn off TTY forwarding todocker run
(which would break the script in cron), and setGIT_PUSH=1
to have the script automatically commit and pushresults.json
when the scan finishes. Here's an examplecrontab
entry:0 0 * * * RUN_BY_CRON=1 GIT_PUSH=1 /home/USER/badger-sett/runscan.sh
-
If everything has been set up correctly, the script should push a new version of
results.json
after each crawl. Soon, whenever youmake
a new version of Privacy Badger, it will pull the latest version of the crawler's data and ship it with the new version of the extension.