A web privacy measurement framework

OpenWPM

OpenWPM is a web privacy measurement framework which makes it easy to collect data for privacy studies on a scale of thousands to millions of websites. OpenWPM is built on top of Firefox, with automation provided by Selenium. It includes several hooks for data collection. Check out the instrumentation section below for more details.

Installation

OpenWPM is tested on Ubuntu 18.04 via GitHub Actions and is commonly used via the Docker container that this repo builds, which is also based on Ubuntu. Although we don't officially support other platforms, mamba is a cross-platform utility, and the install script can be expected to work on OSX and other Linux distributions.

OpenWPM does not support Windows: #503

Pre-requisites

The main pre-requisite for OpenWPM is mamba, a fast cross-platform package management tool.

Mamba is open-source, and can be installed from https://mamba.readthedocs.io/en/latest/installation.html.

Mamba is a reimplementation of conda, so sometimes a conda command has to be invoked instead of the mamba one.

Install

An installation script, install.sh, is included. It installs the conda environment, installs unbranded Firefox, and builds the instrumentation extension.

All installation is confined to your conda environment and should not affect your machine. The installation script will, however, override any existing conda environment named openwpm.

To run the install script, run

./install.sh

After running the install script, activate your conda environment by running:

conda activate openwpm

Mac OSX

You may need to install make / gcc in order to build the extension. The necessary packages are part of xcode: xcode-select --install

We do not run CI tests for Mac, so new issues may arise. We welcome PRs to fix these issues and add full CI testing for Mac.

Running Firefox with xvfb on OSX is untested and will require the user to install an X11 server. We suggest XQuartz, and we welcome feedback as to whether this setup works.

Quick Start

Once installed, it is easy to run a quick test of OpenWPM. Check out demo.py for an example. It uses the default settings specified in openwpm/config.py::ManagerParams and openwpm/config.py::BrowserParams, with the exception of the changes specified in demo.py.

The demo script also includes a sample of how to use the Tranco top sites list via the optional command line flag demo.py --tranco. Note that since this is a real top sites list it will include NSFW websites, some of which will be highly ranked.
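As a rough illustration of working with a top sites list, the sketch below reads a ranked CSV (one `rank,domain` pair per line, the layout Tranco publishes) and turns the first few entries into URLs. The function name `sites_from_tranco_csv`, the `http://` prefix, and the sample domains are assumptions made for this example; it is not a copy of what demo.py does.

```python
import csv
import io
from typing import Iterable, List


def sites_from_tranco_csv(lines: Iterable[str], limit: int) -> List[str]:
    """Return the first `limit` domains of a rank,domain CSV as http:// URLs.

    Hypothetical helper for illustration; not part of OpenWPM.
    """
    sites: List[str] = []
    for row in csv.reader(lines):
        if len(sites) >= limit:
            break
        if len(row) != 2:
            continue  # skip blank or malformed rows
        _rank, domain = row
        sites.append("http://" + domain.strip())
    return sites


# Tiny in-memory sample standing in for a downloaded list (placeholder domains).
sample = io.StringIO("1,example.com\n2,example.org\n3,example.net\n")
print(sites_from_tranco_csv(sample, limit=2))
```

Because the list is a real ranking, filtering or reviewing the resulting sites before a crawl may be worthwhile.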

More information on the instrumentation and configuration parameters is given below.

The docs provide a more in-depth tutorial, and a description of the methods of data collection available.

Troubleshooting

  1. WebDriverException: Message: The browser appears to have exited before we could connect...

    This error indicates that Firefox exited during startup (or was prevented from starting). There are many possible causes of this error:

    • If you are seeing this error for all browser spawn attempts check that:

      • Both Selenium and Firefox are the appropriate versions. Run the following commands and check that the versions they print match the versions required in install.sh and environment.yaml. If not, re-run the install script.

            cd firefox-bin/
            firefox --version

        and

            conda list selenium

      • If you are running in a headless environment (e.g. a remote server), ensure that all browsers have the headless browser parameter set to True before launching.
    • If you are seeing this error randomly during crawls it can be caused by an overtaxed system, either memory or CPU usage. Try lowering the number of concurrent browsers.

  2. In older versions of Firefox (pre 74) the setting to enable extensions was called extensions.legacy.enabled. If you need to work with an earlier Firefox, change the setting name extensions.experiments.enabled to extensions.legacy.enabled in openwpm/deploy_browsers/configure_firefox.py.

  3. Make sure your conda environment is activated (conda activate openwpm). You can see your environments and the active one by running conda env list; the active environment will have a * by it.

  4. make / gcc may need to be installed in order to build the web extension. On Ubuntu, this is achieved with apt-get install make. On OSX the necessary packages are part of xcode: xcode-select --install.

  5. On a very sparse operating system additional dependencies may need to be installed. See the Dockerfile for more inspiration, or open an issue if you are still having problems.

  6. If you see errors related to incompatible or non-existing python packages, try re-running the file with the environment variable PYTHONNOUSERSITE set. E.g., PYTHONNOUSERSITE=True python demo.py. If that fixes your issues, you are experiencing issue 689, which can be fixed by clearing your python user site packages directory, by prepending PYTHONNOUSERSITE=True to a specific command, or by setting the environment variable for the session (e.g., export PYTHONNOUSERSITE=True in bash). Please also add a comment to that issue to let us know you ran into this problem.
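Two of the checks above (items 3 and 6) can be scripted. The following is a minimal sketch that relies only on the `CONDA_DEFAULT_ENV` variable, which conda sets on activation, and on Python's standard `site` module. The function name `environment_warnings` and the warning wording are hypothetical and not part of OpenWPM.

```python
import os
import site
from typing import Dict, List


def environment_warnings(env: Dict[str, str], expected_env: str = "openwpm") -> List[str]:
    """Collect human-readable warnings about the Python/conda environment.

    Hypothetical helper for illustration; not part of OpenWPM.
    """
    warnings: List[str] = []
    active = env.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        warnings.append(
            f"active conda environment is {active!r}, expected {expected_env!r}"
        )
    # Any non-empty value of PYTHONNOUSERSITE disables the user site directory.
    if not env.get("PYTHONNOUSERSITE"):
        warnings.append(
            "PYTHONNOUSERSITE is not set; user site-packages may shadow pinned packages"
        )
    return warnings


for message in environment_warnings(dict(os.environ)):
    print("warning:", message)

# What the running interpreter actually decided about the user site directory.
print("user site-packages enabled:", site.ENABLE_USER_SITE)
print("user site-packages directory:", site.getusersitepackages())
```

A warning here does not necessarily mean something is broken; it only points at the two settings the troubleshooting list asks you to check.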

Documentation

Further information is available at OpenWPM's Documentation Page.

Advice for Measurement Researchers

OpenWPM is often used for web measurement research. We recommend the following for researchers using the tool:

Use a versioned release. We aim to follow Firefox's release cadence, which is roughly once every four weeks. If we happen to fall behind on checking in new releases, please file an issue. Versions more than a few months out of date will use unsupported versions of Firefox, which are likely to have known security vulnerabilities. Versions less than v0.10.0 are from a previous architecture and should not be used.

Include the OpenWPM version number in your publication. As of v0.10.0 OpenWPM pins all python, npm, and system dependencies. Including this information alongside your work will allow other researchers to contextualize the results, and can be helpful if future versions of OpenWPM have instrumentation bugs that impact results.

Developer instructions

If you want to contribute to OpenWPM, have a look at our CONTRIBUTING.md.

Instrumentation and Configuration

OpenWPM provides a breadth of configuration options, which can be found in Configuration.md. More detail on the output is available below.

Storage

OpenWPM distinguishes between two types of data: structured and unstructured. Structured data is all data captured by the instrumentation or emitted by the platform. Generally speaking, all data you download is unstructured data.

For each of the data classes we offer a variety of storage providers, and you are encouraged to implement your own, should the provided backends not be enough for you.

We have an outstanding issue to enable saving content generated by commands, such as screenshots and page dumps to unstructured storage (see #232).
For now, they get saved to manager_params.data_directory.

Local Storage

For storing structured data locally we offer two StorageProviders:

  • The SQLiteStorageProvider which writes all data into a SQLite database
    • This is the recommended approach for getting started as the data is easily explorable
  • The LocalArrowProvider which stores the data into Parquet files.
    • This method integrates well with NumPy/Pandas
    • It might be harder to ad-hoc process
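To give a sense of why a SQLite database is easy to explore, here is a small sketch using Python's built-in `sqlite3` module to list every table and its row count. The helper name `table_row_counts` is hypothetical, and the `site_visits` table with its two columns is an illustrative stand-in created in memory, not a statement about the schema OpenWPM writes.

```python
import sqlite3
from typing import Dict


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Map each table name in the database to its number of rows."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# In-memory stand-in for a crawl database; table and columns are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT)")
conn.executemany(
    "INSERT INTO site_visits (site_url) VALUES (?)",
    [("http://example.com",), ("http://example.org",)],
)
print(table_row_counts(conn))
conn.close()
```

To inspect a real crawl, replace `":memory:"` with the path to the database file produced by your run.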

For storing unstructured data locally we also offer two solutions:

  • The LevelDBProvider which stores all data into a LevelDB
    • This is the recommended approach
  • The LocalGzipProvider that gzips and stores the files individually on disk
    • Please note that file systems usually don't like thousands of files in one folder
    • Use with care or for single site visits
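The idea of gzipping each file individually can be sketched with the standard library alone. This is an assumption-laden illustration rather than the LocalGzipProvider implementation: the function names, the choice of SHA-256 as the file name, and the directory layout are all hypothetical.

```python
import gzip
import hashlib
import tempfile
from pathlib import Path


def store_gzipped(content: bytes, directory: Path) -> Path:
    """Write `content` to `<directory>/<sha256>.gz` and return the path.

    Hypothetical helper; naming by content hash means identical content
    is stored only once.
    """
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(content).hexdigest()
    target = directory / f"{digest}.gz"
    if not target.exists():
        with gzip.open(target, "wb") as handle:
            handle.write(content)
    return target


def load_gzipped(path: Path) -> bytes:
    """Read back and decompress a file written by store_gzipped."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    saved = store_gzipped(b"<html>hello</html>", Path(tmp) / "content")
    assert load_gzipped(saved) == b"<html>hello</html>"
    print(saved.name)
```

Every distinct piece of content becomes one file in a single directory, which is exactly the pattern the caveat above warns about for large crawls.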

Remote storage

When running in the cloud, saving records to disk is not a reasonable thing to do, so we offer remote StorageProviders for S3 (see #823) and GCP. Currently, all remote StorageProviders write to the respective object storage service (S3/GCS). The structured providers use the Parquet format.

NOTE: The Parquet and SQL schemas should be kept in sync except output-specific columns (e.g., instance_id in the Parquet output). You can compare the two schemas by running diff -y openwpm/DataAggregator/schema.sql openwpm/DataAggregator/parquet_schema.py.

Docker Deployment for OpenWPM

OpenWPM can be run in a Docker container. This is similar to running OpenWPM in a virtual machine, only with less overhead.

Building the Docker Container

Step 1: install Docker on your system. Most Linux distributions have Docker in their repositories. It can also be installed from docker.com. For Ubuntu you can use: sudo apt-get install docker.io

You can test the installation with: sudo docker run hello-world

Note: in order to run Docker without root privileges, add your user to the docker group (sudo usermod -a -G docker $USER). You will have to log out and back in for the change to take effect, and possibly also restart the Docker service.

Step 2: to build the image, run the following command from a terminal within the root OpenWPM directory:

    docker build -f Dockerfile -t openwpm .

After a few minutes, the container is ready to use.

Running Measurements from inside the Container

You can run the demo measurement from inside the container, as follows:

First of all, you need to give the container permissions on your local X-server. You can do this by running: xhost +local:docker

Then you can run the demo script using:

    mkdir -p docker-volume && docker run -v $PWD/docker-volume:/opt/OpenWPM/datadir \
    -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix --shm-size=2g \
    -it openwpm

Note: the --shm-size=2g parameter is required, as it increases the amount of shared memory available to Firefox. Without this parameter you can expect Firefox to crash on 20-30% of sites.

This command uses bind-mounts to share scripts and output between the container and host, as explained below (note the paths in the command assume it's being run from the root OpenWPM directory):

  • run starts the openwpm container and executes the python /opt/OpenWPM/demo.py command.

  • -v binds a directory on the host ($PWD/docker-volume) to a directory in the container (/opt/OpenWPM/datadir). Binding allows the script's output to be saved on the host (./docker-volume), and also allows you to pass inputs to the docker container (if necessary). We first create the docker-volume directory (if it doesn't exist), as docker will otherwise create it with root permissions.

  • The -it option states the command is to be run interactively (use -d for detached mode).

  • The demo script runs instances of Firefox that are not headless. As such, this command requires a connection to the host display server. If you are running headless crawls you can remove the following options: -e DISPLAY=$DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix.
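The options described above can be assembled programmatically, which makes the headless and display variants easy to compare. The sketch below only builds and prints the argument list; it does not start Docker. The function name `docker_run_args` and its parameters are hypothetical, while the flags themselves are the ones from the command shown earlier.

```python
import os
import shlex
from typing import List


def docker_run_args(host_dir: str, headless: bool, display: str = ":0") -> List[str]:
    """Build the `docker run` argument list described above.

    Hypothetical helper; the display options are dropped for headless crawls.
    """
    args = ["docker", "run", "-v", f"{host_dir}:/opt/OpenWPM/datadir"]
    if not headless:
        # Share the host X server so non-headless Firefox can open windows.
        args += ["-e", f"DISPLAY={display}", "-v", "/tmp/.X11-unix:/tmp/.X11-unix"]
    # Extra shared memory for Firefox, interactive mode, then the image name.
    args += ["--shm-size=2g", "-it", "openwpm"]
    return args


host_dir = os.path.join(os.getcwd(), "docker-volume")
print(shlex.join(docker_run_args(host_dir, headless=True)))
```

Dropping the display options only removes the X server connection; the browsers still have to be configured as headless in the crawl script itself.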

Alternatively, it is possible to run jobs as the user openwpm in the container too, but this might cause problems with non-headless browsers. It is therefore only recommended for headless crawls.

MacOS GUI applications in Docker

Requirements: Install XQuartz by following these instructions.

Given properly installed prerequisites (including a reboot), the helper script run-on-osx-via-docker.sh in the project root folder can be used to facilitate working with Docker in Mac OSX.

To open a bash session within the environment:

./run-on-osx-via-docker.sh /bin/bash

Or, run commands directly:

./run-on-osx-via-docker.sh python demo.py
./run-on-osx-via-docker.sh python -m test.manual_test
./run-on-osx-via-docker.sh python -m pytest
./run-on-osx-via-docker.sh python -m pytest -vv -s

Citation

If you use OpenWPM in your research, please cite our CCS 2016 publication on the infrastructure. You can use the following BibTeX.

@inproceedings{englehardt2016census,
    author    = "Steven Englehardt and Arvind Narayanan",
    title     = "{Online tracking: A 1-million-site measurement and analysis}",
    booktitle = {Proceedings of ACM CCS 2016},
    year      = "2016",
}

OpenWPM has been used in over 75 studies.

License

OpenWPM is licensed under GNU GPLv3. Additional code has been included from FourthParty and Privacy Badger, both of which are licensed GPLv3+.
