DuckDuckGo Tracker Radar Collector

🕸 Modular, multithreaded, puppeteer-based crawler used to generate third-party request data for the Tracker Radar.

How do I use it?

Use it from the command line

  1. Clone this project locally (git clone git@github.com:duckduckgo/tracker-radar-collector.git)
  2. Install all dependencies (npm i)
  3. Run the command line tool:
npm run crawl -- -u "https://example.com" -o ./data/ -v

Available options:

  • -o, --output <path> - (required) output folder where output files will be created
  • -u, --url <url> - single URL to crawl
  • -i, --input-list <path> - path to a text file with list of URLs to crawl (each in a separate line)
  • -d, --data-collectors <list> - comma-separated list (e.g. -d 'requests,cookies') of data collectors that should be used (all by default)
  • -c, --crawlers <number> - override the default number of concurrent crawlers (default number is picked based on the number of CPU cores)
  • --reporters <list> - comma separated list (e.g. --reporters 'cli,file,html') of reporters to be used ('cli' by default)
  • -v, --verbose - instructs reporters to log additional information (e.g. for "cli" reporter progress bar will not be shown when verbose logging is enabled)
  • -l, --log-path <path> - instructs reporters where all logs should be written to
  • -f, --force-overwrite - overwrite existing output files (by default entries with existing output files are skipped)
  • -3, --only-3p - don't save any first-party data (e.g. requests, API calls for the same eTLD+1 as the main document)
  • -m, --mobile - emulate a mobile device when crawling
  • -p, --proxy-config <host> - optional SOCKS proxy host
  • -r, --region-code <region> - optional two-letter region code. For metadata only
  • -a, --disable-anti-bot - disable the simple built-in anti-bot detection script injected into every frame
  • --chromium-version <version_number> - use custom version of Chromium (e.g. "843427") instead of using the default
  • --config <path> - path to a config file that lets you set all of the above settings (and more). Note that CLI flags take priority over settings passed via the config file. You can find a sample config file in tests/cli/sampleConfig.json.
  • --autoconsent-action <action> - automatic autoconsent action (requires the cmps collector). Possible values: optIn, optOut

Use it as a module

  1. Install this project as a dependency (npm i git+https://github.com/duckduckgo/tracker-radar-collector.git).

  2. Import it:

// you can either import a "crawlerConductor" that runs multiple crawlers for you
const {crawlerConductor} = require('tracker-radar-collector');
// or a single crawler
const {crawler} = require('tracker-radar-collector');

// you will also need some data collectors (/collectors/ folder contains all built-in collectors)
const {RequestCollector, CookieCollector, …} = require('tracker-radar-collector');

  3. Use it:

crawlerConductor({
    // required ↓
    // two formats available: the first format uses the default collectors set below, the second format uses a custom set of collectors for this one URL
    urls: ['https://example.com', {url: 'https://duck.com', dataCollectors: [new ScreenshotCollector()]}, …],
    dataCallback: (url, result) => {…},
    // optional ↓
    dataCollectors: [new RequestCollector(), new CookieCollector()],
    failureCallback: (url, error) => {…},
    numberOfCrawlers: 12, // custom number of crawlers (there is a hard limit of 38 though)
    logFunction: (...msg) => {…}, // custom logging function
    filterOutFirstParty: true, // don't save any first-party data (false by default)
    emulateMobile: true, // emulate a mobile device (false by default)
    proxyHost: 'socks5://myproxy:8080', // SOCKS proxy host (none by default)
    antiBotDetection: true, // if anti-bot detection script should be injected (true by default)
    chromiumVersion: '843427', // Chromium version that should be downloaded and used instead of the default one
    maxLoadTimeMs: 30000, // how long crawlers should wait for the page to load, defaults to 30s
    extraExecutionTimeMs: 2500, // how long crawlers should wait after the page loads before collecting data, defaults to 2.5s
});

OR (if you prefer to run a single crawler)

// crawler will throw an exception if the crawl fails
const data = await crawler(new URL('https://example.com'), {
    // optional ↓
    collectors: [new RequestCollector(), new CookieCollector(), …],
    log: (...msg) => {…},
    urlFilter: (url) => {…}, // function that, for each request URL, decides if its data should be stored or not
    emulateMobile: false,
    emulateUserAgent: false, // don't use the default puppeteer UA (default true)
    proxyHost: 'socks5://myproxy:8080',
    // if you prefer to create the browser context yourself (e.g. to use another browser or a non-incognito context) you can pass it here (by default the crawler will create an incognito context using standard Chromium for you)
    browserContext: context,
    runInEveryFrame: () => {window.alert('injected')}, // function that should be executed in every frame (main + all subframes)
    executablePath: '/some/path/Chromium.app/Contents/MacOS/Chromium', // path to a custom Chromium installation that should be used instead of the default one
    maxLoadTimeMs: 30000, // how long the crawler should wait for the page to load, defaults to 30s
    extraExecutionTimeMs: 2500, // how long the crawler should wait after the page loads before collecting data, defaults to 2.5s
});

โ„น๏ธ Hint: check out crawl-cli.js and crawlerConductor.js to see how crawlerConductor and crawler are used in the wild.

Output format

Each successfully crawled website produces a separate file named after the website (when using the CLI tool). The output data format is specified in crawler.js (see the CollectResult type definition). Additionally, each crawl creates a metadata.json file containing the crawl configuration, system configuration and some high-level stats.

Data post-processing

An example post-processing script that can be used as a template is in post-processing/summary.js. Execute it from the command line like this:

node ./post-processing/summary.js -i ./collected-data/ -o ./result.json

โ„น๏ธ Hint: When dealing with huge amounts of data you may need to increase nodejs's memory limit e.g. node --max_old_space_size=4096.

Creating new collectors

Each collector needs to extend the BaseCollector and has to override the following methods:

  • id() which returns the name of the collector (e.g. 'cookies')
  • getData(options) which should return the collected data. options has the following properties:
    • finalUrl - final URL of the main document (after all redirects) that you may want to use,
    • filterFunction which, if provided, takes a URL and returns a boolean telling you whether a given piece of data should be returned or filtered out based on its origin.

Additionally, each collector can override the following methods:

  • init(options) which is called before the crawl begins
  • addTarget(targetInfo) which is called whenever a new target is created (main page, iframe, web worker etc.)
  • postLoad() which is called after the page has loaded. This is the place for executing heavy page interactions (extraExecutionTimeMs is applied after this hook).

There are a couple of built-in collectors in the collectors/ folder. CookieCollector is the simplest one and can be used as a template.
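
The methods listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the project: the BaseCollector class below is a local stand-in so that the example runs on its own (a real collector would extend the project's BaseCollector), TargetUrlCollector is a hypothetical collector, targetInfo is assumed to expose a url string, and filterFunction is assumed to return true for data that should be kept.

```javascript
// Stand-in for the project's BaseCollector so the sketch runs on its own.
// In the real project you would extend the actual BaseCollector instead.
class BaseCollector {
    id() { return 'base'; }
    init(options) {}
    addTarget(targetInfo) {}
    postLoad() {}
    getData(options) { return null; }
}

// Hypothetical collector that records the URL of every target it sees.
class TargetUrlCollector extends BaseCollector {
    id() {
        return 'targetUrls';
    }

    // Called before the crawl begins: reset the collector's state.
    init() {
        this._urls = [];
    }

    // Assumption: targetInfo exposes a `url` string.
    addTarget(targetInfo) {
        if (targetInfo && typeof targetInfo.url === 'string') {
            this._urls.push(targetInfo.url);
        }
    }

    // Assumption: filterFunction(url) returns true for data to keep.
    getData({filterFunction} = {}) {
        return filterFunction ? this._urls.filter(url => filterFunction(url)) : this._urls.slice();
    }
}

const collector = new TargetUrlCollector();
collector.init({});
collector.addTarget({url: 'https://example.com/'});
collector.addTarget({url: 'https://tracker.test/frame.html'});
console.log(collector.getData({
    finalUrl: 'https://example.com/',
    filterFunction: url => !url.includes('example.com'),
}));
// [ 'https://tracker.test/frame.html' ]
```

Keeping all state on the instance and resetting it in init() means one crawl's data cannot leak into another if a collector object is ever reused.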

Each new collector has to be added in two places to be discoverable:

  • crawlerConductor.js - so that crawlerConductor knows about it (and it can be used in the CLI tool)
  • main.js - so that the new collector can be imported by other projects

You can also add types to define the structure of the data exported by your collector. These should be added to the CollectorData type in collectorsList.js. This will add type hints to all places where the data is used in the code.
