DuckDuckGo distributed crawler (DDC) prototype
The purpose of this project is to prototype a distributed crawler for the DuckDuckGo search engine.
Protocol
Basic workflow
- A client requests a list of domains to check for spam; the server answers with a list of domains
- The server may also include additional data in the response, asking the client to upgrade itself or its page analysis component
- The client analyzes the domains and then sends the results back to the server
- The client requests another batch of domains to check, and so on (a minimal client loop is sketched after this list)
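From the client's point of view, the loop might look like the following minimal sketch; the function names and return values are placeholders for illustration, not the actual ddc_client.py code:

```python
# Minimal sketch of the worker loop; function bodies are placeholders,
# not the real ddc_client.py / ddc_process.py logic.

def fetch_domain_list():
    """Placeholder: ask the server for domains to check (GET request)."""
    return ["example.com", "example.org"]

def analyze(domain):
    """Placeholder: run the page analysis component on a domain."""
    return {"domain": domain, "is_spam": False}

def post_results(results):
    """Placeholder: send the analysis results back to the server (POST request)."""
    print("posting %d results" % len(results))

def worker_loop(rounds=3):
    for _ in range(rounds):
        domains = fetch_domain_list()
        if not domains:
            break
        results = [analyze(domain) for domain in domains]
        post_results(results)

if __name__ == "__main__":
    worker_loop()
```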
Implementation
- It's a classic REST API
- To get a domain list, the client sends a GET request; to post the results, it sends a POST request (see the request sketch after the URL parameter list below)
URL parameters:
- version : the protocol version, which defines the XML response structure; it must be incremented when a change breaks client compatibility. The server must always handle all old protocol versions, at least to tell clients that they must upgrade
- pc_version : the version of the page processing binary component
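As an illustration using httplib2 (listed in the dependencies below), the two requests could be built like this; the host, port, path, version numbers and POST payload are assumptions, the real values are defined by ddc_server.py and ddc_client.py:

```python
# Illustrative requests only: the endpoint URL, version numbers and POST payload
# are assumptions, not necessarily what ddc_client.py / ddc_server.py really use.
import urllib.parse

import httplib2

SERVER_URL = "http://localhost:8080/rest"  # assumed endpoint
PARAMS = urllib.parse.urlencode({"version": 1,      # assumed protocol version
                                 "pc_version": 1})  # assumed component version

http = httplib2.Http()

# GET: ask the server for a list of domains to check
response, xml_content = http.request("%s?%s" % (SERVER_URL, PARAMS), method="GET")

# POST: send the analysis results back (the payload structure is an assumption)
results_xml = b"<ddc><domainlist><domain name='example.com' spam='0'/></domainlist></ddc>"
response, _ = http.request("%s?%s" % (SERVER_URL, PARAMS),
                           method="POST",
                           body=results_xml,
                           headers={"content-type": "text/xml"})
```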
XML response format
The response contains one of these nodes directly under the root (a parsing sketch follows this list):
- 'upgrades' : can contain nodes telling the client to upgrade its components (with a URL to download the new version)
- 'domainlist' : the list of domains to check ('domain' nodes)
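For illustration, a response carrying a domain list could be parsed as follows; the root element name and the 'name' attribute are assumptions, only the 'upgrades', 'domainlist' and 'domain' nodes come from the description above:

```python
# Illustrative parsing of a server response; the root element name ('ddc')
# and the 'name' attribute are assumptions.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """
<ddc>
  <domainlist>
    <domain name="example.com"/>
    <domain name="example.org"/>
  </domainlist>
</ddc>
"""

root = ET.fromstring(SAMPLE_RESPONSE)

upgrades = root.find("upgrades")
if upgrades is not None:
    # the client would download and install the new component(s) here
    print("server requests an upgrade")

domainlist = root.find("domainlist")
if domainlist is not None:
    domains = [d.get("name") for d in domainlist.findall("domain")]
    print("domains to check: %s" % ", ".join(domains))
```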
Files
- ddc_client.py : Code for a crawling worker
- ddc_process.py : Code that simulates the binary page processing component; it currently returns dummy results
- ddc_server.py : Code for the server that distributes the crawling work to the clients and gets the results from them
- tests/single_client.sh : Bash script to do a small simulation by launching the server and connecting a client to it
- tests/client_upgrade.sh : Bash script to simulate a client upgrade initiated by the server
Dependencies
Ubuntu users
On recent Ubuntu versions, you can install all dependencies by running the following command line:
sudo apt-get -V install python3 python3-httplib2
The code has only been tested on Linux but is fully OS neutral.