• Stars: 176
• Rank: 215,735 (Top 5%)
• Language: Python
• License: MIT License
• Created: over 9 years ago
• Updated: almost 6 years ago

Repository Details

A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

CommonCrawl Index Client (CDX Index API Client)

A simple Python command-line tool for retrieving a list of URLs in bulk using the CommonCrawl Index API at http://index.commoncrawl.org (or any other web archive CDX Server API).

Examples

The tool takes advantage of the CDX Server Pagination API and Python's multiprocessing support to load pages (chunks) of a large URL index in parallel.

This may be especially useful for prefix/domain extraction.
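
As a rough sketch of that idea (not the tool's actual implementation), the snippet below asks the index how many pages a query spans via the showNumPages parameter and then fetches the pages with a small multiprocessing pool; the collection, query and worker count are only illustrative:

import requests
from multiprocessing import Pool

API = 'http://index.commoncrawl.org/CC-MAIN-2015-06-index'
QUERY = {'url': '*.io', 'output': 'json'}

def num_pages():
    # Ask the CDX server how many pages the query spans (pywb-style servers
    # answer showNumPages with a JSON object containing a "pages" field)
    resp = requests.get(API, params=dict(QUERY, showNumPages='true'))
    resp.raise_for_status()
    return int(resp.json().get('pages', 1))

def fetch_page(page):
    # Fetch a single page (chunk) of the index
    resp = requests.get(API, params=dict(QUERY, page=str(page)))
    resp.raise_for_status()
    return resp.text

if __name__ == '__main__':
    pool = Pool(4)                      # illustrative worker count
    try:
        for chunk in pool.imap_unordered(fetch_page, range(num_pages())):
            print(chunk[:200])          # process each chunk as it arrives
    finally:
        pool.close()
        pool.join()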

To use, first install dependencies: pip install -r requirements.txt (The script has been tested on Python 2.7.x)

For example, to fetch all entries in the index for the url http://iana.org/ from the index CC-MAIN-2015-06, run: ./cdx-index-client.py -c CC-MAIN-2015-06 http://iana.org/

It is often a good idea to check how big the dataset is first: ./cdx-index-client.py -c CC-MAIN-2015-06 *.io --show-num-pages

will print the number of pages that will be fetched to get a list of urls in the '*.io' domain.

This gives a rough sense of the size of the query. A query with thousands of pages may take a long time!

Then, you might fetch a list of urls from the index which are part of the *.io domain, as follows:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --fl url -z

The --fl flag specifies that only the url field should be fetched, instead of the entire index row.

The -z flag indicates that the results should be stored gzip-compressed.

For the above query, the output will be stored in domain-io-N.gz, one file per page N (with N zero-padded to the required number of digits).
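
To illustrate consuming that output, here is a minimal Python 3 sketch; it assumes the gzipped page files from the query above are in the current directory and contain one url per line (because only the url field was requested):

import glob
import gzip

urls = []
for path in sorted(glob.glob('domain-io-*.gz')):
    # Each page file holds plain text, one url per line
    with gzip.open(path, 'rt') as f:
        for line in f:
            line = line.strip()
            if line:
                urls.append(line)

print('read %d urls from %d page files'
      % (len(urls), len(glob.glob('domain-io-*.gz'))))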

Usage Options

Below is the current list of options, also available by running ./cdx-index-client.py -h

usage: CDX Index API Client [-h] [-n] [-p PROCESSES] [--fl FL] [-j] [-z]
                            [-o OUTPUT_PREFIX] [-d DIRECTORY]
                            [--page-size PAGE_SIZE]
                            [-c COLL | --cdx-server-url CDX_SERVER_URL]
                            [--timeout TIMEOUT] [--max-retries MAX_RETRIES]
                            [-v] [--pages [PAGES [PAGES ...]]]
                            [--header [HEADER [HEADER ...]]] [--in-order]
                            url

positional arguments:
  url                   url to query in the index: For prefix, use:
                        http://example.com/* For domain query, use:
                        *.example.com

optional arguments:
  -h, --help            show this help message and exit
  -n, --show-num-pages  Show Number of Pages only and exit
  -p PROCESSES, --processes PROCESSES
                        Number of worker processes to use
  --fl FL               select fields to include: eg, --fl url,timestamp
  -j, --json            Use json output instead of cdx(j)
  -z, --gzipped         Store gzipped results, with .gz extensions
  -o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX
                        Custom output prefix, append with -NN for each page
  -d DIRECTORY, --directory DIRECTORY
                        Specify custom output directory
  --page-size PAGE_SIZE
                        size of each page in blocks, >=1
  -c COLL, --coll COLL  The index collection to use or "all" to use all
                        available indexes. The default value is the most
                        recent available index
  --cdx-server-url CDX_SERVER_URL
                        Set endpoint for CDX Server API
  --timeout TIMEOUT     HTTP read timeout before retry
  --max-retries MAX_RETRIES
                        Number of retry attempts
  -v, --verbose         Verbose logging of debug msgs
  --pages [PAGES [PAGES ...]]
                        Get only the specified result page(s) instead of all
                        results
  --header [HEADER [HEADER ...]]
                        Add custom header to request
  --in-order            Fetch pages in order (default is to shuffle page list)
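
To see how these options compose, here is one possible combined invocation (the collection is the one used above; the output prefix, directory and worker count are purely illustrative). It fetches the url and timestamp fields for the *.io domain as gzipped JSON using four worker processes:

./cdx-index-client.py -c CC-MAIN-2015-06 *.io --fl url,timestamp -j -z -p 4 -d ./io-index -o io-urls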

Additional Use Cases

While this tool was designed specifically for use with the index at http://index.commoncrawl.org, it can also be used with any other running CDX server, including pywb, OpenWayback and IA Wayback.

The client uses a common subset of the pywb CDX Server API and the original IA Wayback CDX Server API, and so should work with either of these tools.

To specify a custom api endpoint, simply use the --cdx-server-url flag. For example, to connect to a locally running server, you can run:

./cdx-index-client.py example.com/* --cdx-server-url http://localhost:8080/pywb-cdx
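
Under the hood this is just an HTTP query against that endpoint. A minimal Python sketch of the equivalent raw request, assuming the local pywb server above and the url/output/page parameters from the common CDX Server API subset:

import requests

CDX_API = 'http://localhost:8080/pywb-cdx'

resp = requests.get(CDX_API, params={
    'url': 'example.com/*',   # prefix query
    'output': 'json',         # one JSON object per result line (pywb)
    'page': '0',              # first page of results
})
resp.raise_for_status()

for line in resp.text.splitlines():
    print(line)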

More Repositories

1. webarchiveplayer (Python, 193 stars) - NOTE: This project is no longer being actively developed. Check out Webrecorder Player for the latest player (https://github.com/webrecorder/webrecorderplayer-electron). (Legacy: Desktop application for browsing web archives (WARC and ARC).)
2. webarchive-indexing (Python, 41 stars) - Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
3. pywb-webrecorder (Python, 39 stars) - Check out https://github.com/webrecorder/webrecorder for newer version matching https://webrecorder.io
4. browsertrix (Python, 39 stars) - (Note: This repository is obsolete, please see the new Browsertrix at webrecorder/browsertrix.) Browser-Based On-Demand Web Archiving Automation.
5. certauth (Python, 26 stars) - Simple CertificateAuthority and host certificate creation, useful for man-in-the-middle HTTPS proxy.
6. cc-index-server (HTML, 21 stars) - Deployment of pywb as a CommonCrawl Index Server.
7. pywb-proxy-demo (Shell, 7 stars) - Demo of pywb usage as HTTP/S Proxy for Web Replay.
8. proxy-wabac (JavaScript, 7 stars) - Proof-of-Concept Proxy Mode wabac.js replay.
9. client-replay-wayback (JavaScript, 7 stars) - A test configuration for eventual client-side replay wayback deployment.
10. memento-reconstruct (JavaScript, 6 stars) - Cross-Archive Web Replay Using Memento.
11. pywb-ipfs (Python, 5 stars) - Experimental recording and replay of WARCs to/from IPFS (https://ipfs.io/).
12. warcsigner (Python, 5 stars) - CLI tools for signing / verifying compressed archive files with an RSA key pair.
13. pywb-samples (HTML, 5 stars) - Sample Archived Content for pywb.
14. pywb-warcbase (Python, 5 stars) - pywb support for warcbase.
15. webrec-platform (Python, 4 stars) - Webrecorder Platform Components.
16. earlybrowserreborn (C, 3 stars) - Automatically exported from code.google.com/p/earlybrowserreborn.
17. spice-chrome (Shell, 3 stars) - Experiment with OWT Chrome 60 and Spice protocol.
18. pywb-ia (3 stars) - pywb setup for Internet Archive web archives.
19. docker-pywb (Shell, 3 stars) - Docker image for pywb.
20. via (Java, 3 stars) - https://via.hypothes.is (pywb + hypothes.is annotations).
21. ipfs-url-index (JavaScript, 2 stars) - Experimental URL->CID index using b trees (chunky-trees from @mikeal).
22. js-ipfs-in-memory-repo (JavaScript, 2 stars) - Create an in-memory only js-ipfs repo.
23. mhtml-warc (Python, 2 stars) - CLI tools for converting MHTML <-> WARC.
24. important-warcs (1 star) - Important Web Archives.
25. oai-solr-tools (Java, 1 star) - Webapp that supports the OAI-PMH protocol driven by a Solr index.
26. talks (JavaScript, 1 star) - Collection of reveal.js slides from various talks.
27. loadzip (JavaScript, 1 star)
28. http-range-dat (JavaScript, 1 star) - Experiment Proxying HTTP Range Requests to DAT random access.
29. pywb-opensearch-cdx (Python, 1 star) - Experimental module for reading CDX from OpenSearch XML.