  • Stars
    2,841
  • Rank 15,866 (Top 0.4 %)
  • Language
    Python
  • License
    MIT License
  • Created over 8 years ago
  • Updated 4 months ago

Repository Details

Download the entire Wayback Machine archive for a given URL.

waybackpack v0.5.0

Waybackpack is a command-line tool that lets you download the entire Wayback Machine archive for a given URL.

For instance, to download every copy of the Department of Labor's homepage through 1996 (which happens to be the first year the site was archived), you'd run:

waybackpack http://www.dol.gov/ -d ~/Downloads/dol-wayback --to-date 1996

Result:

~/Downloads/dol-wayback/
├── 19961102145216
│   └── www.dol.gov
│       └── index.html
├── 19961103063843
│   └── www.dol.gov
│       └── index.html
├── 19961222171647
│   └── www.dol.gov
│       └── index.html
└── 19961223193614
    └── www.dol.gov
        └── index.html
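
Each directory name in the result tree is a 14-digit Wayback Machine timestamp (year, month, day, hour, minute, second). As a rough illustration, separate from waybackpack itself, such names can be parsed with Python's standard library. The helper name `parse_snapshot_timestamp` is hypothetical.

```python
from datetime import datetime


def parse_snapshot_timestamp(name):
    """Convert a 14-digit snapshot directory name into a datetime."""
    return datetime.strptime(name, "%Y%m%d%H%M%S")


# Directory names taken from the example tree.
names = ["19961102145216", "19961103063843", "19961222171647", "19961223193614"]
for name in names:
    print(name, "->", parse_snapshot_timestamp(name).isoformat())
```

Parsing the names this way makes it straightforward to sort or filter a downloaded archive by date.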

Or, just to print the URLs of all archived snapshots:

waybackpack http://www.dol.gov/ --list
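
Wayback Machine snapshot URLs generally take the form `<root>/web/<timestamp>/<original URL>`. The sketch below builds such a URL from a timestamp. `snapshot_url` is a hypothetical helper, not a waybackpack function, and the exact lines that `--list` prints may differ.

```python
def snapshot_url(timestamp, url, root="https://web.archive.org"):
    """Build a Wayback Machine URL for one snapshot of `url`.

    The default `root` mirrors the documented default of the --root option.
    """
    return "{0}/web/{1}/{2}".format(root, timestamp, url)


print(snapshot_url("19961102145216", "http://www.dol.gov/"))
```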

Installation

pip install waybackpack

Usage

usage: waybackpack [-h] [--version] (-d DIR | --list) [--raw] [--root ROOT]
                   [--from-date FROM_DATE] [--to-date TO_DATE]
                   [--user-agent USER_AGENT] [--follow-redirects]
                   [--uniques-only] [--collapse COLLAPSE] [--ignore-errors]
                   [--max-retries MAX_RETRIES] [--no-clobber] [--quiet]
                   [--progress]
                   url

positional arguments:
  url                   The URL of the resource you want to download.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -d DIR, --dir DIR     Directory to save the files. Will create this
                        directory if it doesn't already exist.
  --list                Instead of downloading the files, only print the list
                        of snapshots.
  --raw                 Fetch file in its original state, without any
                        processing by the Wayback Machine or waybackpack.
  --root ROOT           The root URL from which to serve snapshotted
                        resources. Default: 'https://web.archive.org'
  --from-date FROM_DATE
                        Timestamp-string indicating the earliest snapshot to
                        download. Should take the format YYYYMMDDhhmmss,
                        though you can omit as many of the trailing digits as
                        you like. E.g., '201501' is valid.
  --to-date TO_DATE     Timestamp-string indicating the latest snapshot to
                        download. Should take the format YYYYMMDDhhmmss,
                        though you can omit as many of the trailing digits as
                        you like. E.g., '201604' is valid.
  --user-agent USER_AGENT
                        The User-Agent header to send along with your requests
                        to the Wayback Machine. If possible, please include
                        the phrase 'waybackpack' and your email address. That
                        way, if you're battering their servers, they know who
                        to contact. Default: 'waybackpack'.
  --follow-redirects    Follow redirects.
  --uniques-only        Download only the first version of duplicate files.
  --collapse COLLAPSE   An archive.org `collapse` parameter. Cf.:
                        https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#collapsing
  --ignore-errors       Don't crash on non-HTTP errors e.g., the requests
                        library's ChunkedEncodingError. Instead, log error and
                        continue. Cf.
                        https://github.com/jsvine/waybackpack/issues/19
  --max-retries MAX_RETRIES
                        How many times to try accessing content with 4XX or
                        5XX status code before skipping?
  --no-clobber          If a file is already present (and >0 filesize), don't
                        download it again.
  --quiet               Don't log progress to stderr.
  --progress            Print a progress bar. Mutes the default logging.
                        Requires `tqdm` to be installed.
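
To make the `--no-clobber` and `--max-retries` descriptions concrete, here is a minimal sketch of the behavior they describe. It is not waybackpack's actual implementation: `should_skip`, `fetch_with_retries`, and the `fetch` callable are hypothetical names. The sketch also assumes that `--max-retries` counts retries after one initial attempt, which the help text does not spell out.

```python
import os
import tempfile


def should_skip(path):
    """Return True if `path` already holds a non-empty file."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def fetch_with_retries(fetch, max_retries):
    """Call `fetch` until it returns a non-error status or retries run out.

    `fetch` is a hypothetical callable returning (status_code, body).
    Returns the body, or None if every attempt gave a 4XX/5XX status.
    """
    # Assumption: one initial attempt plus up to `max_retries` retries.
    for _ in range(max_retries + 1):
        status, body = fetch()
        if status < 400:
            return body
    return None


# Demo with a temporary directory and a fake fetcher (no network access).
tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, "index.html")
responses = iter([(503, None), (200, "<html></html>")])

if not should_skip(target):
    body = fetch_with_retries(lambda: next(responses), max_retries=2)
    if body is not None:
        with open(target, "w") as f:
            f.write(body)

print(should_skip(target))  # the file now exists and is non-empty
```

Checking for a non-empty file, rather than mere existence, means a download that was interrupted before any bytes were written is retried on the next run.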

Support

Waybackpack is written in pure Python, depends only on requests, and should work wherever Python works. It should be compatible with both Python 2 and Python 3.

Thanks

Many thanks to the following users for catching bugs, fixing typos, and proposing useful features:

More Repositories

1

pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
Python
6,285
star
2

markovify

A simple, extensible Markov chain generator.
Python
3,292
star
3

nbpreview

Render Jupyter/IPython notebooks without running a notebook server.
CSS
289
star
4

notebookjs

Render Jupyter/IPython notebooks on the fly, in the browser. (Or on the command line, if you'd like.)
JavaScript
272
star
5

spectra

Easy color scales and color conversion for Python.
Python
257
star
6

envplus

Combine your Python virtualenvs.
Python
115
star
7

weightedcalcs

Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
Python
103
star
8

reporter

Literate data analysis with iPython notebooks and Jekyll.
Ruby
92
star
9

twick

Twitter, quick. Fetch and store tweets on short notice.
Python
80
star
10

intro-to-visidata

Source files for "An Introduction to VisiData"
HTML
70
star
11

visidata-plugins

A place for me to share VisiData plugins I've written.
Python
36
star
12

mplstyle

A simple API for setting matplotlib styles, as well as a repository of nice styles.
Python
32
star
13

visidata-cheat-sheet

A one-page cheat sheet for VisiData, available in multiple languages.
HTML
26
star
14

gekyll

A Jekyll plugin for using Git repositories as posts, giving you access to a post's commits, diffs, and more.
Ruby
25
star
15

nbexec

A dead-simple tool for executing Jupyter notebooks from the command line.
Python
20
star
16

Backbone.Table

Render any Backbone.js Collection as an HTML table.
JavaScript
20
star
17

buzzfeed-news-trending-strip

Dataset: BuzzFeed News “Trending” Strip, 2018–2023
Python
19
star
18

tab-bankrupter

A Chrome extension for declaring "tab bankruptcy" without losing all your links.
JavaScript
18
star
19

astronomer

Fetch information about the users who've starred a given GitHub repository.
Python
17
star
20

txtbirds

‾‾\/‾‾
JavaScript
14
star
21

tinyapi

Python wrapper around TinyLetter's publicly accessible (but undocumented) API.
Python
13
star
22

fbpagefeed

A library and command-line tool for fetching Facebook Pages' published posts.
Python
12
star
23

virtualenv-recipes

Recipes for useful Python virtualenvs.
Shell
12
star
24

data-tactics

Half-baked idea: Conceptual building blocks for data analysis.
11
star
25

tinystats

Command-line tool for fetching message, URL, and subscriber data for the TinyLetter newsletters you own.
Python
11
star
26

vinejs

Somewhere between a total joke and a useful library for fetching Vine.co videos.
JavaScript
11
star
27

nicar-2024-pdfplumber-workshop

Jupyter Notebook
11
star
28

mta-colors

CSS & JSON files to help developers use the official colors of New York's Metropolitan Transportation Authority.
CSS
10
star
29

compleat

Fetch autocomplete suggestions from Google Search.
Python
9
star
30

google-table-converter

A browser-based tool for converting Google Spreadsheets into responsive HTML <table>s.
HTML
9
star
31

lede-2023

Jupyter Notebook
8
star
32

gifparse

[Work in progress.] Parse the GIF 89a file format, down to the minor details. Pure Python, no dependencies.
Python
8
star
33

nicar-2015-schedule

NICAR 2015 conference schedule as CSV and JSON, plus the underlying Python scraper.
Python
8
star
34

WRIT1-CE9741

WRIT1-CE9741, Fall 2013, NYU School of Continuing and Professional Studies
Ruby
6
star
35

nicar-2023-pdfplumber-workshop

Jupyter Notebook
6
star
36

csvcat

Efficiently concatenate CSVs (or other tabular text files), stripping extra header lines.
Shell
6
star
37

nicar-2017-schedule

NICAR 2017 conference schedule as JSON and CSV, plus the underlying Python scraper.
Python
6
star
38

babynames

CSVs and parsers for the Social Security Administration's historical baby name data.
Python
5
star
39

minicard

A bare-bones CSS stylesheet for creating "card"-style elements.
CSS
4
star
40

macmailer

Command-line utility and Ruby library for creating/sending messages in OSX's Mail.app program.
Ruby
4
star
41

nicar-now

Your unofficial guide to what's happening next at NICAR 2020.
3
star
42

text-toggle

Let readers toggle between two versions of a text.
JavaScript
3
star
43

fidget

Fidget.js is a small, configurable JavaScript library that resizes blocks of text to fit their containers.
JavaScript
3
star
44

statusfiles

IDEA: A simple, structured, standardized, technology-agnostic way to represent the status of things.
3
star
45

nicar-2018-schedule

Your unofficial guide to what's happening next at NICAR 2018.
Python
3
star
46

glat-glong

Find the precise latitude and longitude of any point on Google Maps. A Chrome extension.
JavaScript
3
star
47

lede-2024

Jupyter Notebook
3
star
48

gmap-button

A JavaScript library for adding buttons to embedded Google Maps.
JavaScript
2
star
49

crochet

Hook into and/or monkeypatch any Ruby class- or instance-method. Provides 'before' and 'after' hooks, plus their destructive evil twins.
Ruby
2
star
50

jub

As in, "get the jub done." Or as in, "jQuery, Underscore, Backbone." It's a shell script that automatically grabs the latest versions of those libraries, so that you can get on with prototyping.
Shell
2
star
51

download-all-attachments-from-a-gmail-conversation

Two methods that *seem* to work...
1
star
52

fbiter

A simple library for iterating through paginated Facebook API endpoints.
Python
1
star
53

weddingroulette

The code behind http://weddingroulette.com/
Ruby
1
star
54

jekyll-auto-s3

Automatically sync your Jekyll project to S3 on every (re)build.
Ruby
1
star
55

griddle

Griddle.js is a lightweight tool for creating and manipulating programmable, fluid, shift-able grids.
JavaScript
1
star
56

linstapaper

Article-list and site files for linstapaper.com
JavaScript
1
star
57

nbtemplate

Render iPython notebooks to other layouts, via templates. Library and command-line tool.
Python
1
star
58

nicar-2019-schedule

The NICAR 2019 conference schedule as JSON and CSV files, plus the underlying Python scraper.
Python
1
star
59

parabear

An experiment in stupid-simple HTML article text extraction.
JavaScript
1
star