  • Stars: 403
  • Rank: 106,466 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created over 7 years ago
  • Updated 8 months ago



Archive Now (archivenow)

A Tool To Push Web Resources Into Web Archives

Archive Now (archivenow) is currently configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the "handlers" folder.

Update January 2021

Originally, archivenow was configured to push to six public web archives. Two of those, WebCite and Archive.st, have been removed: WebCite no longer accepts archiving requests, and Archive.st presents a CAPTCHA when archivenow attempts to push to it. In addition, the method for pushing to archive.today and megalodon.jp has been updated; archivenow now uses Selenium to push to these two archives.

As explained below, this library can be used through:

  • Command Line Interface (CLI)
  • A Web Service
  • A Docker Container
  • Python

Installing

The latest release of archivenow can be installed using pip:

$ pip install archivenow

The latest development version containing changes not yet released can be installed from source:

$ git clone [email protected]:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./

In order to push to archive.today and megalodon.jp, archivenow uses Selenium, which is already listed in requirements.txt. However, Selenium also needs a driver to interface with the chosen browser. We recommend using archivenow with Firefox and its corresponding GeckoDriver.

You can download the latest versions of Firefox and the GeckoDriver to use with archivenow.

After installing the driver, you can push to archive.today and megalodon.jp from archivenow.

CLI USAGE

Help for archivenow and its sub-commands is available via the -h or --help flag, as shown below.

$ archivenow -h
usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]]
                     [--is] [--ia] [--warc [WARC]] [-v] [--all]
                     [--server] [--host [HOST]] [--agent [AGENT]]
                     [--port [PORT]]
                     [URI]

positional arguments:
  URI                   URI of a web resource

optional arguments:
  -h, --help            show this help message and exit
  --mg                  Use Megalodon.jp
  --cc                  Use The Perma.cc Archive
  --cc_api_key [CC_API_KEY]
                        An API KEY is required by The Perma.cc Archive
  --is                  Use The Archive.is
  --ia                  Use The Internet Archive
  --warc [WARC]         Generate WARC file
  -v, --version         Report the version of archivenow
  --all                 Use all possible archives
  --server              Run archiveNow as a Web Service
  --host [HOST]         A server address
  --agent [AGENT]       Use "wget" or "squidwarc" for WARC generation
  --port [PORT]         A port number to run a Web Service

Examples

Example 1

To save the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com

Example 2

By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com

Example 3

To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:

$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc

Example 4

To save the web page (https://nypost.com/) in all configured web archives. In addition to preserving the page in all configured archives, this command also creates a WARC file locally:

$ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key
http://archive.is/dcnan
https://perma.cc/53CC-5ST8
https://web.archive.org/web/20181002081445/https://nypost.com/
https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/
https_nypost.com__96ec2300.warc

Example 5

To download the web page (https://nypost.com/) and create a WARC file:

$ archivenow --warc=mypage --agent=wget https://nypost.com/
mypage.warc

Server

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 12345):

$ archivenow --server

Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)

Example 6

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://0.0.0.0:12345/ia/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 95
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:20:18 GMT

    {
      "results": [
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com"
      ]
    }
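Since the service returns a JSON object with a "results" array, the response is easy to consume programmatically. A minimal sketch in Python using only the standard library (parsing the sample response above rather than fetching it live, which would require the server to be running):

```python
import json

# Response body returned by the archivenow web service
# (copied from the curl example above).
body = """
{
  "results": [
    "https://web.archive.org/web/20181002082007/http://www.foxnews.com"
  ]
}
"""

# The service reports one entry per archive; entries that start with
# "Error" indicate a failed push to that particular archive.
results = json.loads(body)["results"]
archived = [r for r in results if not r.startswith("Error")]
print(archived)
```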

Example 7

To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://0.0.0.0:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 385
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:23:53 GMT

    {
      "results": [
        "Error (The Perma.cc Archive): An API Key is required ",
        "http://archive.is/ukads",
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com",
        "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ",
        "http://www.webcitation.org/72rbKsX8B"
      ]
    }

Example 8

Because an API Key is required by Perma.cc, the HTTP request should be as follows:

$ curl -i http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key

Or use only Perma.cc:

$ curl -i http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key

Running as a Docker Container

$ docker image pull oduwsdl/archivenow

Different ways to run archivenow

$ docker container run -it --rm oduwsdl/archivenow -h

Accessible at 127.0.0.1:12345:

$ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0

Accessible at 127.0.0.1:22222:

$ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0

A demo animation is available at http://www.cs.odu.edu/~maturban/archivenow-6-archives.gif

To save the web page (http://www.cnn.com) in The Internet Archive

$ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com

Python Usage

>>> from archivenow import archivenow

Example 9

To save the web page (www.foxnews.com) in all configured archives:

>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com', 'http://archive.is/oAjuM', 'http://www.webcitation.org/6o9LcQoVV', 'Error (The Perma.cc Archive): An API KEY is required']

Example 10

To save the web page (www.foxnews.com) in Perma.cc:

>>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})
['https://perma.cc/8YYC-C7RM']

Example 11

To start the server from Python, do the following. The host and/or port number can be passed as arguments (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

    2017-02-09 15:02:37
    Running on http://127.0.0.1:12345
    (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Additional archives may be added by creating a handler file in the "handlers" directory.

For example, to add a new archive named "My Archive", create a file "ma_handler.py" and store it in the folder "handlers". The "ma" prefix becomes the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python, you would write:

archivenow.push("www.cnn.com","ma")

In the file "ma_handler.py", the class must be named "MA_handler". This class must have at least one function called "push", which takes one argument. See the existing handler files for examples of how to organize a newly configured archive handler.
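As a sketch, a hypothetical "ma_handler.py" might look like the following. The attributes and the exact return value are illustrative assumptions; consult the existing files in the "handlers" folder for the authoritative interface:

```python
# handlers/ma_handler.py -- a hypothetical handler for "My Archive".
# The class name ("MA_handler") must match the file name prefix, and the
# class must expose a push() function taking the URI to be archived.

class MA_handler(object):
    def __init__(self):
        # Setting enabled to False removes this archive from archivenow.
        self.enabled = True

    def push(self, uri):
        # A real handler would submit uri to the archive's save endpoint
        # and return the resulting archived URI (or an error message).
        return "http://myarchive.example/" + uri
```

With this file in place, archivenow.push("www.cnn.com", "ma") would dispatch to MA_handler.push().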

Removing an archive can be done by one of the following options:

  • Removing the archive handler file from the folder "handlers"
  • Renaming the archive handler file to a name that does not end with "_handler.py"
  • Setting the variable "enabled" to "False" inside the handler file

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the "same" resource.

For example, if you send a request to IA to capture www.cnn.com at 10:00 PM, IA will create a new copy (C) of this URI and will return C for all requests for this URI received until 10:02 PM. Archive.is behaves the same way but with a time gap of five minutes.
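The deduplication window described above can be illustrated with a small sketch. The two-minute gap for IA comes from the note above; the logic here is purely illustrative and not part of archivenow:

```python
from datetime import datetime, timedelta

# Minimum time gap (from the note above) before IA will create a new
# copy of the same URI; Archive.is uses five minutes instead.
IA_GAP = timedelta(minutes=2)

def would_create_new_copy(request_time, last_capture, gap=IA_GAP):
    # Requests that arrive within the gap are answered with the
    # existing copy rather than triggering a new capture.
    return request_time - last_capture >= gap

last_capture = datetime(2024, 1, 1, 22, 0)  # copy C created at 10:00 PM
print(would_create_new_copy(datetime(2024, 1, 1, 22, 1), last_capture))  # False: existing copy C returned
print(would_create_new_copy(datetime(2024, 1, 1, 22, 3), last_capture))  # True: a new copy is created
```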

Citing Project

@INPROCEEDINGS{archivenow-jcdl2018,
  AUTHOR    = {Mohamed Aturban and
               Mat Kelly and
               Sawood Alam and
               John A. Berlin and
               Michael L. Nelson and
               Michele C. Weigle},
  TITLE     = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},
  BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},
  SERIES    = {{JCDL} '18},
  PAGES     = {321--322},
  MONTH     = {June},
  YEAR      = {2018},
  ADDRESS   = {Fort Worth, Texas, USA},
  URL       = {https://doi.org/10.1145/3197026.3203880},
  DOI       = {10.1145/3197026.3203880}
}
