ArchiveTeam/wpull

Wget-compatible web downloader and crawler.

Wpull

Wpull is a Wget-compatible web downloader and crawler. It can serve as a remake, clone, replacement, or alternative for Wget.

Notable Features:

Written in Python: lightweight, modifiable, robust, & scriptable
Graceful stopping; on-disk database resume
PhantomJS & youtube-dl integration (experimental)

Install

Wpull uses Python 3.

Once Python is installed, download Wpull from PyPI using pip:

pip3 install wpull

For detailed installation instructions and potential caveats, please see https://wpull.readthedocs.io/en/master/install.html.

Example Commands

To download the About page of Google.com:

wpull google.com/about

To archive a website:

wpull billy.blogsite.example \
    --warc-file blogsite-billy \
    --no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --span-hosts-allow linked-pages,page-requisites \
    --escaped-fragment --strip-session-id \
    --sitemaps \
    --reject-regex "/login\.php" \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --delete-after --database blogsite-billy.db \
    --quiet --output-file blogsite-billy.log
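
Because Wpull is a command-line program, a crawl like the one above can also be launched from a script. The following is a minimal sketch, not part of Wpull itself: the function names build_archive_command and run_archive are hypothetical, it assumes the wpull executable is on PATH, and it passes only a subset of the options shown above.

```python
import shlex
import subprocess


def build_archive_command(host, name):
    """Return the wpull argument list for archiving `host`.

    `name` is used as the base name for the WARC, database, and log files.
    Every option below is taken from the example command above.
    """
    return [
        "wpull", host,
        "--warc-file", name,
        "--no-robots",
        "--wait", "0.5", "--random-wait",
        "--page-requisites", "--recursive", "--level", "inf",
        "--tries", "3",
        "--timeout", "60",
        "--database", f"{name}.db",
        "--quiet", "--output-file", f"{name}.log",
    ]


def run_archive(host, name):
    """Run the crawl and return the exit status (requires wpull on PATH)."""
    return subprocess.run(build_archive_command(host, name)).returncode


# Print the command for review without starting a crawl.
command = build_archive_command("billy.blogsite.example", "blogsite-billy")
print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting issues. Calling run_archive with the same two arguments would start the actual crawl.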

To see all options:

wpull --help

Documentation

Documentation is located at https://wpull.readthedocs.io/. Please have a look at it before using Wpull's advanced features.

Help

Need help? Please see our Help page, which contains frequently asked questions and support information.

The issue tracker is located at https://github.com/chfoo/wpull/issues.

Dev

Contributions and feedback are greatly appreciated.

Credits

Copyright 2013-2016 by Christopher Foo and others. Licensed under GPL v3.

This project contains third-party source code licensed under different terms:

wpull.backport.logging
wpull.thirdparty.robotexclusionrulesparser
wpull.thirdparty.dammit

We would like to acknowledge the authors of GNU Wget as Wpull uses algorithms from Wget.

Other ArchiveTeam repositories

grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

ArchiveBot

ArchiveBot, an IRC bot for archiving websites

warrior-dockerfile

A Dockerfile for the ArchiveTeam Warrior

parler-grab

Archiving Parler.

Ubuntu-Warrior

Scripts to build and boot warrior virtual machine containing Docker

wget-lua

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

IA.BAK

We back up a lot of stuff from around the web; now it's time to back up the Internet Archive, just in case.

seesaw-kit

Making a reusable toolkit for writing seesaw scripts

terroroftinytown

URLTeam's second generation of URL shortener archiving tools

reddit-grab

Grabbing everything from reddit.

NewsGrabber

Grabbing all news.

yahooanswers-grab

Saving all questions and answers from Yahoo! Answers.

tumblr-grab

Archiving all to-be-deleted NSFW tumblr blogs.

imgur-grab

Archiving imgur.

universal-tracker

A configurable, reusable tracker with dashboard

googleplus-grab

Archiving Google+.

terroroftinytown-client-grab

The Seesaw pipeline grab script for the URLTeam (terroroftinytown) project

ludios_wpull

wpull fork with fixes and faster parsing using html5-parser; used by grab-site; should go away when wpull is similarly improved

warrior-code2

Boot scripts for the ArchiveTeam Warrior 2

ftp-gov-grab

Archiving government FTPs.

warrior-code

WebArchiver

Decentralized web archiving

soundcloud-grab

500px-grab

Archiving https://500px.com/creativecommons

tinyback

A tiny web scraper

gamemaker-sandbox-items

Gamemaker Sandbox Tracker items

youtube-grab

Archiving all metadata from YouTube (everything except videos themselves due to size)

youtube-dislikes-grab

Archiving general youtube video metadata through innertube for dislikes removal.

youtube-dislikes-items

Managing items for youtube-dislikes-grab.

VideoBot

Specialised bot for periodical grabs and video/audio/etc. webpage scrapes.

urlteam-stuff

Urlteam website, code, ... also, PONIES

urls-grab

Archiving URLs (outlinks) from a variety of sources.

NewsGrabber-Warrior

google-sites-grab

Archiving Google Sites Classic.

flickr-grab

Grabbing Flickr images.

pastebin-grab

Archiving pastebin

youtube-items

Managing items for youtube-grab

wget-lua-forum-scripts

Downloading forum posts with Wget+Lua

greader-grab

http://www.archiveteam.org/index.php?title=Google_Reader

ftp-grab

Save all FTP sites!

mediafire-items

Managing items for mediafire-grab.

citeseerxpdf-grab

Grabbing all sources of CiteSeerX.

twitchtv-grab

Grabbing twitch.tv videos

mobileme-grab

Downloading MobileMe

warrior-preseed

Constructing a new warrior VM

ftp-nab

Thinger to download FTP sites

coursera-grab

Saving courses from Coursera.

tinyarchive

Software behind tracker.tinyarchive.org - Warning: Very hacky code

formspring-grab

Downloading Formspring

yahoomessages-grab

Archiving Yahoo Messages

splinder-grab

telegram-grab

Archiving public telegram messages.

ffnet-grab

archiveteam-megawarc-factory

Some scripts to process ArchiveTeam uploads

roblox-grab

Archiving roblox forums.

gamemaker-sandbox-grab

Grabbing sandbox.yoyogames.com

justintv-grab

Grabbing as much of justin.tv's archives as possible

sourceforge-grab

Archiving SourceForge.

grab-base-df

Base Dockerfile for warrior project grab scripts

wikis-grab

Grabbing all wikis.

liveleak-grab

Archiving liveleak.com

tencent-weibo-grab

Archiving Tencent Weibo (t.qq.com), 腾讯微博

imdb-grab

Archiving IMDb.

reddit-items

Managing items for reddit-grab.

flashdomains-grab

Copy of domains-grab for Flash sites.

ftp-queue

Create queue items for ftp-grab.

tumblr-grab-test

Archiving Tumblr blogs (an ArchiveTeam Warrior testing project)

heroku-buildpack-archiveteam-warrior

Heroku buildpack with the Archive Team Warrior

mobileme-index

An index of the MobileMe downloads

twitchtv-items

Managing twitch.tv items.

Universal-tracker-2

A better tracker with more features for ArchiveTeam

eroshare-grab

panoramio-grab

Grabbing everything from panoramio

blingee-grab

Saving all images and content from Blingee.

vidme-grab

Archiving all videos from vid.me.

livejournal-discovery

Discovering items for livejournal-grab.

github-grab

Archiving GitHub

mediafire-grab

Archiving mediafire.com URLs.

furaffinity-grab

Grabbing all images and other stuff from Fur Affinity.

yahoogroups-grab

Archiving Yahoo! Groups.

webs-grab

Archiving webs.com

puush-grab

twitchtv-discovery-grab

Discovering twitch.tv content

vlive-grab

Archiving vlive.tv.

NewsGrabber-Services

The services for NewsGrabber.

parler-items

Managing items for parler-grab.

furaffinity-items

standalone-readme-template

Readme instructions template for manually running pipeline grab scripts outside the warrior

ArchiveBot-agents

Site-specific agents that work with ArchiveBot

ua-grab

Archiving all of .ua.

googlecode-grab

Saving the full Google Code site!

pixiv-2-grab

Archiving pixiv2 images

miiverse-grab

Archiving miiverse

orkut-grab

Download all of Orkut

dpreview-grab

Archiving DPReview

bottle

A statistics monitor for the listerine download project @ Archive Team. Massive hack, no tests.

halo-new-grab

Archiving Halo (round 2)

googleplus-items

Managing items for googleplus-grab and googleplus2-grab.

furaffinity-discovery

scrapy-thingy

Archiving Thingiverse