1.  ArchiveBot

    <SketchCow> Coders, I have a question.
    <SketchCow> Or, a request, etc.
    <SketchCow> I spent some time with xmc discussing something we could
                do to make things easier around here.
    <SketchCow> What we came up with is a trigger for a bot, which can
                be triggered by people with ops.
    <SketchCow> You tell it a website. It crawls it. WARC. Uploads it to
                archive.org. Boom.
    <SketchCow> I can supply machine as needed.
    <SketchCow> Obviously there's some sanitation issues, and it is root
                all the way down or nothing.
    <SketchCow> I think that would help a lot for smaller sites
    <SketchCow> Sites where it's 100 pages or 1000 pages even, pretty
                simple.
    <SketchCow> And just being able to go "bot, get a sanity dump"


2.  More info

    ArchiveBot has two major backend components: the control node, which
    runs the IRC interface and bookkeeping programs, and the crawlers,
    which do all the Web crawling.  ArchiveBot users communicate with
    ArchiveBot by issuing commands in an IRC channel.

    User's guide: http://archivebot.readthedocs.org/en/latest/
    Control node installation guide: INSTALL.backend
    Crawler installation guide: INSTALL.pipeline


3.  Local use

    ArchiveBot was originally written as a set of separate programs for
    deployment on a server.  This means it has a poor distribution
    story.

    However, Ivan Kozik (@ivan) has taken the ArchiveBot pipeline,
    dashboard, ignores, and control system and created a package
    intended for personal use.  You can find it at
    https://github.com/ArchiveTeam/grab-site.


4.  License

    Copyright 2013 David Yip; made available under the MIT license.  See
    LICENSE for details.


5.  Acknowledgments

    Thanks to Alard (@alard), who added WARC generation and Lua
    scripting to GNU Wget.  Wget+lua was the first web crawler used by
    ArchiveBot.

    Thanks to Christopher Foo (@chfoo) for wpull, ArchiveBot's current
    web crawler.

    Thanks to Ivan Kozik (@ivan) for maintaining ignore patterns and
    tracking down performance problems at scale.

    Other thanks go to the following projects:

    * Celluloid <http://celluloid.io/>
    * Cinch <https://github.com/cinchrb/cinch/>
    * CouchDB <http://couchdb.apache.org/>
    * Ember.js <http://emberjs.com/>
    * Redis <http://redis.io/>
    * Seesaw <https://github.com/ArchiveTeam/seesaw-kit>


6.  Special thanks

    Dragonette, Barnaby Bright, Vienna Teng, NONONO.

    The memory hole of the Web has gone too far.
    Don't look down, never look away; ArchiveBot's like the wind.


vim:ts=2:sw=2:tw=72:et
* grab-site: The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
* wpull: Wget-compatible web downloader and crawler.
* warrior-dockerfile: A Dockerfile for the ArchiveTeam Warrior
* parler-grab: Archiving Parler.
* Ubuntu-Warrior: Scripts to build and boot warrior virtual machine containing Docker
* wget-lua: Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
* IA.BAK: We back up a lot of stuff from around the web; now it's time to back up the Internet Archive, just in case.
* seesaw-kit: Making a reusable toolkit for writing seesaw scripts
* terroroftinytown: URLTeam's second generation of URL shortener archiving tools
* reddit-grab: Grabbing everything from reddit.
* NewsGrabber: Grabbing all news.
* yahooanswers-grab: Saving all questions and answers from Yahoo! Answers.
* tumblr-grab: Archiving all to-be-deleted NSFW tumblr blogs.
* imgur-grab: Archiving imgur.
* universal-tracker: A configurable, reusable tracker with dashboard
* googleplus-grab: Archiving Google+.
* terroroftinytown-client-grab: The Seesaw pipeline grab script for the URLTeam (terroroftinytown) project
* ludios_wpull: wpull fork with fixes and faster parsing using html5-parser; used by grab-site; should go away when wpull is similarly improved
* warrior-code2: Boot scripts for the ArchiveTeam Warrior 2
* ftp-gov-grab: Archiving government FTPs.
* warrior-code
* WebArchiver: Decentralized web archiving
* soundcloud-grab
* 500px-grab: Archiving https://500px.com/creativecommons
* tinyback: A tiny web scraper
* gamemaker-sandbox-items: Gamemaker Sandbox Tracker items
* youtube-grab: Archiving all metadata from YouTube (everything except videos themselves due to size)
* youtube-dislikes-grab: Archiving general youtube video metadata through innertube for dislikes removal.
* youtube-dislikes-items: Managing items for youtube-dislikes-grab.
* VideoBot: Specialised bot for periodical grabs and video/audio/etc. webpage scrapes.
* urlteam-stuff: Urlteam website, code, ... also, PONIES
* urls-grab: Archiving URLs (outlinks) from a variety of sources.
* NewsGrabber-Warrior
* google-sites-grab: Archiving Google Sites Classic.
* flickr-grab: Grabbing Flickr images.
* pastebin-grab: Archiving pastebin
* youtube-items: Managing items for youtube-grab
* wget-lua-forum-scripts: Downloading forums posts with Wget+Lua
* greader-grab: http://www.archiveteam.org/index.php?title=Google_Reader
* ftp-grab: Save all FTP sites!
* mediafire-items: Managing items for mediafire-grab.
* citeseerxpdf-grab: Grabbing all sources of CiteSeerX.
* twitchtv-grab: Grabbing twitch.tv videos
* mobileme-grab: Downloading MobileMe
* warrior-preseed: Constructing a new warrior VM
* ftp-nab: Thinger to download FTP sites
* coursera-grab: Saving courses from Coursera.
* tinyarchive: Software behind tracker.tinyarchive.org - Warning: Very hacky code
* formspring-grab: Downloading Formspring
* yahoomessages-grab: Archiving Yahoo Messages
* splinder-grab
* telegram-grab: Archiving public telegram messages.
* ffnet-grab: Fanfictioning
* archiveteam-megawarc-factory: Some scripts to process ArchiveTeam uploads
* roblox-grab: Archiving roblox forums.
* gamemaker-sandbox-grab: Grabbing sandbox.yoyogames.com
* justintv-grab: Grabbing as much of justin.tv's archives as possible
* sourceforge-grab: Archiving SourceForge.
* grab-base-df: Base Dockerfile for warrior project grab scripts
* wikis-grab: Grabbing all wikis.
* liveleak-grab: Archiving liveleak.com
* tencent-weibo-grab: Archiving Tencent Weibo (t.qq.com), 腾讯微博
* imdb-grab: Archiving IMDb.
* reddit-items: Managing items for reddit-grab.
* flashdomains-grab: Copy of domains-grab for Flash sites.
* ftp-queue: Create queue items for ftp-grab.
* tumblr-grab-test: Archiving Tumblr blogs (an ArchiveTeam Warrior testing project)
* heroku-buildpack-archiveteam-warrior: Heroku buildpack with the Archive Team Warrior
* mobileme-index: An index of the MobileMe downloads
* twitchtv-items: Managing twitch.tv items.
* Universal-tracker-2: A better tracker with more features for ArchiveTeam
* eroshare-grab
* panoramio-grab: Grabbing everything from panoramio
* blingee-grab: Saving all images and content from Blingee.
* vidme-grab: Archiving all videos from vid.me.
* livejournal-discovery: Discovering items for livejournal-grab.
* github-grab: Archiving GitHub
* mediafire-grab: Archiving mediafire.com URLs.
* furaffinity-grab: Grabbing all images and other stuff from Fur Affinity.
* yahoogroups-grab: Archiving Yahoo! Groups.
* webs-grab: Archiving webs.com
* puush-grab
* twitchtv-discovery-grab: Discovering twitch.tv content
* vlive-grab: Archiving vlive.tv.
* NewsGrabber-Services: The services for NewsGrabber.
* parler-items: Managing items for parler-grab.
* furaffinity-items
* standalone-readme-template: Readme instructions template for manually running pipeline grab scripts outside the warrior
* ArchiveBot-agents: Site-specific agents that work with ArchiveBot
* ua-grab: Archiving all of .ua.
* googlecode-grab: Saving the full Google Code site!
* pixiv-2-grab: Archiving pixiv2 images
* miiverse-grab: Archiving miiverse
* orkut-grab: Download all of Orkut
* dpreview-grab: Archiving DPReview
* bottle: A statistics monitor for the listerine download project @ Archive Team. Massive hack, no tests.
* halo-new-grab: Archiving Halo (round 2)
* googleplus-items: Managing items for googleplus-grab and googleplus2-grab.
* furaffinity-discovery
* scrapy-thingy: Archiving Thingiverse