Discover ukwa/webarchive-fuse Open Source project

Prototype SOLR-powered web archive exploration UI.

shine

Tools for exploring the contents of web archive files.

webarchive-explorer

Run pdf2htmlEX in a Docker container.

docker-pdf2htmlex

w3act is an annotation and curation tool for building web archive collections

w3act

Repository of documentation about the open datasets published by the UK Web Archive.

opendata

HTML

monitrix

A monitoring system for Heritrix 3.

ukwa-pywb

Qaop – ZX Spectrum emulator

qaop

Shepherding our web archives from crawl to access.

ukwa-manage

The UKWA Heritrix3 custom modules and Docker builder.

ukwa-heritrix

A set of test files for web archiving.

webarchive-test-suite

Arc

docker-brozzler

Brozzler in a Docker container

Web Archiving Domain Crawl Analysis Scripts

crawl-analysis

A collection of Jupyter notebooks for working with web archive data, tools and APIs

webarchiving-notebooks

Add-On for Google Sheets to help those working with web archives.

ukwa-gsheets-utils

A RESTful API for rendering web pages in PhantomJS

webrender-phantomjs

A rapid web page analyser and archiver.

flashfreeze

Tracking the fortunes of our archived URLs.

halflife

Experiments in testable, scalable crawler architectures

wren

PHP

aho-corasick

Aho-Corasick in Java

Deployment configuration for all UKWA services stacks.

ukwa-services

A simple web-based interface to Memento holdings.

mementoweb-webclient

An acid test suite for crawlers.

acid-crawl

PHP

ukwa-documentation

Public documentation about the technical architecture of the UK Web Archive

WAT (web archive transform) metadata mining

webarchive-wat-mining

Run warcprox inside Docker

docker-warcprox

An NGINX proxy to control access to the Solr API.

solr-proxy

Hopefully offsetting some of the difficulties of writing to WARCs (multiple open files, size limits, etc.).

python-warcwriterpool

Serves our WARC files for playback, wherever they may lie.

ukwa-warc-server

ukwa

UKWA

This module builds our Waybacks in the various configurations we require.

waybacks

Web page rendering service based on Google's Puppeteer

webrender-puppeteer

Mavenised version of the JavaSWF codebase, in order to resolve the dependencies for Heritrix3.

javaswf

Using web scrapers to extract data from the archived web

glean

Highly experimental sketch of a hi-fidelity web archive 'player' for proxy-based access

ukwa-player

Apache Airflow with a few additional dependencies

docker-airflow

Hadoop running in a container.

docker-hadoop

UK Web Archive GitHub Homepage

ukwa.github.com

CSS

docker-grobid

GROBID (GeneRation Of BIbliographic Data) in a Docker container.

httpfs

Apache Hadoop HttpFS for cdh3

Experimenting with Blacklight

ukwa-blacklight

Ruby

ukwa-access-api

An application to wrap up APIs for accessing UKWA content.

Python clients for W3ACT and Heritrix3

python-w3act

Core Java libraries for Memento clients.

mementoweb-client-java

A simple site that uses GitHub pages to host resources for testing crawlers.

crawl-test-site

CSS

language-detection

Experimenting with https://code.google.com/p/language-detection/

PHP

file-archive-recordreader

File Archive RecordReader

Python wrapper around Hadoop's WebHDFS interface.

python-webhdfs

A containerised Dat server for experimental dataset hosting.

docker-hypercored

A standalone database for crawl events.

crawl-db

Luigi tasks for running Hadoop jobs and managing material held on HDFS

ukwa-tasks

katacoda-scenarios

Katacoda Scenarios

The dockerized ensemble of services that run most of the UKWA crawl and ingest processes.

ukwa-ingest-services

Scrapes the Hadoop status pages for Prometheus

hdfs-exporter