• Stars: 1
• Language: Shell
• License: Apache License 2.0
• Created almost 8 years ago
• Updated over 4 years ago

Repository Details

The dockerized ensemble of services that run most of the UKWA crawl and ingest processes.

More Repositories

1. webarchive-discovery (Java, 113 stars): WARC and ARC indexing and discovery tools.
2. shine (JavaScript, 42 stars): Prototype SOLR-powered web archive exploration UI.
3. webarchive-explorer (Java, 39 stars): Tools for exploring the contents of web archive files.
4. docker-pdf2htmlex (Python, 23 stars): Run pdf2htmlEX in a Docker container.
5. w3act (Java, 19 stars): w3act is an annotation and curation tool for building web archive collections
6. opendata (HTML, 14 stars): Repository of documentation about the open datasets published by the UK Web Archive.
7. monitrix (Java, 12 stars): A monitoring system for Heritrix 3.
8. ukwa-pywb (JavaScript, 11 stars)
9. qaop (Java, 10 stars): Qaop – ZX Spectrum emulator
10. ukwa-manage (Jupyter Notebook, 10 stars): Shepherding our web archives from crawl to access.
11. ukwa-heritrix (Java, 9 stars): The UKWA Heritrix3 custom modules and Docker builder.
12. webarchive-test-suite (Arc, 8 stars): A set of test files for web archiving.
13. docker-brozzler (Shell, 7 stars): Brozzler in a Docker container
14. crawl-analysis (Jupyter Notebook, 7 stars): Web Archiving Domain Crawl Analysis Scripts
15. webarchiving-notebooks (Jupyter Notebook, 7 stars): A collection of Jupyter notebooks for working with web archive data, tools and APIs
16. ukwa-gsheets-utils (JavaScript, 6 stars): Add-On for Google Sheets to help those working with web archives.
17. webrender-phantomjs (Python, 6 stars): A RESTful API for rendering web pages in PhantomJS
18. flashfreeze (Python, 6 stars): A rapid web page analyser and archiver.
19. halflife (Jupyter Notebook, 5 stars): Tracking the fortunes of our archived URLs.
20. wren (PHP, 5 stars): Experiments in testable, scaleable crawler architectures
21. aho-corasick (Java, 4 stars): Aho-Corasick in Java
22. ukwa-services (Python, 4 stars): Deployment configuration for all UKWA services stacks.
23. mementoweb-webclient (Java, 4 stars): A simple web-based interface to Memento holdings.
24. acid-crawl (PHP, 4 stars): An acid test suite for crawlers.
25. ukwa-documentation (Jupyter Notebook, 4 stars): Public documentation about the technical architecture of the UK Web Archive
26. webarchive-wat-mining (Shell, 3 stars): WAT (web archive transform) metadata mining
27. docker-warcprox (Python, 3 stars): Run warcprox inside Docker
28. solr-proxy (Dockerfile, 3 stars): An NGINX proxy to control access to the Solr API.
29. python-warcwriterpool (Python, 2 stars): Hopefully off-setting some of the difficulties writing to WARCs (multiple open files, size limits, etc.).
30. ukwa-warc-server (Python, 2 stars): Serves our WARC files for playback, wherever they may lie.
31. ukwa (Java, 2 stars): UKWA
32. waybacks (Java, 2 stars): This module builds our Waybacks in the various different configurations we require.
33. webrender-puppeteer (JavaScript, 2 stars): Web page rendering service based on Google's Puppeteer
34. webarchive-fuse (Java, 2 stars): Use FUSE-J to mount web archive files as filesystems.
35. javaswf (Java, 2 stars): Mavenised version of the JavaSWF codebase, in order to resolve the dependencies for Heritrix3.
36. glean (Python, 2 stars): Using web scrapers to extract data from the archived web
37. ukwa-player (JavaScript, 2 stars): Highly experimental sketch of a hi-fidelity web archive 'player' for proxy-based access
38. docker-airflow (Dockerfile, 1 star): Apache Airflow with a few additional dependencies
39. docker-hadoop (Dockerfile, 1 star): Hadoop running in a container.
40. ukwa.github.com (CSS, 1 star): UK Web Archive GitHub Homepage
41. docker-grobid (1 star): GROBID (GeneRation Of BIbliographic Data) in a Docker container.
42. httpfs (Java, 1 star): Apache Hadoop HttpFS for cdh3
43. ukwa-blacklight (Ruby, 1 star): Experimenting with Blacklight
44. ukwa-access-api (Python, 1 star): An application to wrap up APIs for accessing UKWA content.
45. python-w3act (Python, 1 star): Python clients for W3ACT and Heritrix3
46. mementoweb-client-java (Java, 1 star): Core Java libraries for Memento clients.
47. crawl-test-site (CSS, 1 star): A simple site that uses GitHub pages to host resources for testing crawlers.
48. language-detection (PHP, 1 star): Experimenting with https://code.google.com/p/language-detection/
49. file-archive-recordreader (Java, 1 star): File Archive RecordReader
50. python-webhdfs (Python, 1 star): Python wrapper around Hadoop's WebHDFS interface.
51. docker-hypercored (Dockerfile, 1 star): A containerised Dat server for experimental dataset hosting.
52. crawl-db (Python, 1 star): A standalone database for crawl events.
53. ukwa-tasks (Python, 1 star): Luigi tasks for running Hadoop jobs and managing material held on HDFS
54. katacoda-scenarios (Shell, 1 star): Katacoda Scenarios
55. hdfs-exporter (Python, 1 star): Scrapes the Hadoop status pages for Prometheus