  • Stars: 3
  • Rank 3,944,053 (Top 79 %)
  • Language: Shell
  • Created over 11 years ago
  • Updated almost 11 years ago

Repository Details

WAT (web archive transform) metadata mining

More Repositories

1

webarchive-discovery

WARC and ARC indexing and discovery tools.
Java
113
star
2

shine

Prototype SOLR-powered web archive exploration UI.
JavaScript
42
star
3

webarchive-explorer

Tools for exploring the contents of web archive files.
Java
39
star
4

docker-pdf2htmlex

Run pdf2htmlEX in a Docker container.
Python
23
star
5

w3act

w3act is an annotation and curation tool for building web archive collections
Java
19
star
6

opendata

Repository of documentation about the open datasets published by the UK Web Archive.
HTML
14
star
7

monitrix

A monitoring system for Heritrix 3.
Java
12
star
8

ukwa-pywb

JavaScript
11
star
9

qaop

Qaop – ZX Spectrum emulator
Java
10
star
10

ukwa-manage

Shepherding our web archives from crawl to access.
Jupyter Notebook
10
star
11

ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.
Java
9
star
12

webarchive-test-suite

A set of test files for web archiving.
Arc
8
star
13

docker-brozzler

Brozzler in a Docker container
Shell
7
star
14

crawl-analysis

Web Archiving Domain Crawl Analysis Scripts
Jupyter Notebook
7
star
15

webarchiving-notebooks

A collection of Jupyter notebooks for working with web archive data, tools and APIs
Jupyter Notebook
7
star
16

ukwa-gsheets-utils

Add-On for Google Sheets to help those working with web archives.
JavaScript
6
star
17

webrender-phantomjs

A RESTful API for rendering web pages in PhantomJS
Python
6
star
18

flashfreeze

A rapid web page analyser and archiver.
Python
6
star
19

halflife

Tracking the fortunes of our archived URLs.
Jupyter Notebook
5
star
20

wren

Experiments in testable, scalable crawler architectures
PHP
5
star
21

aho-corasick

Aho-Corasick in Java
Java
4
star
22

ukwa-services

Deployment configuration for all UKWA services stacks.
Python
4
star
23

mementoweb-webclient

A simple web-based interface to Memento holdings.
Java
4
star
24

acid-crawl

An acid test suite for crawlers.
PHP
4
star
25

ukwa-documentation

Public documentation about the technical architecture of the UK Web Archive
Jupyter Notebook
4
star
26

docker-warcprox

Run warcprox inside Docker
Python
3
star
27

solr-proxy

An NGINX proxy to control access to the Solr API.
Dockerfile
3
star
28

python-warcwriterpool

Hopefully off-setting some of the difficulties writing to WARCs (multiple open files, size limits, etc.).
Python
2
star
29

ukwa-warc-server

Serves our WARC files for playback, wherever they may lie.
Python
2
star
30

ukwa

UKWA
Java
2
star
31

waybacks

This module builds our Waybacks in the various different configurations we require.
Java
2
star
32

webrender-puppeteer

Web page rendering service based on Google's Puppeteer
JavaScript
2
star
33

webarchive-fuse

Use FUSE-J to mount web archive files as filesystems.
Java
2
star
34

javaswf

Mavenised version of the JavaSWF codebase, in order to resolve the dependencies for Heritrix3.
Java
2
star
35

glean

Using web scrapers to extract data from the archived web
Python
2
star
36

ukwa-player

Highly experimental sketch of a hi-fidelity web archive 'player' for proxy-based access
JavaScript
2
star
37

docker-airflow

Apache Airflow with a few additional dependencies
Dockerfile
1
star
38

docker-hadoop

Hadoop running in a container.
Dockerfile
1
star
39

ukwa.github.com

UK Web Archive GitHub Homepage
CSS
1
star
40

docker-grobid

GROBID (GeneRation Of BIbliographic Data) in a Docker container.
1
star
41

httpfs

Apache Hadoop HttpFS for cdh3
Java
1
star
42

ukwa-blacklight

Experimenting with Blacklight
Ruby
1
star
43

ukwa-access-api

An application to wrap up APIs for accessing UKWA content.
Python
1
star
44

python-w3act

Python clients for W3ACT and Heritrix3
Python
1
star
45

mementoweb-client-java

Core Java libraries for Memento clients.
Java
1
star
46

crawl-test-site

A simple site that uses GitHub pages to host resources for testing crawlers.
CSS
1
star
47

language-detection

Experimenting with https://code.google.com/p/language-detection/
PHP
1
star
48

file-archive-recordreader

File Archive RecordReader
Java
1
star
49

python-webhdfs

Python wrapper around Hadoop's WebHDFS interface.
Python
1
star
50

docker-hypercored

A containerised Dat server for experimental dataset hosting.
Dockerfile
1
star
51

crawl-db

A standalone database for crawl events.
Python
1
star
52

ukwa-tasks

Luigi tasks for running Hadoop jobs and managing material held on HDFS
Python
1
star
53

katacoda-scenarios

Katacoda Scenarios
Shell
1
star
54

ukwa-ingest-services

The dockerized ensemble of services that run most of the UKWA crawl and ingest processes.
Shell
1
star
55

hdfs-exporter

Scrapes the Hadoop status pages for Prometheus
Python
1
star