Helge Holzmann (@helgeho)

Top repositories

1

ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.
Scala
140
star
2

Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Scala
24
star
3

HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Java
9
star
4

internetarchive-transfer-scripts

Scripts to transfer archive.org collections, using https://github.com/jjjake/internetarchive
Python
9
star
5

HadoopWebGraph

A Hadoop input format to use graphs in WebGraph's BV format with Hadoop and Spark.
Java
7
star
6

Exspec

Don't write specs anymore, just save 'em while testing your code interactively. Specs will become a byproduct.
Ruby
5
star
7

IABooksOnArchiveSpark

Analyze digitized books from the Internet Archive remotely with ArchiveSpark
Scala
5
star
8

Micrawler

Create and cite micro Web archives with semantics as temporal representations of objects / entities / concepts on the Web
JavaScript
4
star
9

ArchiveSpark-server

A server application that provides a Web service API for ArchiveSpark to be used by third-party applications to integrate temporal Web archive data with a flexible, easy-to-use interface.
Scala
3
star
10

FEL4ArchiveSpark

Yahoo's Fast Entity Linker for ArchiveSpark
Scala
3
star
11

MHLonArchiveSpark

Work with Medical Heritage Library collections using ArchiveSpark
Scala
2
star
12

ArchiveSpark-Zeppelin-Docker

ArchiveSpark with Zeppelin as ready-to-use Docker image
Shell
2
star
13

ArchiveSpark-AUT-bridge

The compatibility layer between ArchiveSpark and The Archives Unleashed Toolkit (AUT)
Scala
2
star
14

WarcPartitioner

Partition (W)ARC Files by MIME Type and Year
Java
1
star
15

ArchiveSpark-docker

ArchiveSpark on Docker
Python
1
star
16

ArchiveSpark2Triples

Convert web archives to RDF triples with ArchiveSpark
Jupyter Notebook
1
star
17

MapReduceLecture

A lecture on MapReduce with example code
Java
1
star
18

ArchivePig

An Apache Pig framework that facilitates access to Web archives and enables easy data extraction and derivation.
Java
1
star