• Stars
    2
  • Language
    Shell
  • Created about 8 years ago
  • Updated about 8 years ago

Repository Details

ArchiveSpark with Zeppelin as ready-to-use Docker image

More Repositories

1

ArchiveSpark

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
Scala
143
star
2

Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Scala
24
star
3

HadoopConcatGz

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Java
9
star
4

internetarchive-transfer-scripts

Scripts to transfer archive.org collections, using https://github.com/jjjake/internetarchive
Python
9
star
5

HadoopWebGraph

A Hadoop input format to use graphs in WebGraph's BV format with Hadoop and Spark.
Java
7
star
6

Exspec

Don't write specs anymore, just save 'em while testing your code interactively. Specs will become a byproduct.
Ruby
5
star
7

IABooksOnArchiveSpark

Analyze digitized books from the Internet Archive remotely with ArchiveSpark
Scala
5
star
8

Micrawler

Create and cite micro Web archives with semantics as temporal representations of objects / entities / concepts on the Web
JavaScript
4
star
9

ArchiveSpark-server

A server application that provides a Web service API for ArchiveSpark to be used by third-party applications to integrate temporal Web archive data with a flexible, easy-to-use interface.
Scala
3
star
10

FEL4ArchiveSpark

Yahoo's Fast Entity Linker for ArchiveSpark
Scala
3
star
11

MHLonArchiveSpark

Work with Medical Heritage Library collections using ArchiveSpark
Scala
2
star
12

ArchiveSpark-AUT-bridge

The compatibility layer between ArchiveSpark and The Archives Unleashed Toolkit (AUT)
Scala
2
star
13

WarcPartitioner

Partition (W)ARC Files by MIME Type and Year
Java
1
star
14

ArchiveSpark-docker

ArchiveSpark on Docker
Python
1
star
15

ArchiveSpark2Triples

Convert web archives to RDF triples with ArchiveSpark
Jupyter Notebook
1
star
16

MapReduceLecture

A lecture on MapReduce with example code
Java
1
star
17

ArchivePig

An Apache Pig framework that facilitates access to Web archives and enables easy data extraction as well as derivation.
Java
1
star