• This repository was archived on 28 April 2018
• Stars: 161
• Rank: 233,470 (Top 5%)
• Language: Java
• Created over 11 years ago
• Updated almost 7 years ago

Repository Details

Warcbase is an open-source platform for managing and analyzing web archives

Warcbase

Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

Bad news: Warcbase is defunct and no longer under active development!

Good news: In June 2017, the University of Waterloo and York University were awarded a grant from the Andrew W. Mellon Foundation to build the next generation of tools that will make historical internet content accessible to scholars. Warcbase serves as the foundation for the ArchivesUnleashed Toolkit!

If you're interested in reading about the development of Warcbase, check out this article:

Jimmy Lin, Ian Milligan, Jeremy Wiebe, and Alice Zhou. Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives. ACM Journal on Computing and Cultural Heritage, 10(4), Article 22, 2017.

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work has been supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.

More Repositories

1. MapReduceAlgorithms (TeX, 620 stars): Data-Intensive Text Processing with MapReduce
2. guide (280 stars): The Student's Guide to @lintool
3. Cloud9 (Java, 236 stars): Cloud9 is a Hadoop toolkit for working with big data
4. twitter-tools (Java, 217 stars): Twitter Tools
5. Mr.LDA (Java, 149 stars): Scalable Topic Modeling using Variational Inference in MapReduce
6. bespin (Java, 81 stars): Reference implementations of data-intensive algorithms in MapReduce and Spark
7. Ivory (Java, 79 stars): A Hadoop toolkit for web-scale information retrieval research
8. bigdata-2018w (HTML, 71 stars): CS 451/651 431/631 Data-Intensive Distributed Computing (Winter 2018) at the University of Waterloo
9. bigcows (JavaScript, 59 stars): Scrapes citation statistics from Google Scholar
10. UMD-courses (HTML, 53 stars): Course homepages for courses that I've taught at the University of Maryland
11. IR-Reproducibility (Shell, 50 stars): Open-Source Information Retrieval Reproducibility Challenge
12. my-data-is-bigger-than-your-data (HTML, 39 stars): My data is bigger than your data!
13. SparkTutorial (38 stars): Spark Tutorial at the University of Maryland
14. bigdata-2016w (HTML, 38 stars): CS 489/698 Big Data Infrastructure (Winter 2016) at the University of Waterloo
15. wikiclean (Java, 37 stars): A Java Wikipedia markup to plain text converter
16. clueweb (Java, 26 stars): Hadoop tools for manipulating ClueWeb collections
17. chrome-archive-this-page (JavaScript, 23 stars): Internet Archive "Save a Page" Plug-In for Chrome
18. bigdata-2018f (HTML, 23 stars): CS 451/651 Data-Intensive Distributed Computing (Fall 2018) at the University of Waterloo
19. tools (Java, 22 stars): Lintools: tools by @lintool
20. art-science-empirical-cs-2022f (20 stars): The Art and Science of Empirical Computer Science (Fall 2022)
21. bigdata-2017w (HTML, 15 stars): CS 489/698 Big Data Infrastructure (Winter 2017) at the University of Waterloo
22. TweetAnalysisWithSpark (Scala, 15 stars): Tweet Analysis with Spark
23. robust04-analysis (Python, 12 stars): Meta-Analysis of Robust04 Papers (Yang et al., SIGIR 2019)
24. JScene (JavaScript, 12 stars): A proof-of-concept in-browser JavaScript-based search engine
25. JASS (C, 12 stars): Anytime Ranking for Impact-Ordered Indexes
26. GrimmerSenatePressReleases (Python, 10 stars): Grimmer's Senate Press Releases
27. Enron2mbox (Python, 10 stars): Converting the Enron email collection to mbox format
28. OptTrees (C, 9 stars): Source code for: Nima Asadi, Jimmy Lin, and Arjen P. de Vries. Runtime Optimizations for Tree-Based Machine Learning Models. IEEE Transactions on Knowledge and Data Engineering, 26(9):2281-2292, 2014.
29. non-blind-review (6 stars): My proposal for non-blind reviewing at *ACL
30. art-science-empirical-cs-2023f (6 stars): The Art and Science of Empirical Computer Science (Fall 2023)
31. IR-Reproducibility2 (Shell, 5 stars): The Replicability of IR Replicability Experiments
32. UROC-projects (5 stars): Undergraduate Research Opportunities Conference sponsored by the University of Waterloo
33. ClueWeb09-TREC-LTR (5 stars): learning-to-rank dataset extracted from ClueWeb09 using TREC judgments
34. Cassovary-vs-GraphJet (5 stars): Performance comparison between Cassovary and GraphJet
35. bespin-data (Python, 4 stars): Datasets for Bespin
36. Tweets2013-stats (4 stars)
37. robust04-analysis-papers (4 stars)
38. AnseriniMaven (3 stars): Maven repo for some Anserini dependencies.
39. nyt-covid-map (HTML, 3 stars)
40. c-bfscan (C, 3 stars): Implementations of brute force scans for document retrieval in C
41. MSMARCO-Document-Ranking-Archive.test (CSS, 2 stars)
42. GiraphTutorial (2 stars): Giraph Tutorial
43. MSMARCO-Document-Ranking-Archive (Python, 2 stars)
44. Zambezi (C, 2 stars): Real-time indexer and search engine
45. chrome-archive-this-page-crx (2 stars): Packaged CRX distribution for Internet Archive "Save a Page" Plug-In
46. NSF-projects (CSS, 2 stars): NSF project homepages
47. bfscan (Java, 2 stars): Document retrieval using brute force scans
48. wiki-tools (Java, 2 stars): Collection of tools for working with Wikipedia
49. msmarco-docker (Dockerfile, 2 stars)
50. tools-javadoc (HTML, 2 stars)
51. hadoop1-data (1 star)
52. IR-Reproducibility-exp (MAXScript, 1 star): Experimental runs from the Open-Source Information Retrieval Reproducibility Challenge.
53. TweetTap (Java, 1 star): Simple program to tap the Twitter sample stream
54. chrome-scholar-search-extension (JavaScript, 1 star): Google Scholar Search Extension for Chrome
55. trec-mb-vis (JavaScript, 1 star): Visualization of TREC Microblog Track relevance judgments
56. clueweb09en01-webgraph (1 star): Webgraph for ClueWeb09 Category B
57. cs-big-cows (Python, 1 star): List of people with great achievements in Computer Science