• This repository has been archived on 10/Jul/2019
  • Stars
    282
  • Rank 142,281 (Top 3 %)
  • Language
    Java
  • License
    Other
  • Created almost 14 years ago
  • Updated about 6 years ago

Repository Details

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Behemoth is an open source platform for large scale document processing based on Apache Hadoop.

It consists of a simple annotation-based implementation of a document and a number of modules operating on these documents. Behemoth aims to simplify the large-scale deployment of document analysers and to provide reusable modules for:

  • ingesting from common data sources (WARC, Nutch, etc.)
  • text processing (Tika, UIMA, GATE, language identification)
  • generating output for external tools (Solr, Mahout)

Its modular architecture simplifies the development of custom annotators based on MapReduce.
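To make the idea of an annotation-based document concrete, here is a minimal sketch in plain Java. It is not Behemoth's actual API: the class names `AnnotatedDocSketch`, `Doc` and `Annotation`, the field names, and the `annotateTokens` helper are all hypothetical and chosen only for illustration. The sketch assumes that a document carries its text plus a list of typed character spans, and that an annotator is a function that adds spans to a document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are NOT Behemoth's classes.
public class AnnotatedDocSketch {

    /** A typed span over the document text, with optional key/value features. */
    static final class Annotation {
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, String> features = new HashMap<>();

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** A document: a source URL, its text, and the annotations added so far. */
    static final class Doc {
        final String url;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Doc(String url, String text) {
            this.url = url;
            this.text = text;
        }

        /** Returns the text covered by the given annotation. */
        String covered(Annotation a) {
            return text.substring(a.start, a.end);
        }
    }

    /** Toy annotator: adds one "Token" annotation per whitespace-delimited run. */
    static Doc annotateTokens(Doc doc) {
        int i = 0;
        int n = doc.text.length();
        while (i < n) {
            // Skip leading whitespace.
            while (i < n && Character.isWhitespace(doc.text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the non-whitespace run.
            while (i < n && !Character.isWhitespace(doc.text.charAt(i))) {
                i++;
            }
            if (i > start) {
                doc.annotations.add(new Annotation("Token", start, i));
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = annotateTokens(new Doc("http://example.com/a", "Behemoth runs on Hadoop"));
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") " + doc.covered(a));
        }
    }
}
```

In a MapReduce setting, a function such as `annotateTokens` would plausibly be called once per document inside the map step, so each document is processed independently and the work spreads across the cluster. The wiring to Hadoop is left out here on purpose.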

Note that Behemoth does not implement any NLP or Machine Learning components as such but serves as a 'large-scale glueware' for existing resources. Being Hadoop-based, it benefits from Hadoop's features, namely scalability and fault tolerance, and most notably from the backing of a thriving open source community.

Wiki: https://github.com/DigitalPebble/behemoth/wiki

Mailing list: http://groups.google.com/group/digitalpebble

StackOverflow: http://stackoverflow.com/questions/tagged/behemoth

More Repositories

  • storm-crawler (HTML, 842 stars)
    A scalable, mature and versatile web crawler based on Apache Storm
  • TextClassification (Java, 48 stars)
    A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
  • textclassification-examples (Java, 10 stars)
    Use cases for DigitalPebble's TextClassification API
  • stormcrawlerfight (Shell, 9 stars)
    Crawl configurations for benchmarking / testing StormCrawler
  • stormcrawler-docker (Dockerfile, 7 stars)
    Resources for running StormCrawler with Docker services
  • ansible-storm (7 stars)
    Ansible playbook for deploying a Storm cluster
  • TextClassificationPlugin (Java, 5 stars)
    GATE Processing Resource wrapping DigitalPebble's TextClassification API
  • ngrams-api (Java, 4 stars)
    Java API for querying a N-Grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format
  • behemoth-commoncrawl (Java, 4 stars)
    Support for old (pre 2013) CommonCrawl dataset in Behemoth
  • NutchFight (Java, 4 stars)
    Resources for comparison between 1.8 and 2.x of Apache Nutch
  • tescobank (Java, 4 stars)
    Setup for crawling tescobank with SC
  • sc-warc (2 stars)
    WARC resources for StormCrawler
  • behemoth-textclassification (Java, 1 star)
    Module for classifying Behemoth documents with a model from our Text Classification API
  • crawlurlfrontier (FLUX, 1 star)
    Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
  • behemoth-elasticsearch (Java, 1 star)
    ElasticSearch module for Behemoth
  • urlfrontier-client (Rust, 1 star)
    URLFrontier client written in Rust (mostly as a way of learning Rust)