DigitalPebble/behemoth

This repository has been archived on 10/Jul/2019
Stars: 282
Rank: 146,549 (top 3%)
Language: Java
License: Other
Created over 14 years ago
Updated over 6 years ago

Behemoth is an open source platform for large scale document processing based on Apache Hadoop.

It consists of a simple annotation-based implementation of a document and a number of modules operating on these documents. One of Behemoth's main goals is to simplify the deployment of document analysers on a large scale; it also provides reusable modules for:

ingesting from common data sources (WARC, Nutch, etc.)
text processing (Tika, UIMA, GATE, language identification)
generating output for external tools (Solr, Mahout)

Its modular architecture simplifies the development of custom annotators based on MapReduce.
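To make the idea of an annotation-based document concrete, here is a minimal sketch in plain Java. It is an illustration under stated assumptions, not Behemoth's actual API: the names `AnnotatorSketch`, `Doc`, `Annotation` and `annotate` are all hypothetical, and Hadoop is deliberately left out so the example runs on its own. The `annotate` method stands in for the per-document work that a map task would perform.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: this is NOT Behemoth's real API.
class AnnotatorSketch {

    // A typed span over the document text, covering offsets [start, end).
    static final class Annotation {
        final String type;
        final int start;
        final int end;

        Annotation(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // A document: its source URL, its text, and stand-off annotations.
    static final class Doc {
        final String url;
        final String text;
        final List<Annotation> annotations = new ArrayList<>();

        Doc(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Stand-in for the body of a map task: takes one document and
    // adds a "Token" annotation for each whitespace-delimited run.
    static Doc annotate(Doc doc) {
        String text = doc.text;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                doc.annotations.add(new Annotation("Token", start, i));
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        Doc doc = annotate(new Doc("http://example.com/a", "Hello big world"));
        for (Annotation a : doc.annotations) {
            System.out.println(a.type + " [" + a.start + "," + a.end + ") "
                    + doc.text.substring(a.start, a.end));
        }
    }
}
```

Because each document is processed independently, a function of this shape can be applied to many documents in parallel, which is what makes the MapReduce model a natural fit for annotators.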

Note that Behemoth does not implement any NLP or Machine Learning components as such but serves as 'large-scale glueware' for existing resources. Being Hadoop-based, it benefits from Hadoop's features, namely scalability and fault-tolerance, and most notably from the backing of a thriving open source community.

Wiki: https://github.com/DigitalPebble/behemoth/wiki

Mailing list: http://groups.google.com/group/digitalpebble

StackOverflow: http://stackoverflow.com/questions/tagged/behemoth

storm-crawler

A scalable, mature and versatile web crawler based on Apache Storm

TextClassification

A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent from the ML implementations used and can be used as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

textclassification-examples

Use cases for DigitalPebble's TextClassification API

stormcrawlerfight

Crawl configurations for benchmarking / testing StormCrawler

stormcrawler-docker

Resources for running StormCrawler with Docker services

ansible-storm

Ansible playbook for deploying a Storm cluster

TextClassificationPlugin

GATE Processing Resource wrapping DigitalPebble's TextClassification API

ngrams-api

Java API for querying an N-grams corpus. Uses Lucene for indexing and searching data in the Google Web-1T format

behemoth-commoncrawl

Support for the old (pre-2013) CommonCrawl dataset in Behemoth

tescobank

Setup for crawling tescobank with StormCrawler

NutchFight

Resources for comparing versions 1.8 and 2.x of Apache Nutch

sc-warc

WARC resources for StormCrawler

behemoth-textclassification

Module for classifying Behemoth documents with a model from our Text Classification API

crawlurlfrontier

Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

behemoth-elasticsearch

ElasticSearch module for Behemoth

urlfrontier-client

URLFrontier client written in Rust (mostly as a way of learning Rust)