• This repository has been archived on 10/Jul/2019
  • Stars
    4
  • Rank 3,304,323 (Top 66 %)
  • Language
    Java
  • Created over 12 years ago
  • Updated over 9 years ago

Repository Details

Support for the old (pre-2013) CommonCrawl dataset in Behemoth

More Repositories

1

storm-crawler

A scalable, mature and versatile web crawler based on Apache Storm
HTML
842 stars
2

behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
Java
282 stars
3

TextClassification

A text classification API in Java, originally developed by DigitalPebble Ltd. The API is independent of the underlying ML implementations and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.
Java
48 stars
4

textclassification-examples

Use cases for DigitalPebble's TextClassification API
Java
10 stars
5

stormcrawlerfight

Crawl configurations for benchmarking / testing StormCrawler
Shell
9 stars
6

stormcrawler-docker

Resources for running StormCrawler with Docker services
Dockerfile
8 stars
7

ansible-storm

Ansible playbook for deploying a Storm cluster
7 stars
8

TextClassificationPlugin

GATE Processing Resource wrapping DigitalPebble's TextClassification API
Java
5 stars
9

ngrams-api

Java API for querying an N-Grams corpus. Uses Lucene for indexing and searching data in the Google Web-1T format.
Java
4 stars
10

tescobank

Setup for crawling tescobank with StormCrawler (SC)
Java
4 stars
11

NutchFight

Resources for comparing versions 1.8 and 2.x of Apache Nutch
Java
4 stars
12

sc-warc

WARC resources for StormCrawler
2 stars
13

behemoth-textclassification

Module for classifying Behemoth documents with a model from our Text Classification API
Java
1 star
14

crawlurlfrontier

Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.
FLUX
1 star
15

behemoth-elasticsearch

ElasticSearch module for Behemoth
Java
1 star
16

urlfrontier-client

URLFrontier client written in Rust (mostly as a way of learning Rust)
Rust
1 star