storm-crawler
A scalable, mature and versatile web crawler based on Apache Storm.

behemoth
Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop.

TextClassification
A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent of the ML implementations used and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

textclassification-examples
Use cases for DigitalPebble's TextClassification API.

stormcrawlerfight
Crawl configurations for benchmarking and testing StormCrawler.

stormcrawler-docker
Resources for running StormCrawler with Docker services.

ansible-storm
Ansible playbook for deploying a Storm cluster.

TextClassificationPlugin
GATE Processing Resource wrapping DigitalPebble's TextClassification API.

ngrams-api
Java API for querying an N-grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format.

behemoth-commoncrawl
Support for the old (pre-2013) CommonCrawl dataset in Behemoth.

tescobank
Setup for crawling tescobank with StormCrawler.

NutchFight
Resources for comparing versions 1.8 and 2.x of Apache Nutch.

behemoth-textclassification
Module for classifying Behemoth documents with a model from our Text Classification API.

crawlurlfrontier
Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

behemoth-elasticsearch
ElasticSearch module for Behemoth.

urlfrontier-client
URLFrontier client written in Rust (mostly as a way of learning Rust).