• This repository has been archived on 08/Jan/2022
  • Stars: 4
  • Rank 3,304,323 (Top 66 %)
  • Language: Java
  • License: Apache License 2.0
  • Created over 9 years ago
  • Updated about 9 years ago

Repository Details

Setup for crawling tescobank with StormCrawler (SC)

More Repositories

1. storm-crawler (HTML, 842 stars)
   A scalable, mature and versatile web crawler based on Apache Storm

2. behemoth (Java, 282 stars)
   Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop.

3. TextClassification (Java, 48 stars)
   A Text Classification API in Java originally developed by DigitalPebble Ltd. The API is independent of the ML implementations used and can serve as a front end to various ML algorithms. libSVM and liblinear are currently embedded.

4. textclassification-examples (Java, 10 stars)
   Use cases for DigitalPebble's TextClassification API

5. stormcrawlerfight (Shell, 9 stars)
   Crawl configurations for benchmarking / testing StormCrawler

6. stormcrawler-docker (Dockerfile, 8 stars)
   Resources for running StormCrawler with Docker services

7. ansible-storm (7 stars)
   Ansible playbook for deploying a Storm cluster

8. TextClassificationPlugin (Java, 5 stars)
   GATE Processing Resource wrapping DigitalPebble's TextClassification API

9. ngrams-api (Java, 4 stars)
   Java API for querying an N-Grams corpus. Uses Lucene for searching and indexing from the Google Web-1T format

10. behemoth-commoncrawl (Java, 4 stars)
    Support for the old (pre-2013) CommonCrawl dataset in Behemoth

11. NutchFight (Java, 4 stars)
    Resources for comparison between versions 1.8 and 2.x of Apache Nutch

12. sc-warc (2 stars)
    WARC resources for StormCrawler

13. behemoth-textclassification (Java, 1 star)
    Module for classifying Behemoth documents with a model from our Text Classification API

14. crawlurlfrontier (FLUX, 1 star)
    Crawl config used to test URL Frontier on a large scale and produce WARCs for CommonCrawl.

15. behemoth-elasticsearch (Java, 1 star)
    ElasticSearch module for Behemoth

16. urlfrontier-client (Rust, 1 star)
    URLFrontier client written in Rust (mostly as a way of learning Rust)