ACHE is a web crawler for domain-specific search.

ACHE Focused Crawler

ACHE is a focused web crawler. It collects web pages that satisfy some specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can range from a simple regular expression (that matches every page containing a specific word, for example) to a machine-learning-based classification model. ACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.

ACHE supports many features, such as:

  • Regular crawling of a fixed list of web sites
  • Discovery and crawling of new relevant web sites through automatic link prioritization
  • Configuration of different types of page classifiers (machine learning, regex, etc.)
  • Continuous re-crawling of sitemaps to discover new pages
  • Indexing of crawled pages using Elasticsearch
  • Web interface for searching crawled pages in real-time
  • REST API and web-based user interface for crawler monitoring
  • Crawling of hidden services using TOR proxies

License

Starting from version 0.11.0, ACHE is licensed under the Apache License 2.0. Previous versions were licensed under the GNU GPL.

Documentation

More information is available in the project's documentation.

Installation

You can either build ACHE from the source code, download the executable binary using conda, or use Docker to build an image and run ACHE in a container.

Build from source with Gradle

Prerequisite: You will need to install a recent version of Java (JDK 8 or later).

To build ACHE from source, you can run the following commands in your terminal:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
./gradlew installDist

which will generate an installation package under ache/build/install/. You can then make the ache command available in your terminal by adding the ACHE binaries to the PATH environment variable:

export ACHE_HOME="{path-to-cloned-ache-repository}/ache/build/install/ache"
export PATH="$ACHE_HOME/bin:$PATH"

Running using Docker

Prerequisite: You will need to install a recent version of Docker. See https://docs.docker.com/engine/installation/ for details on how to install Docker for your platform.

We publish pre-built Docker images on Docker Hub for each released version. You can run the latest image using:

docker run -p 8080:8080 vidanyu/ache:latest

Alternatively, you can build the image yourself and run it:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
docker build -t ache .
docker run -p 8080:8080 ache

The Dockerfile exposes two data volumes so that you can mount a directory with your configuration files (at /config) and preserve the crawler stored data (at /data) after the container stops.
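For example, assuming your ache.yml lives in a local config/ directory, both volumes could be mounted like this (the host-side paths are placeholders for illustration):

```shell
# Mount a local config directory at /config and persist crawler data at /data
docker run -p 8080:8080 \
  -v "$(pwd)/config:/config" \
  -v "$(pwd)/data:/data" \
  vidanyu/ache:latest
```

With this setup, everything the crawler writes under /data survives on the host after the container stops.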

Download with Conda

Prerequisite: You need to have the Conda package manager installed on your system.

If you use Conda, you can install ache from Anaconda Cloud by running:

conda install -c vida-nyu ache

NOTE: Only tagged releases are published to Anaconda Cloud, so the version available through Conda may not be the most recent. If you want to try the latest version, please clone the repository and build from source, or use the Docker image.

Running ACHE

Before starting a crawl, you need to create a configuration file named ache.yml. We provide some configuration samples in the repository's config directory that can help you to get started.

You will also need a page classifier configuration file named pageclassifier.yml. For details on how to configure a page classifier, refer to the page classifiers documentation.
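As an illustration, a simple regex-based classifier configuration might look like the following. This is a sketch assuming a title_regex classifier type; check the page classifiers documentation for the exact schema and the full list of available classifier types:

```yaml
# Hypothetical pageclassifier.yml: mark a page relevant if its title
# matches the regular expression below
type: title_regex
parameters:
  regular_expression: ".*(focused|crawler).*"
```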

After you have configured a classifier, the last thing you will need is a seed file, i.e., a plain-text file containing one URL per line. The crawler will use these URLs to bootstrap the crawl.
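For instance, a minimal seed file can be created directly from the shell (the URLs below are placeholders):

```shell
# Write two placeholder seed URLs, one per line
cat > sample.seeds <<'EOF'
http://example.com/
http://example.org/
EOF

# Each line is one seed URL, so the line count equals the number of seeds
wc -l < sample.seeds
```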

Finally, you can start the crawler using the following command:

ache startCrawl -o <data-output-path> -c <config-path> -s <seed-file> -m <model-path>

where,

  • <config-path> is the path to the config directory that contains ache.yml.
  • <seed-file> is the seed file that contains the seed URLs.
  • <model-path> is the path to the model directory that contains the file pageclassifier.yml.
  • <data-output-path> is the path to the data output directory.

Example of running ACHE using the sample pre-trained page classifier model and the sample seeds file available in the repository:

ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model

The crawler will run and print its logs to the console. Hit Ctrl+C at any time to stop it (it may take some time to shut down). For long crawls, you should run ACHE in the background using a tool like nohup.
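For example, a long-running crawl could be started in the background like this (paths reuse the sample configuration above, and crawler.log is a placeholder log file name):

```shell
# Detach the crawler from the terminal and capture its logs to a file
nohup ache startCrawl -o output -c config/sample_config \
  -s config/sample.seeds -m config/sample_model > crawler.log 2>&1 &
```

The trailing & backgrounds the process, and nohup keeps it running after you log out; tail -f crawler.log lets you follow progress.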

Data Formats

ACHE can output data in multiple formats. The data formats currently available are:

  • FILES (default) - raw content and metadata is stored in rolling compressed files of fixed size.
  • ELASTICSEARCH - raw content and metadata is indexed in an Elasticsearch index.
  • KAFKA - pushes raw content and metadata to an Apache Kafka topic.
  • WARC - stores data using the standard web-archiving format used by the Internet Archive and Common Crawl.
  • FILESYSTEM_HTML - only raw page content is stored in plain text files.
  • FILESYSTEM_JSON - raw content and metadata is stored using JSON format in files.
  • FILESYSTEM_CBOR - raw content and some metadata is stored using CBOR format in files.

For more details on how to configure data formats, see the data formats documentation page.
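For example, the output format is selected in ache.yml under the target storage settings. The snippet below is a hypothetical sketch (the key names are an assumption; verify them against the data formats documentation before use):

```yaml
# Assumed ache.yml fragment: switch the output format to Elasticsearch
target_storage:
  data_format:
    type: ELASTICSEARCH
```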

Bug Reports and Questions

We welcome user feedback. Please submit any suggestions, questions or bug reports using the Github issue tracker.

We also have a chat room on Gitter.

Contributing

Code contributions are welcome. We use a code style derived from the Google Style Guide, but with 4 spaces for indentation. An Eclipse Formatter configuration file is available in the repository.
