• Stars: 404
• Rank: 106,897 (Top 3%)
• Language: Java
• License: Apache License 2.0
• Created: over 8 years ago
• Updated: over 1 year ago


Repository Details

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

Sparkler


A web crawler is a bot program that fetches resources from the web for the sake of building applications like search engines, knowledge bases, etc. Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and pf4j. Sparkler is an extensible, highly scalable, and high-performance web crawler that is an evolution of Apache Nutch and runs on Apache Spark Cluster.

NOTE:

Sparkler is being proposed to the Apache Incubator. Review the proposal document and provide your suggestions.

Notable features of Sparkler:

  • Higher performance and fault tolerance: The crawl pipeline has been redesigned to take advantage of the caching and fault tolerance capability of Apache Spark.
  • Supports complex and near real-time analytics: The internal data structure is an indexed store powered by Apache Lucene, capable of answering complex queries in near real time. Apache Solr (standalone mode for a quick start, cloud mode to scale horizontally) is used to expose the crawler analytics via an HTTP API. These analytics can be visualized with intuitive charts in the admin dashboard (coming soon).
  • Streams out the content in real-time: Optionally, Apache Kafka can be configured to retrieve the output content as and when the content becomes available.
  • JavaScript rendering: Executes the JavaScript code in web pages to produce the final state of the page. The setup is easy and painless, and it scales by distributing the work across Spark. Sessions and cookies are preserved for subsequent requests made to a host.
  • Extensible plugin framework: Sparkler is designed to be modular. It supports plugins to extend and customize the runtime behaviour.
  • Universal parser: Apache Tika, the most popular content detection and analysis toolkit, which can deal with thousands of file formats, is used to discover links to outgoing web resources and to perform analysis on fetched resources.
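As one concrete illustration of the analytics point above: since the crawl state is an indexed Solr store, it can be queried over HTTP like any Solr core. A minimal sketch, assuming a local standalone Solr on the default port and a core named crawldb (both are assumptions; adjust for your install):

```shell
# Assumed core name; verify against your Sparkler/Solr setup.
SOLR_CORE=crawldb
# Print the query command; run the curl line directly once Solr is up.
echo "curl 'http://localhost:8983/solr/${SOLR_CORE}/select?q=*:*&rows=5&wt=json'"
```

The standard /select handler takes q, rows, and wt parameters, so the same pattern extends to more specific analytics queries.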

Quick Start: Running your first crawl job in minutes

To use Sparkler, install Docker and run the commands below:

# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main
# Step 1. Create a volume for elastic
docker volume create elastic
# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'
# Step 3. Start the crawl job
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2     # crawl id "myid", top 100 URLs, 2 iterations

Running Sparkler with a seed-URL file:

1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
       a. emacs sparkler/bin/seed-urls.txt
       b. paste your URLs
       c. Ctrl+x Ctrl+s to save
       d. Ctrl+x Ctrl+c to quit the editor [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]

* Note: You can also use the Vim or Nano editors, or simply run: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt

3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.

To crawl until no new URLs remain, use -i -1. Example: /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
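The file-based steps above can be sketched as a single shell session. The printf line is equivalent to the editor steps; the sparkler.sh lines (shown commented) assume a Sparkler checkout and are copied from the commands above:

```shell
# Create seed-urls.txt with two seed URLs, one per line
printf 'http://example1.com\nhttp://example2.com\n' > seed-urls.txt

# Then, from the sparkler/bin directory:
# bash sparkler.sh inject -id 1 -sf seed-urls.txt
# bash sparkler.sh crawl -id 1 -i -1    # -i -1: iterate until no new URLs
```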

Making Contributions:

Contact Us

Any questions or suggestions are welcome on our mailing list: [email protected]. Alternatively, you may use the Slack channel to get help: http://irds.usc.edu/sparkler/#slack

More Repositories

1. supervising-ui (Python, 78 stars): Web UI for labelling datasets for supervised learning.
2. Image-Similarity-Deep-Ranking (Python, 36 stars): Deep Ranking based image similarity, to be developed as a plugin on ImageSpace. https://users.eecs.northwestern.edu/~jwa368/pdfs/deep_ranking.pdf
3. SentimentAnalysisParser (31 stars): Combines Apache OpenNLP and Apache Tika and provides facilities for automatically deriving sentiment from text.
4. dl4j-kerasimport-examples (Java, 27 stars): Deeplearning4j examples for importing and making use of models trained in Keras.
5. NLTKRest (Java, 24 stars): A REST server endpoint built using Flask and Python.
6. tika-dockers (21 stars): A suite of Machine Learning / Deep Learning Dockerfiles to allow Apache Tika to extract objects and to produce textual captions for images and video.
7. polar.usc.edu (HTML, 15 stars): Polar USC activities related to the NSF Polar CyberInfrastructure program at the University of Southern California.
8. AgePredictor (Java, 15 stars): Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum.
9. polar-deep-insights (JavaScript, 10 stars): Conceptual, temporal, and spatial analysis of the TREC polar dataset.
10. hadoop-pot (Java, 10 stars): A scalable Apache Hadoop-based implementation of the Pooled Time Series video similarity algorithm, based on the M. Ryoo et al. CVPR 2015 paper.
11. uscdatascience.github.io (HTML, 9 stars): USC Information Retrieval and Data Science Group.
12. parser-indexer-py (Jupyter Notebook, 9 stars): Python tools for parsing documents and building the inverted index with enriched metadata. Java version with slightly different features: https://github.com/USCDataScience/parser-indexer
13. video-recognition (Python, 8 stars)
14. TextREST.jl (Julia, 7 stars): Language detection REST server using MIT Lincoln Lab's Text.jl library.
15. cmu-fg-bg-similarity (C++, 6 stars): CMU foreground/background similarity server from DARPA MEMEX.
16. img2text (Python, 6 stars): Models and associated helper code for the GSoC 2017 project Tensorflow Image to Text in Apache Tika.
17. counterfeit-electronics-tesseract (Java, 6 stars): Training Tesseract to better extract serial numbers from images of electronic items.
18. svm-classifier-memex (Java, 6 stars)
19. ufo.usc.edu (HTML, 6 stars): Collection of projects from IRDS students studying unidentified flying objects.
20. pdi-topics (HTML, 5 stars): LDA topic modeling for Polar Data Insights.
21. deepsentirank (Python, 5 stars): Deep-learning-based sentiment ranking for multimedia.
22. file-content-analyzer (Python, 5 stars): A set of Python modules to perform Byte Frequency Analysis, Byte Frequency Correlation, Cross Correlation, and FHT analysis on files.
23. PersonaExtraction (Java, 4 stars)
24. imagecat2 (XSLT, 4 stars): ImageCat version 2.
25. nutch-analytics (Scala, 4 stars): Nutch crawl analysis, a Spark-based project.
26. memex-cca-esindex (Python, 3 stars)
27. TrojanFootball (Java, 2 stars): Analyses athletes' past performance and workload for better training.
28. counterfeit-crawling (Python, 2 stars): Focused crawling and evaluation of counterfeit electronics sites.
29. tika-dl-models (2 stars): A place to release saved machine learning models for tika-dl.
30. sparkler-jsdriver (Java, 1 star)
31. file-content-visualizer (CSS, 1 star): Visualizations for byte frequency analysis, byte frequency correlation, byte frequency cross-correlation, and FHT.
32. sparkler-ui (JavaScript, 1 star)
33. PlanetaryIR (Shell, 1 star): Information retrieval for planetary science using DeepDive.