• Stars: 258
• Rank: 154,971 (Top 4 %)
• Language: Clojure
• License: Eclipse Public Li...
• Created almost 9 years ago
• Updated almost 6 years ago

Repository Details

🐎✈️ Pegasus is a scalable, modular, polite web-crawler for Clojure

pegasus

Pegasus is a highly modular, durable, and scalable crawler for Clojure.

Parallelism is achieved with core.async. Durability is achieved with durable-queue and LMDB.
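
To illustrate the kind of parallelism core.async provides, here is a minimal sketch of a parallel processing stage. It is not taken from Pegasus's source: the namespace `example.parallel-stage` and the function `run-stage` are hypothetical names, and the sketch assumes `org.clojure/core.async` 1.2 or later is on the classpath (for `to-chan!`).

```clojure
(ns example.parallel-stage
  (:require [clojure.core.async :as async]))

;; Hypothetical helper, not part of Pegasus.
(defn run-stage
  "Applies f to every item of coll with up to n items in flight at once
  and returns the results as a vector, in input order."
  [n f coll]
  (let [in  (async/to-chan! coll) ; channel that closes once coll is consumed
        out (async/chan n)]
    ;; pipeline runs the transducer with parallelism n and closes `out`
    ;; when `in` is exhausted.
    (async/pipeline n out (map f) in)
    ;; Collect everything from `out` and block until it is closed.
    (async/<!! (async/into [] out))))
```

For example, `(run-stage 4 inc [1 2 3])` processes the three items across the workers while keeping their original order. `pipeline` is meant for compute-bound work; for blocking I/O such as HTTP fetches, `pipeline-blocking` would be the more appropriate choice.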

A blog post on how pegasus works: [link]

Usage

Leiningen dependencies:

Clojars Project

A few example crawls:

This one crawls 20 docs from my blog (http://blog.shriphani.com).

URLs are extracted using enlive selectors.

(ns pegasus.foo
  (:require [pegasus.core :refer [crawl]]
            [pegasus.dsl :refer :all])
  (:import (java.io StringReader)))

(defn crawl-sp-blog
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"})) ;; store all crawl data in /tmp/sp-blog-corpus/

(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds ["http://blog.shriphani.com"]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com")

                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #"blog.shriphani.com"))
          
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
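
To make the selector-based extraction above more concrete, here is a rough sketch of the same idea written directly against enlive. It only approximates what an `extract` rule appears to do (select nodes, read an attribute, keep values matching a regex) and is not the DSL's implementation. The namespace `example.selector-extraction` and the function `extract-links` are hypothetical names.

```clojure
(ns example.selector-extraction
  (:require [net.cgrand.enlive-html :as html])
  (:import (java.io StringReader)))

;; Hypothetical helper, not part of Pegasus.
(defn extract-links
  "Parses the HTML string body, selects nodes with the enlive selector,
  reads attr from each node, and keeps the values that match pattern."
  [body selector attr pattern]
  (->> (html/select (html/html-resource (StringReader. body)) selector)
       (keep #(get-in % [:attrs attr]))   ; drop nodes without the attribute
       (filter #(re-find pattern %))))
```

Calling `(extract-links page [:ul.pagination :a] :href #"blog\.shriphani\.com")` on a page body would return the pagination links that point at the blog, which is the role the second `extract` rule plays in the crawl above.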

If you want more control and would rather avoid the DSL, you can use the underlying machinery directly. Here is an example that uses XPath to extract links from the blog's RSS feed.

(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [net.cgrand.enlive-html :as html]
            [pegasus.core :refer [crawl]]
            [pegasus.process :as process] ;; provides PipelineComponentProtocol
            [clj-xpath.core :refer [$x $x:text xml->doc]]))

(deftype XpathExtractor []
  process/PipelineComponentProtocol
  
  (initialize
    [this config]
    config)
  
  (run
    [this obj config]
    (when (= "blog.shriphani.com"
             (-> obj :url uri/host))
      
      (let [url (:url obj)
            resource (try (-> obj
                              :body
                              xml->doc)
                          (catch Exception e nil))
            
            ;; extract the articles
            articles (map
                      :text
                      (try ($x "//item/link" resource)
                           (catch Exception e nil)))]
        
        ;; add extracted links to the supplied object
        (merge obj
               {:extracted articles}))))

  (clean
    [this config]
    nil))

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds ["http://blog.shriphani.com/feeds/all.rss.xml"]
          :user-agent "Pegasus web crawler"
          :extractor (->XpathExtractor)
          
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))

;; start crawling
(crawl-sp-blog-xpaths)          
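
The XPath step inside the extractor can also be tried on its own. The sketch below runs the same `//item/link` query against a small hand-written feed. The namespace `example.rss-links`, the function `feed-links`, and the `sample-feed` document are made up for illustration; the real feed will contain more elements.

```clojure
(ns example.rss-links
  (:require [clj-xpath.core :refer [$x xml->doc]]))

;; Hypothetical two-item feed, used only for this illustration.
(def sample-feed
  (str "<rss><channel>"
       "<item><link>http://blog.shriphani.com/post-a/</link></item>"
       "<item><link>http://blog.shriphani.com/post-b/</link></item>"
       "</channel></rss>"))

;; Hypothetical helper, not part of Pegasus.
(defn feed-links
  "Returns the text of every //item/link node in the XML string."
  [xml-string]
  (map :text ($x "//item/link" (xml->doc xml-string))))
```

`(feed-links sample-feed)` yields the two post URLs, which is the kind of value the extractor places under `:extracted`.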

License

Copyright © 2015-2018 Shriphani Palakodety

Distributed under the Eclipse Public License, either version 1.0 or (at your option) any later version.

More Repositories

1. Listener (Python, 52 stars): Detect calls of attention in the surroundings
2. clj-lmdb (Clojure, 36 stars): Clojure wrapper for lmdb
3. subotai (Clojure, 25 stars): Subotai brings routines for extracting information from HTML documents to clojure
4. fort-knox (Clojure, 23 stars): A disk-backed core.cache implementation based on LMDB
5. sleipnir (Clojure, 17 stars): A simple, performant web-crawler for clojure
6. clojure-manifold (Clojure, 15 stars): Manifold learning algorithms in clojure
7. polyglot-toolbox (Python, 11 stars): Polyglot skipgram embeddings, and their many health benefits
8. vad_python (Python, 9 stars): A solid VAD in Python
9. VAD-py (C, 9 stars): Webrtc VAD in Python
10. JPredict (C#, 8 stars): Applying ML Techniques to Predict Drawn Japanese Characters. Currently Hiragana is implemented
11. robust_pcp (Python, 7 stars): Robust Principal Component Pursuit
12. clojure_scraping_overview (Clojure, 7 stars): XPath and enlive
13. tinywm-rkt (Racket, 6 stars): TinyWM Implementation in Racket
14. tree-edit-distance (Clojure, 5 stars): An implementation of a tree-edit-distance algorithm for structure-based clustering in clojure
15. kublai (Clojure, 4 stars): Truncated matrix decompositions for core.matrix
16. sutime-clojure (Clojure, 3 stars): A wrapper around the Time NER Tagger in Stanford Core NLP Suite.
17. enlive-helper (Clojure, 3 stars): A more powerful html-resource for use with enlive's functions
18. clj-heritrix (Clojure, 2 stars): Clojure implementation of the heritrix REST API
19. structural_similarity (Clojure, 2 stars): Compare html documents for similarity in structure (or template)
20. probabilistic-counting (Clojure, 2 stars): Cardinality estimation algorithms in clojure
21. crawler (Clojure, 2 stars): ephemeral content finder
22. clj-named-leveldb (Clojure, 1 star): named databases for leveldb using one simple hack they don't want you to know
23. trec (Python, 1 star): Trec Federated Search Track
24. satcharitra (Clojure, 1 star)
25. pegasus-examples (Clojure, 1 star): Pegasus Examples
26. racket-whistlepig (Racket, 1 star): Racket bindings to the whistlepig engine
27. clj-spectral (Clojure, 1 star): Spectral algorithms in clojure targeting core.matrix
28. pgm-indian-buffet-process (1 star): Scribe Notes for CMU 10-708 Lecture on Indian Buffer Process
29. clojure-kindle-highlights (Clojure, 1 star): Scrape the kindle highlights webpage and download the highlights for a book from there.
30. geojson3d (JavaScript, 1 star): 3d Render GeoJsons
31. heritrix-clojure (Clojure, 1 star): Heritrix API implementation in clojure (a bit of a kludge at the moment)
32. index-page-crawler (Clojure, 1 star): Follow pagination and get pages
33. web-corpus (Clojure, 1 star): Clueweb web corpus pipeline
34. india_in_data (1 star): India in data source, datasets, materials
35. consistent-hashing (Java, 1 star): Consistent hashing implementation in clojure
36. clj-dimension (Clojure, 1 star): Algorithms to study and reduce dimensions of datasets
37. warc-clojure (Clojure, 1 star): Clojure wrapper around a Java library to read warc files.