• Stars
    star
    181
  • Rank 205,801 (Top 5 %)
  • Language
    Clojure
  • Created about 12 years ago
  • Updated almost 9 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A threaded web-spider written in Clojure

Itsy

A threaded web spider, written in Clojure.

Usage

In your project.clj:

[itsy "0.1.1"]

In your project:

(ns myns.foo
  (:require [itsy.core :refer :all]))

(defn my-handler [{:keys [url body]}]
  (println url "has a count of" (count body)))

(def c (crawl {;; initial URL to start crawling at (required)
               :url "http://aoeu.com"
               ;; handler to use for each page crawled (required)
               :handler my-handler
               ;; number of threads to use for crawling, (optional,
               ;; defaults to 5)
               :workers 10
               ;; number of urls to spider before crawling stops, note
               ;; that workers must still be stopped after crawling
               ;; stops. May be set to -1 to specify no limit.
               ;; (optional, defaults to 100)
               :url-limit 100
               ;; function to use to extract urls from a page, a
               ;; function that takes one argument, the body of a page.
               ;; (optional, defaults to itsy's extract-all)
               :url-extractor extract-all
               ;; http options for clj-http, (optional, defaults to
               ;; {:socket-timeout 10000 :conn-timeout 10000 :insecure? true})
               :http-opts {}
               ;; specifies whether to limit crawling to a single
               ;; domain. If false, does not limit domain, if true,
               ;; limits to the same domain as the original :url, if set
               ;; to a string, limits crawling to the hostname of the
               ;; given url
               :host-limit false
               ;; polite crawlers obey robots.txt directives
               ;; by default this crawler is polite
               :polite? true}))

;; ... crawling ensues ...

(thread-status c)
;; returns a map of thread-id to Thread.State:
{33 #<State RUNNABLE>, 34 #<State RUNNABLE>, 35 #<State RUNNABLE>,
 36 #<State RUNNABLE>, 37 #<State RUNNABLE>, 38 #<State RUNNABLE>,
 39 #<State RUNNABLE>, 40 #<State RUNNABLE>, 41 #<State RUNNABLE>,
 42 #<State RUNNABLE>}

(add-worker c)
;; adds an additional thread worker to the pool

(remove-worker c)
;; removes a worker from the pool

(stop-workers c)
;; stop-workers will return a collection of all threads it failed to
;; stop (it should be able to stop all threads unless something goes
;; very wrong)

Upon completion, c will contain state that allows you to see what happened:

(clojure.pprint/pprint (:state c))
;; URLs still in the queue
{:url-queue #<LinkedBlockingQueue []>,
;; URLs that were seen/queued
 :url-count #<Atom@67d6b87e: 2>,
 ;; running worker threads (will contain thread objects while crawling)
 :running-workers #<Ref@decdc7b: []>,
 ;; canaries for running worker threads
 :worker-canaries #<Ref@397f1661: {}>,
 ;; a map of URL to times seen/extracted from the body of a page
 :seen-urls
 #<Atom@469657c4:
   {"http://www.phpbb.com" 1,
    "http://pagead2.googlesyndication.com/pagead/show_ads.js" 2,
    "http://www.subBlue.com/" 1,
    "http://www.phpbb.com/" 1,
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" 1,
    "http://www.w3.org/1999/xhtml" 1,
    "http://forums.asdf.com" 1,
    "http://www.google.com/images/poweredby_transparent/poweredby_000000.gif" 1,
    "http://asdf.com" 1,
    "http://www.google.com/cse/api/branding.css" 1,
    "http://www.google.com/cse" 1}>}

Features

  • Multithreaded, with the ability to add and remove workers as needed
  • No global state, run multiple crawlers with multiple threads at once
  • Pre-written handlers for text files and ElasticSearch
  • Skips URLs that have been seen before
  • Domain limiting to crawl pages only belonging to a certain domain

Included handlers

Itsy includes handlers for common actions, either to be used, or examples for writing your own.

Text file handler

The text file handler stores web pages in text files. It uses the html->str method in itsy.extract to convert HTML documents to plain text (which in turn uses Tika to extract HTML to plain text).

Usage:

(ns bar
  (:require [itsy.core :refer :all]
            [itsy.handlers.textfiles :refer :all]))

;; The directory will be created when the handler is created if it
;; doesn't already exist
(def txt-handler (make-textfile-handler {:directory "/mnt/data" :extension ".txt"}))

(def c (crawl {:url "http://example.com" :handler txt-handler}))

;; then look in the /mnt/data directory

ElasticSearch handler

The elasticsearch handler stores documents with the following mapping:

{:id {:type "string"
      :index "not_analyzed"
      :store "yes"}
 :url {:type "string"
       :index "not_analyzed"
       :store "yes"}
 :body {:type "string"
        :store "yes"}}

Usage:

(ns foo
  (:require [itsy.core :refer :all]
            [itsy.handlers.elasticsearch :refer :all]))

;; These are the default settings
(def index-settings {:settings
                     {:index
                      {:number_of_shards 2
                       :number_of_replicas 0}}})

;; If the ES index doesn't exist, make-es-handler will create it when called.
(def es-handler (make-es-handler {:es-url "http://localhost:9200/"
                                  :es-index "crawl"
                                  :es-type "page"
                                  :es-index-settings index-settings
                                  :http-opts {}}))

(def c (crawl {:url "http://example.com" :handler es-handler}))

;; ... crawling and indexing ensues ...

Todo

  • Relative URL extraction/crawling
  • Always better URL extraction
  • Handlers for common body actions
    • elasticsearch
    • text files
    • other?
  • Helpers for dynamically raising/lowering thread count
  • Timed crawling, have threads clean themselves up after a limit
  • Have threads auto-clean when url-limit is hit
  • Use Tika for HTML extraction
  • Write tests

License

Copyright © 2012 Lee Hinman

Distributed under the Eclipse Public License, the same as Clojure.

More Repositories

1

clj-http

An idiomatic clojure http client wrapping the apache client. Officially supported version.
Clojure
1,759
star
2

cheshire

Clojure JSON and JSON SMILE (binary json format) encoding/decoding
Clojure
1,474
star
3

clojure-opennlp

Natural Language Processing in Clojure (opennlp)
Clojure
747
star
4

elasticsearch-in-action

Offical code repository for the Elasticsearch in Action book from Manning
Shell
370
star
5

eos

Welcome to the Emacs of Things, aka the Emacs Operating System
Emacs Lisp
261
star
6

es-mode

An Emacs major mode for interacting with Elasticsearch
Emacs Lisp
192
star
7

lein-bikeshed

A Leiningen plugin designed to tell you your code is bad, and that you should feel bad
Clojure
177
star
8

ox-tufte

Emacs' Org-mode export backend for Tufte HTML
Emacs Lisp
98
star
9

dakrone-dotfiles

misc configuration files
Vim Script
89
star
10

cld

Language detection for Clojure
Clojure
42
star
11

emacs-java-imports

Add java imports easily in Emacs
Emacs Lisp
40
star
12

clojuredocs-client

A tiny client for the http://clojuredocs.org API
Clojure
36
star
13

ricepaper

Simple library and CLI tool for adding URLs to InstaPaper
Ruby
35
star
14

nsm-console

Network Security Monitoring Console
Ruby
23
star
15

forkify

Do work from a pool of processes using forks. Like threadify with processes.
Ruby
18
star
16

lein-autotest

A Leiningen plugin to start Lazytest's autowatch
Clojure
17
star
17

fastri

Fastri, now with 1.9 support
Ruby
13
star
18

cadastre

Survey a clojure project and extract valuable metadata
Clojure
13
star
19

dakrone-theme

dakrone's custom emacs color theme
Emacs Lisp
11
star
20

dakrone-light-theme

Dakrone's custom light Emacs theme
Emacs Lisp
11
star
21

dakrone.github.com

webness
HTML
10
star
22

eisago

Next-gen clojuredocs importer, API, and website.
Clojure
9
star
23

syndicate

Fun with NLP and synonyms.
Clojure
9
star
24

tigris

Stream-to-stream JSON string escaping
Java
7
star
25

lein-clojuredocs

Generate data about your project to submit to clojuredocs.org
Clojure
7
star
26

lids

Locality Intrusion Detection System
C++
6
star
27

elasticsearch-clojure-plugin

A proof-of-concept Elasticsearch plugin written entirely in Clojure
Clojure
6
star
28

denverclojure

Denver Clojure meetup website
Clojure
5
star
29

rcapr

A Ruby library to interact with the pcapr website (http://pcapr.net)
Ruby
4
star
30

integrity-cap-notifier

Notifier for integrity that does capistrano deployments when a build passes
Ruby
4
star
31

clomoios

Context searching using NLP magic
Clojure
4
star
32

one-offs

Simple-file projects, one-offs and miscellaneous stuff.
Ruby
3
star
33

skyyy

A small ruby script to deal with Skype over dbus
Ruby
3
star
34

ruby-datasuite

A simple interactive/scriptable random data generation and verification test tool.
3
star
35

cbench

A small clojure benchmarking helper
Clojure
3
star
36

nile

Stream utilities for everyday Clojure use
Clojure
3
star
37

labview_rails

A lab machine tracking system using rails (for sysadmins)
Ruby
3
star
38

biblybot

Bibly bot is a Google Wave robot for whatever I feel like
Clojure
3
star
39

corpus

a tool used to train a detokenization library
Clojure
2
star
40

churn

Churn data given a directory and change rate
Java
2
star
41

jsontest

testing speeds of clojure json libs
Clojure
2
star
42

elasticsearch-nrepl

Embedded nREPL in ElasticSearch
Java
2
star
43

felix

A handy garbage monitor for monitoring when clojure objects are GC'd
Clojure
2
star
44

clj-http-async

clj-http, but with Apache's async client instead
Clojure
2
star
45

metube

Youtube downloading as an immutant service
Clojure
2
star
46

clj2010

Stats from #clojure, forked from https://bitbucket.org/tebeka/clj2010/overview
1
star
47

.emacs.d

My Emacs init.
Emacs Lisp
1
star
48

norad

SQS Message consumption for Immutant queues
Clojure
1
star
49

download-test

don't look
1
star
50

cd-client

clojuredocs API client
1
star
51

chrojos

a clojure library for parsing irrational datetime strings
Clojure
1
star
52

syn

Don't look at me yet
Clojure
1
star
53

gh-upload

Github file uploader
Clojure
1
star
54

recipi

Personal web server for storing recipes
Clojure
1
star
55

chunktest

test
Clojure
1
star
56

screamy

Immutant queue notifications
Clojure
1
star
57

org-criterium

Work in progress
Clojure
1
star
58

ars-capture

Periodically capture adaptive replica stats and store them in a different ES cluster
Clojure
1
star