huntsman

Super configurable async web spider

Install

npm install huntsman --save


Example Script

/** Crawl wikipedia and use jquery syntax to extract information from the page **/

var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' ) // load cheerio extension
];

// follow pages which match this uri regex
spider.on( /http:\/\/en\.wikipedia\.org\/wiki\/\w+:\w+$/, function ( err, res ){

  // use jquery-style selectors & functions
  var $ = res.extension.cheerio;
  if( !$ ) return; // content is not html

  // extract information from page body
  var wikipedia = {
    uri: res.uri,
    heading: $('h1.firstHeading').text().trim(),
    body: $('div#mw-content-text p').text().trim()
  };

  console.log( wikipedia );

});

spider.queue.add( 'http://en.wikipedia.org/wiki/Huntsman_spider' );
spider.start();

Example Output

peter@edgy:/tmp$ node examples/html.js 
{
  "uri": "http://en.wikipedia.org/wiki/Wikipedia:Recent_additions",
  "heading": "Wikipedia:Recent additions",
  "body": "This is a selection of recently created new articles and greatly expanded former stub articles on Wikipedia that were featured on the Main Page as part of Did you know? You can submit new pages for consideration. (Archives are grouped by month of Main page appearance.)Tip: To find which archive contains the fact that appeared on Did You Know?, return to the article and click \"What links here\" to the left of the article. Then, in the dropdown menu provided for namespace, choose Wikipedia and click \"Go\". When you find \"Wikipedia:Recent additions\" and a number, click it and search for the article name.\n\nCurrent archive"
}

... etc

More examples are available in the /examples directory.


How it works

Huntsman takes one or more 'seed' urls via the spider.queue.add() method.

Once the crawl is kicked off with spider.start(), huntsman takes care of extracting links from each page and following only the pages you want.

To define which pages are crawled, use the spider.on() function with a string or regular expression.

Each page is only crawled once. If multiple patterns match a uri, all of their callbacks are called.

Page urls which do not match an on condition are never crawled.
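
For example, here is a minimal sketch (using a hypothetical example.com seed and hypothetical paths) showing that overlapping on conditions both fire for the same page:

/** Hypothetical sketch: overlapping 'on' conditions both fire for a matching page **/

var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ) // follow anchor links
];

// string condition: matches any uri containing '/blog/'
spider.on( '/blog/', function ( err, res ){
  console.log( 'blog handler:', res.uri );
});

// regex condition: '/blog/archive' pages match both conditions, so both callbacks run
spider.on( /\/blog\/archive/, function ( err, res ){
  console.log( 'archive handler:', res.uri );
});

spider.queue.add( 'http://www.example.com/' ); // hypothetical seed uri
spider.start();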


Configuration

The spider has default settings; you can override them by passing a settings object when you create a spider.

// use default settings
var huntsman = require('huntsman');
var spider = huntsman.spider();

// override default settings
var huntsman = require('huntsman');
var spider = huntsman.spider({
  throttle: 10, // maximum requests per second
  timeout: 5000 // maximum gap of inactivity before exiting (in milliseconds)
});

Crawling a site

How you configure your spider will vary from site to site; generally you will only be looking for pages with a specific url format.

Scrape product information from amazon

In this example we can see that Amazon product uris all seem to share the path '/gp/product/'.

After queueing the seed uri http://www.amazon.co.uk/, huntsman will recursively follow all the product pages it finds.

/** Example of scraping products from the amazon website **/

var huntsman = require('huntsman');
var spider = huntsman.spider();

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' ) // load cheerio extension
];

// target only product uris
spider.on( '/gp/product/', function ( err, res ){

  if( !res.extension.cheerio ) return; // content is not html
  var $ = res.extension.cheerio;

  // extract product information
  var product = {
    uri: res.uri,
    heading: $('h1.parseasinTitle').text().trim(),
    image: $('img#main-image').attr('src'),
    description: $('#productDescription').text().trim().substr( 0, 50 )
  };

  console.log( product );

});

spider.queue.add( 'http://www.amazon.co.uk/' );
spider.start();

Find pets for sale on craigslist in london

More complex crawls may require you to specify hub pages to follow before you can get to the content you really want. You can add an on event without a callback and huntsman will still follow matching pages and extract links from them.

/** Example of scraping information about pets for sale on craigslist in london **/

var huntsman = require('huntsman');
var spider = huntsman.spider({
  throttle: 2
});

spider.extensions = [
  huntsman.extension( 'recurse' ), // load recurse extension & follow anchor links
  huntsman.extension( 'cheerio' ), // load cheerio extension
  huntsman.extension( 'stats' ) // load stats extension
];

// target only pet uris
spider.on( /\/pet\/(\w+)\.html$/, function ( err, res ){

  if( !res.extension.cheerio ) return; // content is not html
  var $ = res.extension.cheerio;

  // extract listing information
  var listing = {
    heading: $('h2.postingtitle').text().trim(),
    uri: res.uri,
    image: $('img#iwi').attr('src'),
    description: $('#postingbody').text().trim().substr( 0, 50 )
  };

  console.log( listing );

});

// hub pages
spider.on( /http:\/\/london\.craigslist\.co\.uk$/ );
spider.on( /\/pet$/ );

spider.queue.add( 'http://www.craigslist.org/about/sites' );
spider.start();

Extensions

Extensions have default settings; you can override them by passing an optional second argument when the extension is loaded.

// loading an extension
spider.extensions = [
  huntsman.extension( 'extension_name', options )
];

recurse

This extension extracts links from html pages and then adds them to the queue.

The default patterns only target anchor tags which use the http protocol; you can change any of the default patterns by declaring them when the extension is loaded.

// default patterns
huntsman.extension( 'recurse', {
  pattern: {
    search: /a([^>]+)href\s?=\s?['"]([^"'#]+)/gi,
    refine: /['"]([^"'#]+)$/,
    filter: /^https?:\/\//
  }
})
  • search must be a global regexp and is used to target the links we want to extract.
  • refine is a regexp used to extract the bits we want from the search regex matches.
  • filter is a regexp that must match or links are discarded.
// extract both anchor tags and script tags
huntsman.extension( 'recurse', {
  pattern: {
    search: /(a([^>]+)href|script([^>]+)src)\s?=\s?['"]([^"'#]+)/gi, // <a> or <script>
  }
})
// ignore query segment of uris (exclude everything from '?' onwards)
huntsman.extension( 'recurse', {
  pattern: {
    search: /a([^>]+)href\s?=\s?['"]([^"'#\?]+)/gi // charlist for end of uri [^"'#\?]
  }
})
// avoid some file extensions
huntsman.extension( 'recurse', {
  pattern: {
    filter: /^https?:\/\/.*(?!\.(pdf|png|jpg|gif|zip))....$/i, // regex lookahead
  }
})
// avoid all uris with three letter file extensions
huntsman.extension( 'recurse', {
  pattern: {
    filter: /^https?:\/\/.*(?!\.\w{3})....$/, // exclude three letter file extensions
  }
})
// stay on one domain
huntsman.extension( 'recurse', {
  pattern: {
    filter: /^https?:\/\/www\.example\.com/i, // uris must be prefixed with this domain
  }
})

By default recurse converts relative urls to absolute urls and strips fragment identifiers and trailing slashes.

If you need even more control, you can override the resolver & normaliser functions to modify these behaviours.
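
The exact override interface isn't documented here; the sketch below assumes the recurse extension accepts resolver and normaliser options alongside pattern, so check the extension source for the real signatures:

// hypothetical sketch: 'resolver' and 'normaliser' option names are assumptions
var url = require('url');

huntsman.extension( 'recurse', {
  // resolve relative uris against the page they were found on (assumed signature)
  resolver: function ( baseUri, relativeUri ){
    return url.resolve( baseUri, relativeUri );
  },
  // strip fragment identifiers and trailing slashes (assumed signature)
  normaliser: function ( uri ){
    return uri.replace( /#.*$/, '' ).replace( /\/+$/, '' );
  }
})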

cheerio

This extension parses html and provides jquery-style selectors & functions.

// default settings
huntsman.extension( 'cheerio', { lowerCaseTags: true } )

The res.extension.cheerio function is available in your on callbacks when the response body is HTML.

spider.on( 'example.com', function ( err, res ){

  // use jquery-style selectors & functions
  var $ = res.extension.cheerio;
  if( !$ ) return; // content is not html

  console.log( res.uri, $('h1').text().trim() );

});

cheerio reference: https://github.com/MatthewMueller/cheerio

json

This extension parses the response body with JSON.parse().

// enable json
huntsman.extension( 'json' )

The res.extension.json function is available in your on callbacks when the response body is json.

spider.on( 'example.com', function ( err, res ){

  var json = res.extension.json;
  if( !json ) return; // content is not json

  console.log( res.uri, json );

});

links

This extension extracts links from html pages and returns the result.

It exposes the same functionality that the recurse extension uses to extract links.

// enable extension
huntsman.extension( 'links' )

The res.extension.links function is available in your on callbacks when the response body is a string.

spider.on( 'example.com', function ( err, res ){

  if( !res.extension.links ) return; // content is not a string

  // extract all image tags from body
  var images = res.extension.links({
    pattern: {
      search: /(img([^>]+)src)\s?=\s?['"]([^"'#]+)/gi, // extract img tags
      filter: /\.jpg|\.gif|\.png/i // filter file types
    }
  });

  console.log( images );

});

stats

This extension displays statistics about pages crawled, error counts, etc.

// default settings
huntsman.extension( 'stats', { tail: false } )
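
The only documented option is tail; here is a sketch flipping it on (what tail mode actually renders is an assumption based on the option name):

// assumption: tail mode continuously re-prints stats while the crawl runs
spider.extensions = [
  huntsman.extension( 'recurse' ),
  huntsman.extension( 'stats', { tail: true } )
];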

Custom queues and response storage adapters

I'm currently working on being able to persist the job queue via something like redis and potentially caching http responses in mongo with a TTL.

If you live life on the wild side, these adapters can be configured when you create a spider.
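
Since the adapter interface isn't documented yet, the shape below is purely hypothetical: an in-memory queue standing in where a redis-backed one might go, to illustrate where such an adapter would plug in:

// purely hypothetical adapter shape; the real interface may differ
var huntsman = require('huntsman');

var memoryQueue = {
  jobs: [],
  add: function ( uri ){ this.jobs.push( uri ); }, // enqueue a uri
  next: function (){ return this.jobs.shift(); }   // dequeue the next uri
};

// assumption: a custom queue can be passed in the spider settings object
var spider = huntsman.spider({ queue: memoryQueue });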

Pull requests welcome.


License

(The MIT License)

Copyright (c) 2013 Peter Johnson <@insertcoffee>

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
