  • Stars: 5,866
  • Rank: 6,905 (Top 0.2%)
  • Language: JavaScript
  • License: MIT License
  • Created almost 10 years ago
  • Updated 5 months ago

Repository Details

The next web scraper. See through the <html> noise.

x-ray

var Xray = require('x-ray')
var x = Xray()

x('https://blog.ycombinator.com/', '.post', [
  {
    title: 'h1 a',
    link: '.article-title@href'
  }
])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray

Features

  • Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.

  • Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.

  • Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.

  • Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.

  • Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.

  • Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.

Selector API

xray(url, selector)(fn)

Scrape the url for the following selector, returning an object in the callback fn. The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.

Here are a few examples:

  • Scrape a single tag
xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})
  • Scrape a single class
xray('http://reddit.com', '.content')(fn)
  • Scrape an attribute
xray('http://techcrunch.com', 'img.logo@src')(fn)
  • Scrape innerHTML
xray('http://news.ycombinator.com', 'body@html')(fn)
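
To make the selector@attribute syntax concrete, here is a minimal sketch of how such a string could be split into its two parts. The parseSelector helper is hypothetical and written only for illustration; it is not x-ray's own parser, and it assumes the last @ in the string separates the selector from the attribute.

```javascript
// Hypothetical helper (not part of x-ray): split "selector@attribute"
// on the last "@" so attribute selectors like [data-src] stay intact.
function parseSelector(input) {
  const at = input.lastIndexOf('@')
  if (at === -1) {
    // No attribute given: the caller would fall back to the element's text.
    return { selector: input.trim(), attribute: null }
  }
  return {
    selector: input.slice(0, at).trim(),
    attribute: input.slice(at + 1).trim()
  }
}

console.log(parseSelector('img.logo@src'))
console.log(parseSelector('title'))
console.log(parseSelector('.dribbble-img [data-src]@data-src'))
```

The first call separates img.logo from src, the second reports no attribute, and the third keeps the [data-src] attribute selector inside the selector part.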

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. Available drivers include:

  • request - A simple driver built around request. Use this to set headers, cookies or http methods.
  • phantom - A high-level browser automation library. Use this to render pages, when elements need to be interacted with, or when elements are created dynamically using JavaScript (e.g. Ajax calls).

xray.stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')()
var x = require('x-ray')()

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().

xray.then(cb)

Constructs a Promise object and invokes its then function with a callback cb. Be sure to invoke then() as the last step of the xray method chain, since the other methods are not promisified.

x('https://dribbble.com', 'li.group', [
  {
    title: '.dribbble-img strong',
    image: '.dribbble-img [data-src]@data-src'
  }
])
  .paginate('.next_page@href')
  .limit(3)
  .then(function(res) {
    console.log(res[0]) // prints first result
  })
  .catch(function(err) {
    console.log(err) // handle error in promise
  })

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.
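
The interplay of paginate and limit can be pictured with a small simulation that does not use x-ray at all. The in-memory site object and the paginate function below are hypothetical stand-ins: each "page" carries its items and the URL a pagination selector would yield, and the loop stops after limit pages. Counting the first page toward the limit is an assumption of this sketch.

```javascript
// Hypothetical in-memory "site": each URL maps to its scraped items
// and the URL that the pagination selector would return.
const site = {
  '/page/1': { items: ['a', 'b'], next: '/page/2' },
  '/page/2': { items: ['c'], next: '/page/3' },
  '/page/3': { items: ['d'], next: '/page/4' },
  '/page/4': { items: ['e'], next: null }
}

// Follow `next` links from startUrl, visiting at most `limit` pages.
function paginate(pages, startUrl, limit) {
  const results = []
  let url = startUrl
  let visited = 0
  while (url && pages[url] && visited < limit) {
    results.push(...pages[url].items)
    url = pages[url].next
    visited += 1
  }
  return results
}

console.log(paginate(site, '/page/1', 3)) // [ 'a', 'b', 'c', 'd' ]
```

With a limit of 3 the fourth page is never requested, so its item is absent from the result.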

xray.abort(validator)

Abort pagination if validator function returns true. The validator function receives two arguments:

  • result: The scrape result object for the current page.
  • nextUrl: The URL of the next page to scrape.
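
As an illustration, a validator might look like the sketch below. The shouldAbort name, the empty-page rule and the /blog/ path check are all assumptions chosen for the example; only the (result, nextUrl) argument order comes from the description above.

```javascript
// Hypothetical validator: stop paginating when the current page produced
// no results, or when the next URL leaves the /blog/ section.
function shouldAbort(result, nextUrl) {
  const isEmpty = Array.isArray(result) && result.length === 0
  const leavesBlog = typeof nextUrl === 'string' && !nextUrl.includes('/blog/')
  return isEmpty || leavesBlog
}

console.log(shouldAbort([], 'https://example.com/blog/2')) // true
console.log(shouldAbort([{ title: 'Post' }], 'https://example.com/blog/2')) // false
console.log(shouldAbort([{ title: 'Post' }], 'https://example.com/about')) // true
```

A function like this would be passed to .abort() on a chain that also calls .paginate().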

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.

var x = Xray().delay('1s', '10s')
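
The from/to behavior can be sketched as a small function that picks a wait time. This is an illustration only: the pickDelay helper is hypothetical, it works on plain millisecond numbers, and the uniform random choice between from and to is an assumption rather than a documented detail of x-ray.

```javascript
// Hypothetical helper: choose a delay in milliseconds.
// With only `from`, wait exactly that long; with both, pick a value
// in [from, to]. `rng` is injectable so the choice can be made deterministic.
function pickDelay(from, to, rng = Math.random) {
  if (to === undefined) return from
  return from + Math.floor(rng() * (to - from + 1))
}

console.log(pickDelay(1000)) // 1000
console.log(pickDelay(1000, 10000, () => 0)) // 1000
console.log(pickDelay(1000, 10000, () => 0.5)) // 5500
```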

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

var x = Xray().concurrency(2)

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.

var x = Xray().throttle(2, '1s')
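
One way to picture "n requests per ms milliseconds" is as a schedule of earliest start times. The throttleSchedule function below is a hypothetical sketch that assumes fixed windows (the first n requests start immediately, the next n after one window, and so on); x-ray's actual scheduling may differ.

```javascript
// Hypothetical sketch: earliest start offset (in ms) for each of `count`
// requests when at most `n` requests may start per `windowMs` window.
function throttleSchedule(count, n, windowMs) {
  const offsets = []
  for (let i = 0; i < count; i += 1) {
    offsets.push(Math.floor(i / n) * windowMs)
  }
  return offsets
}

console.log(throttleSchedule(5, 2, 1000)) // [ 0, 0, 1000, 1000, 2000 ]
```

Under this assumption, five requests throttled to two per second finish starting after two seconds.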

xray.timeout(ms)

Specify a timeout of ms milliseconds for each request.

var x = Xray().timeout(30)

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray')
var x = Xray()

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
  /*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray')
var x = Xray()

x('http://mat.io', {
  title: 'title',
  items: x('.item', [
    {
      title: '.item-content h2',
      description: '.item-content section'
    }
  ])
})(function(err, obj) {
  /*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function(value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function(value) {
      return typeof value === 'string'
        ? value
            .split('')
            .reverse()
            .join('')
        : value
    },
    slice: function(value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})

x('http://mat.io', {
  title: 'title | trim | reverse | slice:0,2'
})(function(err, obj) {
  /*
  {
    title: 'oi'
  }
*/
})
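
To show how a chain of filters transforms a value step by step, here is a dependency-free sketch. The applyFilters helper and its simple parsing rules (split on |, then on : and ,, with arguments converted to numbers) are assumptions made for illustration; they are not x-ray's actual implementation.

```javascript
// The same kind of filter functions that would be passed to Xray({ filters }).
const filters = {
  trim: (value) => (typeof value === 'string' ? value.trim() : value),
  reverse: (value) =>
    typeof value === 'string' ? value.split('').reverse().join('') : value,
  slice: (value, start, end) =>
    typeof value === 'string' ? value.slice(start, end) : value
}

// Hypothetical helper: apply "name:arg1,arg2" segments, separated by "|",
// from left to right. Arguments are treated as numbers in this sketch.
function applyFilters(value, chain, available) {
  return chain.split('|').reduce((current, segment) => {
    const [name, rawArgs] = segment.trim().split(':')
    const args = rawArgs ? rawArgs.split(',').map(Number) : []
    const filter = available[name.trim()]
    if (typeof filter !== 'function') {
      throw new Error('Unknown filter: ' + name)
    }
    return filter(current, ...args)
  }, value)
}

// '  Hello  ' -> 'Hello' -> 'olleH' -> 'ol'
console.log(applyFilters('  Hello  ', 'trim | reverse | slice:0,2', filters)) // ol
```

Each filter receives the output of the previous one, so the order in which filters are written matters.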

In the Wild

  • Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.

License

MIT

More Repositories

1

date

Date() for humans
JavaScript
1,474
star
2

joy

A delightful Go to Javascript compiler (ON HOLD)
Go
1,325
star
3

array

A better array for the browser and node.js. Supports events & many functional goodies.
JavaScript
709
star
4

graph.ql

Faster and simpler way to create GraphQL servers
JavaScript
638
star
5

socrates

Small (8kb), batteries-included redux store to reduce boilerplate and promote good habits.
JavaScript
578
star
6

dots

WIP bootstrapping library for osx & ubuntu (and maybe others!)
Shell
545
star
7

next-cookies

Tiny little function for getting cookies on both client & server with next.js.
JavaScript
369
star
8

coderunner

Run server-side code quickly and securely in the browser.
JavaScript
327
star
9

28kb-react-redux-routing

React + Redux + Routing Stack for just 28kb
JavaScript
245
star
10

vo

Minimalist, yet complete control flow library.
JavaScript
235
star
11

roo

Jump-start your front-end server
JavaScript
104
star
12

component-test

Minimal configuration component test runner supporting browser testing, phantomjs, and saucelabs.
JavaScript
98
star
13

mini-html-parser

Mini HTML parser for webworkers / node. Intended for well-formed HTML.
JavaScript
83
star
14

node-nom

Dead simple site scrapper for Node.js
JavaScript
74
star
15

outliers

Find outliers in a dataset.
JavaScript
56
star
16

next-redirect

Redirect for next.js. Works on both the client and server
JavaScript
52
star
17

try-again

Generic, simple retry module with exponential backoff.
JavaScript
52
star
18

PHPUnit-Test-Report

Browser testing with PHPUnit
PHP
45
star
19

x-ray-crawler

Friendly web crawler for x-ray
JavaScript
44
star
20

pg-bridge

Simple service connecting PostgreSQL notifications to Amazon SNS.
Go
44
star
21

go-datadog

Easily send structured logs to Datadog over TCP.
Go
39
star
22

svg

low-level svg helper
JavaScript
37
star
23

preact-head

Standalone, declarative <Head /> for Preact.
JavaScript
37
star
24

adjust

Position elements next to each other. A light-weight version of HubSpot/tether.
JavaScript
36
star
25

wrap-fn

Low-level wrapper to easily support sync, async, and generator functions.
JavaScript
34
star
26

dom-iterator

Feature-rich, well-tested Iterator for traversing DOM nodes.
JavaScript
34
star
27

normalize-contenteditable

All text in a content-editable block should be wrapped in <p> tag.
JavaScript
34
star
28

ppi

Find the PPI (pixels per inch) of an image.
JavaScript
33
star
29

next-route

Simplified custom routing for next.js.
JavaScript
33
star
30

tipp

Tool tips that just work.
JavaScript
31
star
31

combine-errors

Simple way to combine multiple errors into one.
JavaScript
31
star
32

poss

Slightly better-looking error handling for async/await & generators
JavaScript
28
star
33

autocomplete

Flexible autocomplete component
JavaScript
26
star
34

envobj

Tiny environment variable helper, that I'll use in all my apps.
TypeScript
25
star
35

qr-code

Create QR codes
JavaScript
25
star
36

vcom

Everything you need to create virtual Preact Components with CSS, HTML, and JS.
JavaScript
21
star
37

tiny-store

Tiny immutable store for any value
JavaScript
21
star
38

blocktree

Back to the basics, Hickey-inspired, generic text parser
JavaScript
21
star
39

unmatrix

Parse and normalize the individual values of a css transform
JavaScript
21
star
40

enqueue

seamlessly queue up asynchronous function calls. supports concurrency and timeouts.
JavaScript
20
star
41

string-scanner

scan through strings. supports forwards and backwards scanning.
JavaScript
19
star
42

step.js

My kind of step library. no dependencies. 120 lines of code. 383 lines of tests.
JavaScript
18
star
43

every

human-friendly intervals using http://github.com/matthewmueller/date
JavaScript
17
star
44

json-to-dom

Fill in DOM nodes with JSON. Supports arrays and attributes.
JavaScript
17
star
45

pretty-html

HTML logging that's easy on the eyes.
JavaScript
17
star
46

preact-socrates

preact plugin for socrates.
JavaScript
16
star
47

time-series

simple streaming time series graphs. automatic rescaling as data streams in.
JavaScript
16
star
48

event-debugger

step through events! must be initialized at the top of your scripts.
JavaScript
16
star
49

sun

Simple little virtual DOM node builder for Preact.
JavaScript
15
star
50

x-ray-parse

x-ray's selector parser.
JavaScript
15
star
51

file-pipe

Use gulp plugins on individual files
JavaScript
14
star
52

title-capitalization

Properly capitalize English titles.
JavaScript
14
star
53

atom-standard

An on-save linter and formatter for atom using standard. Supports all the options that standard supports.
JavaScript
14
star
54

express-graph.ql

Express middleware for querying our graphql server built with graph.ql
JavaScript
13
star
55

murmur.js

Small murmur hash implementation.
JavaScript
13
star
56

mergin

Merges files together using a best-effort merge
JavaScript
13
star
57

next-flash

Flash messages for next.js. Works on both the client and the server.
JavaScript
13
star
58

image-search

Pluggable image search
JavaScript
13
star
59

redux-routes

Simple redux history middleware.
JavaScript
13
star
60

io

higher-level engine.io client.
JavaScript
12
star
61

stripe-checkout

Open Stripe Checkout programmatically
JavaScript
12
star
62

remember

Use localstorage to remember input values. Supports textareas and inputs including radio buttons and checkboxes.
JavaScript
12
star
63

internal-old

Internal queue for your public libraries and APIs
JavaScript
11
star
64

preact-rc

Remote control your Preact components
JavaScript
11
star
65

subs

tiny string substitution
JavaScript
10
star
66

lambda-serve

Use koa or express on lambda!
JavaScript
10
star
67

grow

Grow textareas without using a clone or ghost element.
JavaScript
10
star
68

better-error

easier, more colorful, sprintf-style errors
JavaScript
10
star
69

spreadsheet

NOTE: this project is quite old. I won't be maintaining it anymore, but it should still work :-)
JavaScript
10
star
70

debounce

Underscore's debounce method as a component.
JavaScript
10
star
71

typewriter

Animated typing
JavaScript
9
star
72

async-script-promise

Asynchronously load scripts
JavaScript
9
star
73

reverse-regex

flip a regular expression. allows you to efficiently search backwards.
JavaScript
9
star
74

gist

Fluent gist API for node.js.
JavaScript
9
star
75

unyield

allow generators functions to accept callbacks
JavaScript
9
star
76

cheerio-select

Tiny wrapper around FB55's excellent CSSselect library.
JavaScript
9
star
77

terraform-provider-url

Simple little Terraform data source for parsing URLs.
Go
9
star
78

mdb

In-memory key/value store designed for concurrent use
Go
9
star
79

cursors

Collection of Mac's native cursor elements
8
star
80

increment

increment strings. good for keeping slugs unique.
JavaScript
8
star
81

wrapped

Low-level wrapper to provide a consistent interface for sync, async, promises, and generator functions.
JavaScript
8
star
82

yieldly

Conditionally make functions yieldable
JavaScript
8
star
83

events

Stand-alone event bindings as a component based on how Backbone's views handle events.
JavaScript
8
star
84

color

Extremely basic color tinting component
JavaScript
8
star
85

vscode-proofie

Proofie is an experimental proof-reader for VSCode that helps you write better.
TypeScript
8
star
86

plumbing

Pluggable plumbing for your Javascript libraries.
JavaScript
7
star
87

clock

Create a swiss railway inspired clock.
HTML
7
star
88

routematch

simple, functional route matcher for node.js and the browser.
JavaScript
7
star
89

character-iterator

Iterate through text characters in the DOM tree. Handles parent & sibling relationships.
JavaScript
7
star
90

sns.js

Simple publish and parse module for AWS SNS
JavaScript
7
star
91

hex-to-color-name

Tiny module to map hex colors to color names of your choice.
JavaScript
7
star
92

hackernews

Go
7
star
93

css-to-js-object

Experimental: convert css to a JS object.
JavaScript
7
star
94

number-to-letter

Simple utility to convert an arbitrary number to a letter
JavaScript
7
star
95

rework-count

Rework plugin to style elements based on the sibling count.
JavaScript
6
star
96

arg-deps

Statically inspect a function to get the properties of its arguments. Works with minified code.
JavaScript
6
star
97

extend.js

extend objects. extend(obj, obj2, ...)
JavaScript
6
star
98

invisibles

make spaces visible
JavaScript
6
star
99

email

fluent email using sendmail
JavaScript
6
star
100

coderunner-api

API for coderunner
JavaScript
6
star