readabilitySAX

a fast and platform independent readability port

About

This is a port of the algorithm used by the Readability bookmarklet to extract relevant pieces of information from websites, using a SAX parser.

The advantage over other ports, e.g. arrix/node-readability, is a smaller memory footprint and much faster execution. In my tests, most pages, even large ones, finished within 15 ms (on node; see below for more information). It works with Rhino, so it runs on YQL, which may have interesting uses. It also works within a browser.
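
To make the memory argument concrete, here is a minimal sketch of the SAX style of processing: the handler only reacts to open-tag, text, and close-tag events and keeps a small stack, so no document tree is ever built. The callback names follow htmlparser2's handler interface; the `TextLengthHandler` class and the hand-fed event sequence are illustrative assumptions and not part of readabilitySAX.

```javascript
// Illustrative SAX-style handler: tracks how much text each <p> contains
// without ever building a DOM tree. Not readabilitySAX's actual code.
class TextLengthHandler {
  constructor() {
    this.stack = [];       // names of the currently open elements
    this.paragraphs = [];  // text length of every closed <p>
    this.current = 0;      // text length collected for the open <p>
  }
  onopentag(name) {
    this.stack.push(name);
    if (name === "p") this.current = 0;
  }
  ontext(text) {
    // Only count text that sits somewhere inside a <p>.
    if (this.stack.includes("p")) this.current += text.trim().length;
  }
  onclosetag(name) {
    this.stack.pop();
    if (name === "p") this.paragraphs.push(this.current);
  }
}

// Events a parser would emit for: <div><p>Hello <b>world</b></p><p>Bye</p></div>
const handler = new TextLengthHandler();
handler.onopentag("div");
handler.onopentag("p");
handler.ontext("Hello ");
handler.onopentag("b");
handler.ontext("world");
handler.onclosetag("b");
handler.onclosetag("p");
handler.onopentag("p");
handler.ontext("Bye");
handler.onclosetag("p");
handler.onclosetag("div");

console.log(handler.paragraphs); // [ 10, 3 ]
```

The only state that grows with document size is the list of results; the stack is bounded by nesting depth, which is the property a SAX-based port relies on.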

The Readability extraction algorithm was completely ported, but some adjustments were made:

  • <article> and <section> tags are recognized and gain a higher value

  • If a heading is part of the page's <title>, it is removed (Readability removed any single <h2>, and ignored other tags)

  • hentry and instapaper-body are classes that tell an algorithm like this one where the content is. readabilitySAX recognizes them and awards additional points

  • Every bit of code taken from the original algorithm was optimized, e.g. RegExps should now perform faster (they were optimized and use RegExp#test instead of String#match, so the interpreter doesn't have to build an array)

  • Some improvements made by GGReadability (an Obj-C port of Readability) were adopted

    • Images score additional points when their height or width attributes are large; icon-sized images (<= 32px) are skipped
    • Additional classes & ids are checked
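
The RegExp point in the list above can be illustrated in isolation. The pattern and class name below are made up for the example and are not the expressions readabilitySAX uses; the sketch only shows that RegExp#test returns a plain boolean while String#match allocates a result array on a hit.

```javascript
// Hypothetical pattern, for illustration only.
const rePositive = /article|content|entry/;

const className = "main-content wide";

// String#match builds an array (or returns null), even though only a yes/no is needed.
const viaMatch = className.match(rePositive);

// RegExp#test answers the same question with a boolean and no array.
const viaTest = rePositive.test(className);

console.log(Array.isArray(viaMatch)); // true
console.log(viaTest);                 // true
```

The pattern has no `g` flag, so `test` carries no `lastIndex` state between calls and can be reused safely across elements.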

How To

Install readabilitySAX

npm install readabilitySAX

CLI

A command line interface (CLI) may be installed via

npm install -g readabilitySAX

It's then available via

readability <domain> [<format>]

To get this readme, just run

readability https://github.com/FB55/readabilitySAX

The format is optional (it's either text or html, the default value is text).

Usage

Node

Just run require("readabilitySAX"). You'll get an object containing three methods:

  • Readability(settings): The readability constructor. It works as a handler for htmlparser2. Read more about it in the wiki!

  • WritableStream(settings, cb): A constructor that unites htmlparser2 and the Readability constructor. It's a writable stream, so simply .write all your data to it. Your callback will be called once .end has been called. Bonus: You can also .pipe data into it!

  • createWritableStream(settings, cb): Returns a new instance of the WritableStream. (It's a simple factory method.)
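
As a rough sketch of the write-then-end flow described above: the helper below takes the factory as a parameter, so the snippet itself does not depend on the package being installed. The helper name `extractArticle`, the error-first `done` callback, and the assumption that an empty settings object is acceptable are all mine; the shape of the `article` value is whatever the library hands to its callback and is not assumed here.

```javascript
// Hypothetical helper: push one HTML string through a writable stream
// created by a createWritableStream(settings, cb)-style factory.
function extractArticle(html, createWritableStream, done, settings = {}) {
  // The factory's callback fires once the stream has been ended.
  const stream = createWritableStream(settings, (article) => done(null, article));
  // Assumption: failures surface as a regular stream "error" event.
  stream.on("error", done);
  stream.write(html);
  stream.end();
}
```

With the package installed, the call would presumably look like `extractArticle(html, require("readabilitySAX").createWritableStream, (err, article) => { ... })`. Since the result is a writable stream, a readable source could also be piped into it instead of calling .write by hand.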

There are two methods available that are deprecated and will be removed in a future version:

  • get(link, [settings], callback): Gets a webpage and processes it.

  • process(data): Takes a string, runs readabilitySAX and returns the page.

Please don't use those two methods anymore. Streams are the way you should build interfaces in node, and that's what I want to encourage people to use.

Browsers

I started to implement simplified SAX-"parsers" for Rhino/YQL (using E4X) and the browser (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the /browsers dir.

A demo of how to use readabilitySAX inside a browser may be found at jsFiddle. Some basic example files are inside the /browsers directory.

YQL

A table using E4X-based events is available as the community table redabilitySAX, as well as here.

Parsers (on node)

Most SAX parsers (such as sax.js) fail when a document is malformed XML, even if it's correct HTML. readabilitySAX should be used with htmlparser2, my fork of the htmlparser module (used by e.g. jsdom), which corrects most faults. It's listed as a dependency, so npm should install it with readabilitySAX.

Performance

Speed

Using a package of 724 pages from CleanEval (their website seems to be down, try to google it), readabilitySAX processed all of them in 5768 ms, an average of 7.97 ms per page.

The benchmark was done using tests/benchmark.js on a MacBook (late 2010) and is probably far from perfect.
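
For readers who want to reproduce this kind of measurement, a timing loop might look roughly like the following. This is not the contents of tests/benchmark.js; `timeAll`, `averageMs`, and the `processPage` placeholder are hypothetical names, and the last line simply redoes the arithmetic for the figures quoted above.

```javascript
// Time a function over a list of inputs and report total and average in ms.
// `processPage` is a placeholder for whatever performs the extraction.
function timeAll(pages, processPage) {
  const start = process.hrtime.bigint();
  for (const page of pages) processPage(page);
  // hrtime.bigint() is in nanoseconds; convert to milliseconds.
  const totalMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { totalMs, averageMs: averageMs(totalMs, pages.length) };
}

function averageMs(totalMs, pageCount) {
  return totalMs / pageCount;
}

// 5768 ms for 724 pages:
console.log(averageMs(5768, 724).toFixed(2)); // 7.97
```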

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

Accuracy

The main goal of CleanEval is to evaluate the accuracy of an algorithm.

// TODO

Todo

  • Add documentation & examples
  • Add support for URLs containing hash-bangs (#!)
  • Allow fetching articles with more than one page
  • Don't remove all images inside <a> tags
