• Stars
    star
    6,470
  • Rank 5,836 (Top 0.2 %)
  • Language
    JavaScript
  • License
    Other
  • Created about 9 years ago
  • Updated 11 months ago

Repository Details

A standalone version of the readability lib

Readability.js

A standalone version of the readability library used for Firefox Reader View.

Installation

Readability is available on npm:

npm install @mozilla/readability

You can then require() it, or for web-based projects, load the Readability.js script from your webpage.

Basic usage

To parse a document, you must create a new Readability object from a DOM document object, and then call the parse() method. Here's an example:

var article = new Readability(document).parse();

If you use Readability in a web browser, you will likely be able to use a document reference from elsewhere (e.g. fetched via XMLHttpRequest, in a same-origin <iframe> you have access to, etc.). In Node.js, you can use an external DOM library.
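For instance, if you already hold a page's HTML as a string in the browser, one option is to build a document with the standard DOMParser API and hand it to Readability. This is only a sketch: the helper name `parseHtmlString` is made up for illustration, and both constructors are passed in as arguments so that the snippet makes no assumption about how Readability was loaded.

```javascript
// Hypothetical helper, not part of Readability.
// In a browser you would pass the global DOMParser and the loaded Readability.
function parseHtmlString(html, DOMParserCtor, ReadabilityCtor) {
  // parseFromString() with "text/html" returns a new Document object.
  const doc = new DOMParserCtor().parseFromString(html, "text/html");
  return new ReadabilityCtor(doc).parse();
}
```

In a browser, a call might look like `parseHtmlString(html, DOMParser, Readability)`.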

API Reference

new Readability(document, options)

The options object accepts a number of properties, all optional:

  • debug (boolean, default false): whether to enable logging.
  • maxElemsToParse (number, default 0 i.e. no limit): the maximum number of elements to parse.
  • nbTopCandidates (number, default 5): the number of top candidates to consider when analysing how tight the competition is among candidates.
  • charThreshold (number, default 500): the number of characters an article must have in order to return a result.
  • classesToPreserve (array): a set of classes to preserve on HTML elements when the keepClasses option is set to false.
  • keepClasses (boolean, default false): whether to preserve all classes on HTML elements. When set to false, only classes specified in the classesToPreserve array are kept.
  • disableJSONLD (boolean, default false): when extracting page metadata, Readability gives precedence to Schema.org fields specified in the JSON-LD format. Set this option to true to skip JSON-LD parsing.
  • serializer (function, default el => el.innerHTML): controls how the content property returned by the parse() method is produced from the root DOM element. It may be useful to specify the serializer as the identity function (el => el) to obtain a DOM element instead of a string for content if you plan to process it further.
  • allowedVideoRegex (RegExp, default undefined): a regular expression that matches video URLs that should be allowed to be included in the article content. If undefined, the default regex is applied.
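
As a rough illustration of the options above, the sketch below passes a few of them to the constructor. The helper name `parseWithOptions` and the class name `caption` are invented for this example, and the Readability constructor is passed in as an argument so the snippet does not depend on how you load the library.

```javascript
// Hypothetical helper: `parseWithOptions` is not part of Readability.
// `ReadabilityCtor` is the Readability constructor, however you loaded it.
function parseWithOptions(ReadabilityCtor, doc) {
  const options = {
    charThreshold: 200,             // accept shorter articles than the default 500
    keepClasses: false,             // strip classes...
    classesToPreserve: ["caption"], // ...except this one (example class name)
    serializer: (el) => el,         // return a DOM element instead of an HTML string
  };
  return new ReadabilityCtor(doc, options).parse();
}
```

With the identity serializer, the content property of the result would be a DOM element that you can process further before serializing it yourself.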

parse()

Returns an object containing the following properties:

  • title: article title;
  • content: HTML string of processed article content;
  • textContent: text content of the article, with all the HTML tags removed;
  • length: length of an article, in characters;
  • excerpt: article description, or short excerpt from the content;
  • byline: author metadata;
  • dir: content direction;
  • siteName: name of the site;
  • lang: content language.
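
A small sketch of reading these properties follows. The function name `describeArticle` is hypothetical, and the null check reflects the assumption that parse() yields no object when it cannot extract an article (for example, when the text is shorter than charThreshold).

```javascript
// Hypothetical helper: turns a parse() result into a one-line summary.
function describeArticle(article) {
  if (!article) {
    // Assumption: parse() gave no result (e.g. the page was too short).
    return "No article found";
  }
  const byline = article.byline ? ` by ${article.byline}` : "";
  return `${article.title}${byline} (${article.length} characters)`;
}
```

It could be called as `describeArticle(new Readability(document).parse())`.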

The parse() method works by modifying the DOM. This removes some elements from the web page, which may be undesirable. You can avoid this by passing a clone of the document object to the Readability constructor:

var documentClone = document.cloneNode(true);
var article = new Readability(documentClone).parse();

isProbablyReaderable(document, options)

A quick-and-dirty way of figuring out if it's plausible that the contents of a given document are suitable for processing with Readability. It is likely to produce both false positives and false negatives. The reason it exists is to avoid bogging down a time-sensitive process (like loading and showing the user a webpage) with the complex logic in the core of Readability. Improvements to its logic (while not deteriorating its performance) are very welcome.

The options object accepts a number of properties, all optional:

  • minContentLength (number, default 140): the minimum node content length used to decide if the document is readerable;
  • minScore (number, default 20): the minimum cumulated 'score' used to determine if the document is readerable;
  • visibilityChecker (function, default isNodeVisible): the function used to determine if a node is visible.

The function returns a boolean corresponding to whether or not we suspect Readability.parse() will succeed at returning an article object. Here's an example:

/*
    Only instantiate Readability if we suspect
    the `parse()` method will produce a meaningful result.
*/
if (isProbablyReaderable(document)) {
    let article = new Readability(document).parse();
}
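
The check also accepts the options listed above. The following sketch combines it with the document-cloning advice from the parse() section; the helper name `parseIfReaderable` and the threshold values are illustrative assumptions, and `lib` stands for any object exposing Readability and isProbablyReaderable.

```javascript
// Hypothetical helper: `parseIfReaderable` is not part of the library.
// `lib` is an object exposing Readability and isProbablyReaderable.
function parseIfReaderable(lib, doc) {
  const checkOptions = {
    minContentLength: 200, // stricter than the default 140
    minScore: 30,          // stricter than the default 20
  };
  if (!lib.isProbablyReaderable(doc, checkOptions)) {
    return null; // skip the expensive parse
  }
  // Clone so that parse() does not modify the document being displayed.
  return new lib.Readability(doc.cloneNode(true)).parse();
}
```

Raising the thresholds trades more false negatives for fewer wasted parse() calls; suitable values depend on the pages you handle.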

Node.js usage

Since Node.js does not come with its own DOM implementation, we rely on external libraries like jsdom. Here's an example using jsdom to obtain a DOM document object:

var { Readability } = require('@mozilla/readability');
var { JSDOM } = require('jsdom');
var doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", {
  url: "https://www.example.com/the-page-i-got-the-source-from"
});
let reader = new Readability(doc.window.document);
let article = reader.parse();

Remember to pass the page's URI as the url option in the JSDOM constructor (as shown in the example above), so that Readability can convert relative URLs for images, hyperlinks etc. to their absolute counterparts.
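
To illustrate the kind of conversion meant here, the snippet below resolves relative references against the page URL using the standard URL class. It demonstrates general URL resolution only and is not Readability's internal code; the `/about` path is an invented example.

```javascript
// Standard URL resolution with the built-in URL class (browsers and Node.js).
const pageUrl = "https://www.example.com/the-page-i-got-the-source-from";
const absoluteImage = new URL("./cat.jpg", pageUrl).href;
const absoluteLink = new URL("/about", pageUrl).href;
console.log(absoluteImage); // https://www.example.com/cat.jpg
console.log(absoluteLink);  // https://www.example.com/about
```

Without a base URL there is nothing to resolve `./cat.jpg` against, which is why the url option matters.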

jsdom has the ability to run the scripts included in the HTML and fetch remote resources. For security reasons these are disabled by default, and we strongly recommend you keep them that way.

Security

If you're going to use Readability with untrusted input (whether in HTML or DOM form), we strongly recommend you use a sanitizer library like DOMPurify to avoid script injection when you use the output of Readability. We would also recommend using CSP to add further defense-in-depth restrictions to what you allow the resulting content to do. The Firefox integration of reader mode uses both of these techniques itself. Sanitizing unsafe content out of the input is explicitly not something we aim to do as part of Readability itself. There are other good sanitizer libraries out there; use them!
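
A minimal sketch of the sanitizing step might look like the following. The helper name `sanitizeArticle` is hypothetical, `purify` is assumed to be a DOMPurify instance (which exposes a `sanitize()` method), and the article's content is assumed to be an HTML string, as with the default serializer.

```javascript
// Hypothetical helper: `sanitizeArticle` is not part of Readability or DOMPurify.
// `purify` is a DOMPurify instance (it exposes sanitize(dirtyHtml)).
function sanitizeArticle(article, purify) {
  if (!article) return null;
  // Copy the article so the original parse() result stays unchanged.
  return { ...article, content: purify.sanitize(article.content) };
}
```

In a browser, `purify` would typically be the global DOMPurify object. Only the sanitized copy should be inserted into a page.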

Contributing

Please see our Contributing document.

License

Copyright (c) 2010 Arc90 Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

More Repositories

1

pdf.js

PDF Reader in JavaScript
JavaScript
43,965
star
2

DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
C++
24,221
star
3

send

Simple, private file sharing from the makers of Firefox
FreeMarker
13,225
star
4

sops

Simple and flexible tool for managing secrets
Go
12,778
star
5

BrowserQuest

An HTML5/JavaScript multiplayer game experiment
JavaScript
9,167
star
6

nunjucks

A powerful templating engine with inheritance, asynchronous control, and more (jinja2 inspired)
JavaScript
8,415
star
7

geckodriver

WebDriver for Firefox
6,911
star
8

TTS

🤖 💬 Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Jupyter Notebook
6,749
star
9

sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. Sccache has the capability to utilize caching in remote storage environments, including various cloud storage options, or alternatively, in local storage.
Rust
5,334
star
10

mozjpeg

Improved JPEG encoder.
C
5,216
star
11

Fira

Mozilla's new typeface, used in Firefox OS
CSS
4,920
star
12

rhino

Rhino is an open-source implementation of JavaScript written entirely in Java
JavaScript
3,956
star
13

shumway

Shumway is a Flash VM and runtime written in JavaScript
TypeScript
3,692
star
14

source-map

Consume and generate source maps.
JavaScript
3,471
star
15

gecko-dev

Read-only Git mirror of the Mercurial gecko repositories at https://hg.mozilla.org. How to contribute: https://firefox-source-docs.mozilla.org/contributing/contribution_quickref.html
2,897
star
16

multi-account-containers

Firefox Multi-Account Containers lets you keep parts of your online life separated into color-coded tabs that preserve your privacy. Cookies are separated by container, allowing you to use the web with multiple identities or accounts simultaneously.
JavaScript
2,594
star
17

bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
Python
2,590
star
18

web-ext

A command line tool to help build, run, and test web extensions
JavaScript
2,557
star
19

node-convict

Featureful configuration management library for Node.js
JavaScript
2,304
star
20

MozDef

DEPRECATED - MozDef: Mozilla Enterprise Defense Platform
Python
2,173
star
21

cbindgen

A project for generating C bindings from Rust code
Rust
2,157
star
22

popcorn-js

The HTML5 Media Framework. (Unmaintained. See https://github.com/menismu/popcorn-js for activity)
JavaScript
2,148
star
23

webextension-polyfill

A lightweight polyfill library for Promise-based WebExtension APIs in Chrome
JavaScript
2,088
star
24

fathom

A framework for extracting meaning from web pages
JavaScript
1,972
star
25

cipherscan

A very simple way to find out which SSL ciphersuites are supported by a target.
Python
1,912
star
26

hawk

HTTP Holder-Of-Key Authentication Scheme
JavaScript
1,903
star
27

persona

Persona is a secure, distributed, and easy to use identification system.
JavaScript
1,828
star
28

http-observatory

Mozilla HTTP Observatory
Python
1,784
star
29

uniffi-rs

a multi-language bindings generator for rust
Rust
1,783
star
30

neqo

Neqo, an implementation of QUIC in Rust
Rust
1,759
star
31

mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
Rust
1,652
star
32

task.js

Beautiful concurrency for JavaScript
JavaScript
1,635
star
33

hubs

Duck-themed multi-user virtual spaces in WebVR. Built with A-Frame.
JavaScript
1,561
star
34

thimble.mozilla.org

UPDATE: This project is no longer maintained. Please check out Glitch.com instead.
JavaScript
1,423
star
35

fx-private-relay

Keep your email safe from hackers and trackers. Make an email alias with 1 click, and keep your address to yourself.
Python
1,415
star
36

pontoon

Mozilla's Localization Platform
Python
1,396
star
37

kitsune

Platform for Mozilla Support
Python
1,247
star
38

mig

Distributed & real time digital forensics at the speed of the cloud
Go
1,195
star
39

OpenWPM

A web privacy measurement framework
Python
1,150
star
40

bedrock

Making mozilla.org awesome, one pebble at a time
HTML
1,149
star
41

server-side-tls

Server side TLS Tools
HTML
1,114
star
42

grcov

Rust tool to collect and aggregate code coverage data for multiple source files
Rust
1,106
star
43

policy-templates

Policy Templates for Firefox
1,105
star
44

rust-android-gradle

Kotlin
989
star
45

pdfjs-dist

Generic build of PDF.js library.
JavaScript
952
star
46

contain-facebook

Facebook Container isolates your Facebook activity from the rest of your web activity in order to prevent Facebook from tracking you outside of the Facebook website via third party cookies.
JavaScript
945
star
47

narcissus

INACTIVE - http://mzl.la/ghe-archive - The Narcissus meta-circular JavaScript interpreter
JavaScript
901
star
48

openbadges-backpack

Mozilla Open Badges Backpack
JavaScript
861
star
49

addons-server

🕶 addons.mozilla.org Django app and API 🎉
Python
833
star
50

awsbox

INACTIVE - http://mzl.la/ghe-archive - A featherweight PaaS on top of Amazon EC2 for deploying node apps
JavaScript
811
star
51

dxr

DEPRECATED - Powerful search for large codebases
Python
804
star
52

ssh_scan

DEPRECATED - A prototype SSH configuration and policy scanner (Blog: https://mozilla.github.io/ssh_scan/)
Ruby
796
star
53

chromeless

DEPRECATED - Build desktop applications with web technologies.
JavaScript
761
star
54

node-client-sessions

secure sessions stored in cookies
JavaScript
745
star
55

playdoh

PROJECT DEPRECATED (WAS: "Mozilla's Web application base template. Half Django, half awesomeness, half not good at math.")
Python
714
star
56

DeepSpeech-examples

Examples of how to use or integrate DeepSpeech
Python
682
star
57

blurts-server

Firefox Monitor arms you with tools to keep your personal information safe. Find out what hackers already know about you and learn how to stay a step ahead of them.
Fluent
679
star
58

tofino

Project Tofino is a browser interaction experiment.
HTML
655
star
59

addon-sdk

DEPRECATED - The Add-on SDK repository.
641
star
60

MozStumbler

Android Stumbler for Mozilla
Java
614
star
61

application-services

Firefox Application Services
Rust
598
star
62

standards-positions

Python
595
star
63

lightbeam

Original unmaintained version of the Lightbeam extension. See lightbeam-we for the new one, which works in modern versions of Firefox.
JavaScript
587
star
64

moz-sql-parser

DEPRECATED - Let's make a SQL parser so we can provide a familiar interface to non-sql datastores!
Python
574
star
65

firefox-translations

Firefox Translations is a webextension that enables client side translations for web browsers.
JavaScript
571
star
66

spidernode

Node.js on top of SpiderMonkey
JavaScript
560
star
67

inclusion

Our repository for Diversity, Equity and Inclusion work at Mozilla
557
star
68

positron

An experimental, Electron-compatible runtime on top of Gecko
551
star
69

fxa

Monorepo for Firefox Accounts
JavaScript
547
star
70

cargo-vet

supply-chain security for Rust
Rust
547
star
71

ichnaea

Mozilla Ichnaea
Python
539
star
72

addons-frontend

Front-end to complement mozilla/addons-server
JavaScript
525
star
73

tls-observatory

An observatory for TLS configurations, X509 certificates, and more.
Go
518
star
74

neo

INACTIVE - http://mzl.la/ghe-archive - DEPRECATED: See https://neutrino.js.org for alternative
JavaScript
503
star
75

notes

DEPRECATED - A notepad for Firefox
HTML
493
star
76

nixpkgs-mozilla

Mozilla overlay for Nixpkgs.
Nix
490
star
77

bugbug

Platform for Machine Learning projects on Software Engineering
Python
487
star
78

django-csp

Content Security Policy for Django.
Python
486
star
79

skywriter

Mozilla Skywriter
JavaScript
481
star
80

Spoke

Easily create custom 3D environments
JavaScript
480
star
81

zamboni

Backend for the Firefox Marketplace
Python
475
star
82

vtt.js

A JavaScript implementation of the WebVTT specification
JavaScript
461
star
83

libdweb

Extension containing experimental libdweb APIs
JavaScript
441
star
84

FirefoxColor

Theming demo for Firefox Quantum and beyond
JavaScript
437
star
85

pointer.js

INACTIVE - http://mzl.la/ghe-archive - Normalizes mouse/touch events into 'pointer' events.
JavaScript
435
star
86

mozilla-django-oidc

A django OpenID Connect library
Python
418
star
87

cubeb

Cross platform audio library
C++
411
star
88

agithub

Agnostic Github client API -- An EDSL for connecting to REST servers
Python
410
star
89

fxa-auth-server

DEPRECATED - Migrated to https://github.com/mozilla/fxa
JavaScript
401
star
90

zilla-slab

Mozilla's Zilla Slab Type Family
Shell
391
star
91

r2d2b2g

Firefox OS Simulator is a test environment for Firefox OS. Use it to test your apps in a Firefox OS-like environment that looks and feels like a mobile phone.
JavaScript
391
star
92

masche

Deprecated - MIG Memory Forensic library
Go
387
star
93

qbrt

CLI to a Gecko desktop app runtime
JavaScript
386
star
94

mp4parse-rust

Parser for ISO Base Media Format aka video/mp4 written in Rust.
Rust
380
star
95

valence

INACTIVE - http://mzl.la/ghe-archive - Firefox Developer Tools protocol adapters (Unmaintained)
JavaScript
377
star
96

OpenDesign

Mozilla Open Design aims to bring open source principles to Creative Design. Find us on Matrix: chat.mozilla.org/#/room/#opendesign:mozilla.org
367
star
97

reflex

Functional reactive UI library
JavaScript
364
star
98

mortar

INACTIVE - http://mzl.la/ghe-archive - A collection of web app templates
364
star
99

minion

Minion
354
star
100

makedrive

[RETIRED] Webmaker Filesystem
JavaScript
352
star