• Stars
    902
  • Rank 48,963 (Top 1.0 %)
  • Language
    Ruby
  • License
    Apache License 2.0
  • Created over 14 years ago
  • Updated about 3 years ago

Repository Details

Port of arc90's readability project to Ruby

Ruby Readability

Ruby Readability is a tool for extracting the primary readable content of a webpage. It is a Ruby port of arc90's readability project.

Install

Command line:

(sudo) gem install ruby-readability

Bundler:

gem "ruby-readability", :require => 'readability'

Example

require 'readability'
require 'open-uri'

# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri.
source = URI.open('http://lab.arc90.com/experiments/readability/').read
puts Readability::Document.new(source).content

Options

You may provide options to Readability::Document.new, including:

  • :tags: the base whitelist of tags to sanitize, defaults to %w[div p];
  • :remove_empty_nodes: remove <p> tags that have no text content; also removes <p> tags that contain only images;
  • :attributes: whitelist of allowed attributes;
  • :debug: provide debugging output, defaults false;
  • :encoding: if the page is of a known encoding, you can specify it; if left unspecified, the encoding will be guessed (only in Ruby 1.9.x). If you wish to disable guessing, supply :do_not_guess_encoding => true;
  • :html_headers: in Ruby 1.9.x these will be passed to the guess_html_encoding gem to aid with guessing the HTML encoding;
  • :ignore_image_format: image formats to skip, for use with #images. For example: :ignore_image_format => ["gif", "png"];
  • :min_image_height: set a minimum image height for #images;
  • :min_image_width: set a minimum image width for #images;
  • :blacklist and :whitelist: CSS selectors to remove from the document, or to restrict extraction to, respectively.
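
As a minimal sketch of how several of these options might be combined: the option keys are taken from the list above, while the selector strings, the encoding value, and the helper name extract_article are hypothetical and chosen only for illustration.

```ruby
# Illustrative option values. The keys come from the options list above;
# the selectors and the encoding are hypothetical examples.
EXTRACTION_OPTIONS = {
  tags: %w[div p a],                # keep links in addition to div and p
  attributes: %w[href],             # allow only href on the kept tags
  encoding: 'UTF-8',                # skip guessing when the encoding is known
  blacklist: '.sidebar, .comments', # hypothetical selectors to remove
  whitelist: '#main'                # hypothetical selector to restrict to
}.freeze

# Hypothetical helper: returns the extracted content for a raw HTML string.
def extract_article(html, options = EXTRACTION_OPTIONS)
  require 'readability' # loaded lazily; needs the ruby-readability gem
  Readability::Document.new(html, options).content
end
```

Calling extract_article(source) with the source string from the Example section would then print or return the cleaned content under these settings.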

Command Line Tool

Readability comes with a command-line tool, bin/readability, for experimentation.

Usage: readability [options] URL
    -d, --debug                      Show debug output
    -i, --images                     Keep images and links
    -h, --help                       Show this message

Images

You can get a list of images in the content area with Document#images. This feature requires that the fastimage gem be installed.

rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
rbody.images
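
The image-related options from the Options section can be passed in the same call. The following is a sketch only: the thresholds and the helper name article_images are hypothetical, and it assumes both the ruby-readability and fastimage gems are installed.

```ruby
# Option keys are from the Options section; the numeric thresholds and the
# ignored format are hypothetical values chosen for illustration.
IMAGE_OPTIONS = {
  tags: %w[div p img a],
  attributes: %w[src href],
  remove_empty_nodes: false,
  min_image_width: 200,         # hypothetical minimum width
  min_image_height: 100,        # hypothetical minimum height
  ignore_image_format: %w[gif]  # skip GIFs
}.freeze

# Hypothetical helper: returns the result of Document#images for an HTML string.
def article_images(html, options = IMAGE_OPTIONS)
  require 'readability' # loaded lazily; Document#images also needs fastimage
  Readability::Document.new(html, options).images
end
```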

Related Projects

  • readability.cr - Port of ruby-readability's port of arc90's readability project to Crystal
  • newspaper - An advanced news extraction, article extraction, and content curation library for Python

Potential Issues

If you're on a Mac and are getting segmentation faults, see the discussion at sparklemotion/nokogiri#404 and consider updating your version of libxml2. Version 2.7.8 of libxml2, installed with brew, worked for me:

gem install nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26

Or if you're using bundler and Rails 3, you can run this command to make bundler always globally build nokogiri this way:

bundle config build.nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26

License

This code is under the Apache License 2.0. See http://www.apache.org/licenses/LICENSE-2.0.

Ruby port by cantino, starrhorne, libc, and iterationlabs. Special thanks to fizx and marcosinger.

More Repositories

  1. mcfly - Fly through your shell history. Great Scott! (Rust, 5,564 stars)
  2. selectorgadget - Go go CSS / DOM inspection. (JavaScript, 991 stars)
  3. reckon - Flexibly import bank account CSV files into Ledger for command-line accounting (Ruby, 403 stars)
  4. my_obfuscate - Standalone Ruby code for the selective re-writing of SQL dumps in order to protect user privacy. (Ruby, 87 stars)
  5. browser-friend - GPT in your browser (TypeScript, 84 stars)
  6. walker_method - A Ruby implementation of Walker's Alias Method for quickly sampling from an array with a given probability distribution (Ruby, 71 stars)
  7. twitter_to_csv - Dump the Twitter stream to JSON and CSV, then apply filters, reject non-English content, do sentiment analysis, and more. (Ruby, 64 stars)
  8. post_location - An iOS application that POSTs your location to a webhook of your choosing. (Ruby, 53 stars)
  9. jsoneditor - JavaScript widget for inline JSON editing (JavaScript, 46 stars)
  10. expando - A jQuery plugin for text that grows on you (HTML, 44 stars)
  11. chrome_pipe - A Chrome extension experiment with JavaScript UNIXy pipes (JavaScript, 30 stars)
  12. ruby_on_ruby - An unholy amalgam of therubyracer's V8 engine and emscripted-ruby to allow a truly sandboxed Ruby-on-Ruby environment. (Ruby, 25 stars)
  13. heroku-selectable-procfile - A Heroku Buildpack that allows Procfile selection by environmental variable. Chain it using https://github.com/ddollar/heroku-buildpack-multi (Shell, 17 stars)
  14. airtable-ml - Neural network-based Airtable Block for automatic prediction & classification (TypeScript, 16 stars)
  15. guess_html_encoding - A small gem that attempts to guess and then force encoding of HTML documents for Ruby 1.9 (Ruby, 10 stars)
  16. rquad - Ruby Quadtree Library (Ruby, 9 stars)
  17. ideamachine - IdeaMachine - A generative grammar for fun and profit (Ruby, 5 stars)
  18. threeve - Bringing Ruby's math into the 21st century (Ruby, 3 stars)
  19. huginn_xero_agent - Huginn Agent for Xero invoice creation (Ruby, 3 stars)
  20. confabulator - Ruby generative grammar for conversational text (Ruby, 2 stars)
  21. andrewcantino.com - My personal website (HTML, 2 stars)
  22. feature_set (Ruby, 2 stars)
  23. active_record_record - Record ActiveRecord's Record Allocation (Ruby, 1 star)
  24. blog.andrewcantino.com - The source of my Jekyll-powered blog (HTML, 1 star)