  • Stars: 747
  • Rank: 60,741 (Top 2%)
  • Language: Ruby
  • License: Other
  • Created almost 15 years ago
  • Updated over 5 years ago


Repository Details

Extracts machine-readable metadata and content from Web pages

pismo - Web page content analysis and metadata extraction

DESCRIPTION:

Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents. Data that Pismo can extract include titles, feed URLs, ledes, body text, image URLs, date, and keywords.

All tests pass on Ruby 1.9.3 and 2.0.0. Pismo currently fails on JRuby 1.7.2 due to dependencies.

NEWS:

March 25, 2013: Version 0.8.0 is now the edge version (but not released as a gem yet). It may be incompatible with earlier releases as it has a LOT of commits and changes made by other people which have not yet been fully tested or audited. Install gem for 0.7.4 if you wish to remain on the 'stable' version for now.

February 27, 2013: Version 0.7.4 has been released to ensure Ruby 2.0.0 compatibility, but significant pull requests have yet to be merged and handled.

December 19, 2010: Version 0.7.2 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.

USAGE:

A basic example of extracting metadata from a Web page:

require 'pismo'

# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')

doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
doc.author    # => "Peter Cooper"
doc.lede      # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
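
As the sample output above suggests, the keywords come back as an array of [word, count] pairs. Here is a minimal sketch of working with that shape. It uses a hard-coded sample array in place of a live doc.keywords call, and the helper name top_keywords is made up for this example (it is not part of Pismo):

```ruby
# Sample data in the same shape as the doc.keywords output shown above.
# In real use this would come from doc.keywords.
keywords = [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2]]

# Hypothetical helper (not part of Pismo): return the n most frequent
# keywords as plain strings. Ties keep their original order.
def top_keywords(pairs, n)
  pairs
    .sort_by.with_index { |(_word, count), index| [-count, index] }
    .first(n)
    .map(&:first)
end

# Turn the pairs into a Hash for quick lookups by word.
counts = keywords.to_h

puts top_keywords(keywords, 2).inspect  # => ["cramp", "controllers"]
puts counts["app"]                      # => 3
```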

There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:

Pismo['http://www.rubyflow.com/items/4082'].title   # => "Install Ruby as a non-root User"

The current metadata methods are:

  • title
  • titles
  • author
  • authors
  • lede
  • keywords
  • sentences(qty)
  • body
  • html_body
  • feed
  • feeds
  • favicon
  • description
  • datetime

These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
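
Here is a rough sketch of what "your own technique" for picking a best result might look like. The pick_best helper and its heuristic (most frequent candidate, then longest) are assumptions made for illustration and are not something Pismo provides. The candidate list is made-up sample data standing in for the result of doc.titles:

```ruby
# Hypothetical helper (not part of Pismo): choose a "best" candidate from
# an array such as the one returned by doc.titles. The heuristic prefers
# the most frequently seen candidate and breaks ties by length.
def pick_best(candidates)
  cleaned = candidates.compact.map(&:strip).reject(&:empty?)
  return nil if cleaned.empty?

  cleaned.max_by { |candidate| [cleaned.count(candidate), candidate.length] }
end

# Made-up sample standing in for doc.titles.
titles = [
  "Cramp: Asychronous Event-Driven Ruby Web App Framework",
  "Ruby Inside",
  "Cramp: Asychronous Event-Driven Ruby Web App Framework"
]

puts pick_best(titles)
```

Returning nil for an empty list lets the caller fall back to the singular method (for example doc.title) when nothing usable was found.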

The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader". #body returns it as plain text, while #html_body maintains some basic HTML styling.

The default reader is the "tree" reader. This works in a similar fashion to Arc90's Readability or Safari Reader algorithm.

CAVEATS AND SHORTCOMINGS:

There are some shortcomings or problems that I'm aware of and am going to pursue:

  • I do not know how Pismo fares on Rubinius
  • pismo requires Bundler - get it :-)
  • pismo does not install on JRuby due to a problem in the fast-stemmer dependency
  • Some users have had issues with using Pismo from irb. This appears to be related to Nokogiri use causing a segfault
  • The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction
  • The author name extraction isn't very strong and is best avoided for now
  • The image extraction only deals with images with absolute URLs (optional; pass :all_images => true to Pismo::Document.new to include relative images)
  • The corpus in test/corpus needs significantly extending
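
On the image URL caveat above: if you do include relative images, you will probably want to resolve them against the page URL yourself. This is a minimal sketch using Ruby's standard URI library. The absolutize helper, the page URL, and the sample image paths are all made up for illustration, and the exact form in which Pismo hands back image URLs is an assumption here:

```ruby
require 'uri'

# Hypothetical helper (not part of Pismo): resolve each image URL against
# the page URL so that relative entries become absolute. Entries that are
# already absolute are returned unchanged.
def absolutize(page_url, image_urls)
  image_urls.map { |src| URI.join(page_url, src).to_s }
end

page = 'http://www.example.com/articles/post.html'              # made-up page URL
srcs = ['http://cdn.example.com/a.png', '/img/b.png', 'c.png']  # made-up sample

puts absolutize(page, srcs)
```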

OTHER GROOVY STUFF:

Command Line Tool

A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is great for testing, or perhaps for calling it from a non-Ruby script. The output is currently in YAML.

Usage:

./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime

Output:

---
:url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
:title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
:lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
:author: Peter Cooper
:datetime: 2010-01-07 12:00:00 +00:00
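
Since the output is YAML with Ruby symbol keys, a script that consumes it needs to allow for that. Below is a minimal sketch of reading it back with Ruby's standard YAML library. The text is pasted from the sample output above rather than captured from the tool, and it assumes Ruby 2.6 or later for the permitted_classes keyword:

```ruby
require 'yaml'

# Sample text copied from the output shown above. In real use it could be
# captured from the command line tool instead.
output = <<~OUTPUT
  ---
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
  :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
  :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
  :author: Peter Cooper
  :datetime: 2010-01-07 12:00:00 +00:00
OUTPUT

# The keys are Ruby symbols and the datetime is a timestamp, so both
# classes have to be permitted explicitly when using safe_load.
data = YAML.safe_load(output, permitted_classes: [Symbol, Time])

puts data[:title]
puts data[:author]
```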

If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded and assigned to both the constant 'P' and the variable @p.

Alternate readers

Pismo supports different readers for extracting the #body and #html_body from the web page.

The "cluster" reader uses an algorithm that tries to cluster contiguous content blocks together to identify the main document body. This is based on the ExtractContent gem (http://rubyforge.org/projects/extractcontent/).

The reader can be specified as an option to Pismo::Document.new:

doc = Pismo::Document.new(url, :reader => :cluster)

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so I don't break it in a future version unintentionally.
  • Commit. Do not mess with Rakefile, version, or history, as these are handled by Jeweler (which is awesome, btw).
  • Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)

COPYRIGHT AND LICENSE

Apache 2.0 License - See LICENSE for details. Copyright (c) 2009, 2010, 2013 Peter Cooper et al.

In short, you can use Pismo for whatever you like, commercial or not, but please include a brief credit (as in the NOTICE file, as per the Apache 2.0 License) somewhere deep in your license file or similar. If you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.

http://github.com/peterc/pismo
