• This repository has been archived on 26/Aug/2020
  • Language
    Ruby
  • License
    MIT License
  • Created about 16 years ago
  • Updated about 5 years ago

LOOKING FOR A MAINTAINER

ScrAPI toolkit for Ruby

A framework for writing scrapers using CSS selectors and simple select => extract => store processing rules.

Here’s an example that scrapes auctions from eBay:

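# Scraper for a single auction row.
ebay_auction = Scraper.define do
  # Store the link text as the description and its href attribute as the URL.
  process "h3.ens>a", :description=>:text,
                      :url=>"@href"
  process "td.ebcPr>span", :price=>:text
  process "div.ebPicture >a>img", :image=>"@src"

  # Attributes returned for each auction.
  result :description, :url, :price, :image
end

# Scraper for the whole listing page.
ebay = Scraper.define do
  # Collect every matched auction into an array.
  array :auctions

  # Apply the auction scraper to each matching table row.
  process "table.ebItemlist tr.single",
          :auctions => ebay_auction

  result :auctions
end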

And using the scraper:

auctions = ebay.scrape(html)

# No. of auctions found
puts auctions.size

# First auction:
auction = auctions[0]
puts auction.description
puts auction.url

To get the latest source code with regular updates:

svn co labnotes.org/svn/public/ruby/scrapi

Version of Ruby

ScrAPI 1.2.x is tested with Ruby 1.8.6 and 1.8.7, but will not work on Ruby 1.9.x.

ScrAPI 2.0.x switches to TidyFFI and runs on Ruby 1.9.2 and newer.

Due to a bug in Ruby’s visibility context handling (see changelog #29578 and bug #3406 on the official Ruby page), you need to declare all result attributes explicitly, using the result method or attr_reader/attr_accessor.

Using TIDY

By default, scrAPI uses Tidy (actually Tidy-FFI) to clean up the HTML.

You need to install the Tidy Gem for Ruby:

gem install tidy_ffi

And the Tidy binary libraries, available here:

http://tidy.sourceforge.net/

By default, scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy, so that is one place to put the Tidy library.

Alternatively, just point Tidy to the library with:

TidyFFI.library_path = "...."

On Linux this would probably be:

TidyFFI.library_path = "/usr/local/lib/libtidy.so"

On OS X this would probably be:

TidyFFI.library_path = "/usr/lib/libtidy.dylib"
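If the same script has to run on more than one platform, the path can be chosen at runtime. The following is only a sketch: the helper names tidy_library_candidates and find_tidy_library are hypothetical (not part of scrAPI or TidyFFI), and the candidate paths are just the "probable" locations mentioned above, so adjust them to wherever Tidy is actually installed.

```ruby
require "rbconfig"

# Hypothetical helper: returns the probable Tidy library locations for a
# platform. The paths are the ones suggested in this README, not a
# complete list; no Windows path is given here, so other platforms get none.
def tidy_library_candidates(host_os = RbConfig::CONFIG["host_os"])
  case host_os
  when /darwin/ then ["/usr/lib/libtidy.dylib"]
  when /linux/  then ["/usr/local/lib/libtidy.so"]
  else []
  end
end

# Hypothetical helper: returns the first candidate that exists on disk,
# or nil if none of them do.
def find_tidy_library(host_os = RbConfig::CONFIG["host_os"])
  tidy_library_candidates(host_os).find { |path| File.exist?(path) }
end

path = find_tidy_library

# Only configure TidyFFI when it has been loaded and a library was found;
# otherwise leave the default lookup untouched.
TidyFFI.library_path = path if path && defined?(TidyFFI)
```

When no candidate exists, nothing is assigned and the default lib/tidy lookup described above still applies.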

For testing purposes, you can also use the built-in HTML parser. It’s useful for testing and getting to grips with scrAPI, but it doesn’t deal well with broken HTML. So for testing only:

Scraper::Base.parser :html_parser

License

Copyright © 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License

Developed for co.mments.com

Code and documentation: labnotes.org

HTML cleanup and good hygiene by Tidy, Copyright © 1998-2003 World Wide Web Consortium. License at tidy.sourceforge.net/license.html

HTML DOM extracted from Rails, Copyright © 2004 David Heinemeier Hansson. Under MIT license.

HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license. www.jin.gr.jp/~nahi/Ruby/html-parser/README.html

Porting to Ruby 1.9.x by Christoph Lupprich, lupprich.info
