Sinew is a Ruby library for collecting data from websites (scraping). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies. Sinew has been used to crawl millions of websites.
- Robust crawling with the Faraday HTTP client
- Aggressive caching with httpdisk
- Easy parsing with HTML cleanup, Nokogiri, JSON, etc.
- CSV generation for crawled data
```shell
# install gem
$ gem install sinew
```

Or add to your Gemfile:

```ruby
gem 'sinew'
```
Breaking change
We are pleased to announce the release of Sinew 4. The Sinew DSL now exposes a single `sinew` method in lieu of the many methods exposed in Sinew 3. Because of this single entry point, Sinew is much easier to embed in other applications. Also, each Sinew 4 request returns a full Response object to facilitate parallelism.
Sinew uses the Faraday HTTP client with the httpdisk middleware for aggressive caching of responses.
Here's an example that collects the links from httpbingo.org. Paste this into a file called `sample.sinew` and run `sinew sample.sinew`. It will create a `sample.csv` file containing the href and text for each link:
```ruby
# get the url
response = sinew.get "https://httpbingo.org"

# use nokogiri to collect links
response.noko.css("ul li a").each do |a|
  row = {}
  row[:url] = a[:href]
  row[:title] = a.text

  # append a row to the csv
  sinew.csv_emit(row)
end
```
There are three main features provided by Sinew.
Sinew uses recipe files to crawl web sites. Recipes have the `.sinew` extension, but they are plain old Ruby. Here's a trivial example that calls `get` to make an HTTP GET request:
```ruby
response = sinew.get "https://www.google.com/search?q=darwin"
response = sinew.get "https://www.google.com/search", q: "charles darwin"
```
Once you've done a `get`, you can access the document in a few different formats. In general, it's easiest to use `noko` to automatically parse and interact with HTML results. If Nokogiri isn't appropriate, fall back to regular expressions run against `body` or `html`. Use `json` if you are expecting a JSON response.
```ruby
response = sinew.get "https://www.google.com/search?q=darwin"

# pull out the links with nokogiri
links = response.noko.css("a").map { _1[:href] }
puts links.inspect

# or, use a regex to collect all the hrefs
links = response.html.scan(/<a[^>]+href="([^"]+)"/).flatten
puts links.inspect
```
Recipes output CSV files. To continue the example above:
```ruby
response = sinew.get "https://www.google.com/search?q=darwin"
response.noko.css("a").each do |i|
  row = {}
  row[:href] = i[:href]
  row[:text] = i.text
  sinew.csv_emit(row)
end
```
Sinew creates a CSV file with the same name as the recipe, and `csv_emit(hash)` appends a row. The values of your hash are cleaned up and converted to strings:
- Nokogiri nodes are converted to text
- Arrays are joined with "|", so you can separate them later
- HTML tags, entities and non-ascii chars are removed
- Whitespace is squished
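The cleanup rules above can be illustrated with a small sketch. Note this is a hypothetical `clean_value` helper written for this README, not Sinew's actual implementation; it just mimics the behaviors listed (array joining, tag stripping, whitespace squishing):

```ruby
# Hypothetical illustration of the cleanup rules, NOT Sinew's real code.
def clean_value(value)
  value = value.join("|") if value.is_a?(Array)  # arrays joined with "|"
  value = value.to_s
  value = value.gsub(/<[^>]+>/, " ")             # strip HTML tags
  value = value.gsub(/[^[:ascii:]]/, "")         # drop non-ascii chars
  value.gsub(/\s+/, " ").strip                   # squish whitespace
end

puts clean_value(["red", "green"])       # => "red|green"
puts clean_value("<b>hello</b>  world")  # => "hello world"
```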
Sinew uses httpdisk to aggressively cache all HTTP responses to disk in `~/.sinew`. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.
Sinew never deletes files from the cache - that's up to you! Sinew has various command line options to refresh the cache. See `--expires`, `--force` and `--force-errors`.
Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you work on your recipe.
The `sinew` command line has many useful options. You will be using this command many times as you iterate on your recipe:
```
$ bin/sinew --help
Usage: sinew [options] [recipe]
    -l, --limit        quit after emitting this many rows
        --proxy        use host[:port] as HTTP proxy
        --timeout      maximum time allowed for the transfer
    -s, --silent       suppress some output
    -v, --verbose      dump emitted rows while running

From httpdisk:
        --dir          set custom cache directory
        --expires      when to expire cached requests (ex: 1h, 2d, 3w)
        --force        don't read anything from cache (but still write)
        --force-errors don't read errors from cache (but still write)
```
Sinew also has many runtime options that can be set in your recipe. For example:

```ruby
sinew.options[:headers] = { 'User-Agent' => 'xyz' }
...
```
Here is the list of available options for `Sinew`:
- headers - default HTTP headers to use on every request
- ignore_params - ignore these query params when generating httpdisk cache keys
- insecure - ignore SSL errors
- params - default query parameters to use on every request
- rate_limit - minimum time between network requests
- retries - number of times to retry each failed request
- url_prefix - default URL base to use on every request
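Here's a hypothetical recipe fragment combining several of these options (the values shown are made up for illustration):

```ruby
# Hypothetical recipe fragment - adjust values to taste.
sinew.options[:headers] = { "User-Agent" => "my-crawler/1.0" }
sinew.options[:params] = { hl: "en" }            # sent with every request
sinew.options[:rate_limit] = 2                   # seconds between network requests
sinew.options[:retries] = 3
sinew.options[:url_prefix] = "https://httpbingo.org"

response = sinew.get "/get"                      # https://httpbingo.org/get
```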
- `sinew.get(url, params = nil, headers = nil)` - fetch a url with GET
- `sinew.post(url, body = nil, headers = nil)` - fetch a url with POST, using `form` as the URL encoded POST body
- `sinew.post_json(url, body = nil, headers = nil)` - fetch a url with POST, using `json` as the POST body
Each request method returns a `Sinew::Response`. The response has several helpers to make parsing easier:

- `body` - the raw body
- `html` - like `body`, but with a handful of HTML-specific whitespace cleanups
- `noko` - parse as HTML and return a Nokogiri document
- `xml` - parse as XML and return a Nokogiri document
- `json` - parse as JSON, with symbolized keys
- `mash` - parse as JSON and return a Hashie::Mash
- `url` - the url of the request. If the request goes through a redirect, `url` will reflect the final url.
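To show what the symbolized keys from `json` look like, here is a sketch using plain Ruby's stdlib JSON module on a sample body (no Sinew response required):

```ruby
require "json"

# A sample response body, parsed the way `response.json` does:
# JSON with symbolized keys.
body = '{"name": "darwin", "links": ["a", "b"]}'
data = JSON.parse(body, symbolize_names: true)

data[:name]   # => "darwin"
data[:links]  # => ["a", "b"]
```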
- `sinew.csv_header(columns)` - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to `sinew.csv_emit`.
- `sinew.csv_emit(hash)` - append a row to the CSV file
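The "header from the first row" behavior can be illustrated with Ruby's stdlib CSV library. This is just an analogy for what Sinew does internally, not Sinew's code:

```ruby
require "csv"

# Emitting hashes: the first hash's keys become the header row,
# analogous to skipping csv_header and letting csv_emit infer it.
rows = [{ href: "/a", text: "A" }, { href: "/b", text: "B" }]

out = CSV.generate do |csv|
  csv << rows.first.keys              # header from the first row's keys
  rows.each { |row| csv << row.values }
end

puts out
# href,text
# /a,A
# /b,B
```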
Sinew has some advanced helpers for checking the httpdisk cache. For the following methods, `body` hashes default to form body type.
- `sinew.cached?(method, url, params = nil, body = nil)` - check if a request is cached
- `sinew.uncache(method, url, params = nil, body = nil)` - remove the cache file, if any
- `sinew.status(method, url, params = nil, body = nil)` - get the httpdisk status
Plus some caching helpers in `Sinew::Response`:

- `diskpath` - the location on disk of the cached httpdisk response
- `uncache` - remove the cache file for this response
Writing Sinew recipes is fun and easy. The built-in caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
- Sinew doesn't (yet) check robots.txt - please check it manually.
- Prefer Nokogiri over regular expressions wherever possible. Learn CSS selectors.
- In Chrome, `$` in the console is your friend.
- Fall back to regular expressions if you're desperate. Depending on the site, use either `body` or `html`. `html` is probably your best bet. `body` is good for crawling Javascript, but it's fragile if the site changes.
- Learn to love `String#[regexp]`, which is an obscure operator but incredibly handy for Sinew.
- Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
- Don't be afraid to mix CSS selectors, regular expressions, and Ruby:

```ruby
noko.css("table")[4].css("td").select do
  _1[:width].to_i > 80
end.map(&:text)
```

- Debug your recipes using plain old `puts`, or better yet use `ap` from amazing_print.
- Run `sinew -v` to get a report on every `csv_emit`. Very handy.
- Add the CSV files to your git repo. That way you can version them and get diffs!
- Caching is based on URL, so use caution with cookies and other forms of authentication
- Almost no support for international (non-english) characters
- Updated dependencies, added justfile
- Rewritten to use simpler DSL
- Upgraded to httpdisk 0.5 to take advantage of the new encoding support
- Major rewrite of network and caching layer. See above.
- Use Faraday HTTP client with sinew middleware for caching.
- Supports multiple proxies (`--proxy host1,host2,...`)
- Handle and cache more errors (too many redirects, connection failures, etc.)
- Support for adding uri.scheme in generate_cache_key
- Added `status` code, a peer to `uri`, `raw`, etc.
- `&amp;` now normalizes to `&` (not `and`)
- Support for `--limit`, `--proxy` and the `xml` variable
- Dedup - warn and ignore if `row[:url]` has already been emitted
- Auto gunzip if contents are compressed
- Support for legacy cached `head` files from Sinew 1
- Complete rewrite. See above.
...
Sinew is licensed under the MIT License.