Structural scraping for the rest of us.

Skyscraper

A framework that helps you build structured dumps of whole websites.

Concepts

Structural scraping and scrape trees

Think of Enlive. It allows you to parse arbitrary HTML and extract various bits of information out of it: subtrees or parts of subtrees determined by selectors. You can then convert this information to some other format, easier for machine consumption, or process it in whatever other way you wish. This is called scraping.

Now imagine that you have to parse a lot of HTML documents. They all come from the same site, so most of them are structured in the same way and can be scraped using the same sets of selectors. But not all of them. There’s an index page, which has a different layout and needs to be treated in its own peculiar way, with pagination and all. There are pages that group together individual pages in categories. And so on. Treating single pages is easy, but with whole collections of pages, you quickly find yourself writing a lot of boilerplate code.

In particular, you realize that you can’t just wget -r the whole thing and then parse each page in turn. Rather, you want to simulate the workflow of a user who tries to “click through” the website to obtain the information she’s interested in. Sites have a tree-like structure, and you want to keep track of this structure as you traverse the site, and reflect it in your output. I call it “structural scraping”, and the tree of traversed pages and information extracted from each one – the “scrape tree”.

Contexts

A “context” is a map from keywords to arbitrary data. Think of it as “everything we have scraped so far”. A context has two special keys, :url and :processor, that contain the next URL to visit and the processor to handle it with (see below).

Scraping works by transforming a context into a list of contexts. You can think of it as a list monad. The initial list of contexts is supplied by the user, and typically contains a single map with a URL and a root processor.

A typical function producing an initial list of contexts (a seed) looks like this:

(defn seed [& _]
  [{:url "http://www.example.com",
    :processor :root-page}])
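
To make the “context to list of contexts” step concrete, here is a minimal sketch in plain Clojure. It is not Skyscraper’s implementation: expand-context and toy-processor are hypothetical names, and merging the parent context into each output map is an assumption based on the description of a context as “everything we have scraped so far”.

```clojure
(defn expand-context
  "Applies process-fn to one context and returns a vector of child contexts.
  Each child is the parent context merged with one output map
  (an assumed merging rule, for illustration only)."
  [process-fn context]
  (mapv #(merge context %) (process-fn context)))

(defn toy-processor
  "Hypothetical stand-in for a processor: pretends the page at (:url context)
  links to two subpages. No HTTP request or HTML parsing happens here."
  [context]
  [{:page "A", :url (str (:url context) "/a"), :processor :subpage}
   {:page "B", :url (str (:url context) "/b"), :processor :subpage}])

(def seed-contexts
  [{:url "http://www.example.com", :processor :root-page}])

;; Like bind in the list monad: expand every context, then concatenate.
(def next-contexts
  (vec (mapcat #(expand-context toy-processor %) seed-contexts)))
```

In this toy model, each child context keeps whatever the parent had accumulated, while :url and :processor are overwritten to point at the next page to visit.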

Processors

A “processor” is a unit of scraping: a function that processes sets of HTML pages in a uniform way.

Processors are defined with the defprocessor macro (which registers the processing function in a global registry). A typical processor, for a site’s landing page that contains links to other pages within table cells, might look like this:

(defprocessor :landing-page
  :cache-template "mysite/index"
  :process-fn (fn [res context]
                (for [a (select res [:td :a])]
                  {:page (text a),
                   :url (href a),
                   :processor :subpage})))

The most important clause is :process-fn. This is the function called by the processor to extract new information from a page and include it in the context. It takes two parameters:

  1. an Enlive resource corresponding to the parsed HTML tree of the page being processed,
  2. the current context (i.e., combined outputs of all processors so far).

The output should be a seq of maps that each have a new URL and a new processor (specified as a keyword) to invoke next.
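
The relationship between the registry, the :processor key, and :process-fn can be sketched in plain Clojure. This is only a toy model under stated assumptions, not how defprocessor is actually implemented: registry, register-processor!, run-processor and fake-res are hypothetical names, and the “resource” is a hand-written map rather than a parsed HTML tree.

```clojure
(def registry
  "Toy stand-in for a global processor registry: keyword -> process-fn."
  (atom {}))

(defn register-processor!
  "Stores process-fn under the keyword k (hypothetical helper)."
  [k process-fn]
  (swap! registry assoc k process-fn))

;; A process-fn takes the resource and the current context and returns
;; a seq of maps, each naming the next URL and the next processor.
(register-processor! :landing-page
  (fn [res context]
    (for [{:keys [title link]} (:links res)]
      {:page title, :url link, :processor :subpage})))

(defn run-processor
  "Looks up the processor named by (:processor context) and calls it."
  [res context]
  (let [process-fn (get @registry (:processor context))]
    (process-fn res context)))

;; Hand-written data standing in for a parsed page.
(def fake-res
  {:links [{:title "First", :link "/first"}
           {:title "Second", :link "/second"}]})

(def outputs
  (run-processor fake-res
                 {:url "http://www.example.com", :processor :landing-page}))
```

Here outputs is a seq of two maps, one per link, each carrying :page, :url and :processor :subpage, which is the shape the paragraph above describes.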

Where to go from here

Explore the documentation. Have a look at examples in the examples/ directory of the repo. Read the docstrings, especially those of scrape and defprocessor.

If something is unclear, or you have suggestions or encounter a bug, please create an issue!

Caveats

Skyscraper is a work in progress. Some things are missing. The API is still in flux. Function and macro names, as well as input and output formats, are liable to change at any time. Suggestions for improvements are welcome (preferably as GitHub issues), as are pull requests.

License

Copyright (C) 2015–2022 Daniel Janus, http://danieljanus.pl

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
