  • Language
    JavaScript
  • License
    MIT License
  • Created over 10 years ago
  • Updated about 8 years ago

Repository Details

A scraping command line tool for the modern web

quickscrape

quickscrape is a simple command-line tool for powerful, modern website scraping.

Description

quickscrape is not like other scraping tools. It is designed to enable large-scale content mining. Here's what makes it different:

Websites can be rendered in a GUI-less browser (PhantomJS via CasperJS). This has some important benefits:

  • Many modern websites are only barely specified in their HTML and are rendered with JavaScript after the page is loaded. Headless browsing ensures the version of the HTML you scrape is the same one human visitors would see on their screen.
  • User interactions can be simulated. This is useful whenever content is only loaded after interaction, for example when article content is gradually loaded by AJAX during scrolling.
  • The full DOM specification is supported (because the backend is WebKit). This means pages with complex JavaScript that uses rarely used parts of the DOM (for example, Facebook) can be rendered, which most existing tools cannot do.

Scrapers are defined in separate JSON files that follow a defined structure (scraperJSON). This too has important benefits:

  • No programming required! Non-programmers can make scrapers using a text editor and a web browser with an element inspector (e.g. Chrome).
  • Large collections of scrapers can be maintained to retrieve similar sets of information from different pages. For example: newspapers or academic journals.
  • Any other software supporting the same format could use the same scraper definitions.
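
To give a feel for what such a definition might contain, here is a minimal sketch written as a JavaScript object and serialised to JSON. The key names (url, elements, selector, attribute), the XPath selectors, and the example.org pattern are assumptions for illustration only; check them against the scraperJSON specification before relying on them.

```javascript
// Hypothetical scraper definition. The key names ("url", "elements",
// "selector", "attribute") are assumptions and should be checked
// against the scraperJSON specification.
const scraper = {
  // Pattern matched against the page URL (assumed field, placeholder value).
  url: "example\\.org",
  elements: {
    title: {
      // XPath selecting a <meta> tag whose attribute holds the value.
      selector: "//meta[@name='citation_title']",
      attribute: "content"
    },
    author_name: {
      selector: "//meta[@name='citation_author']",
      attribute: "content"
    }
  }
};

// Serialise to the JSON text that would be saved in a file such as example.json.
const scraperJson = JSON.stringify(scraper, null, 2);
console.log(scraperJson);
```

Because the definition is plain data, the same file could in principle be edited by hand in a text editor, which is the point made in the list above.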

quickscrape is being developed to allow the community early access to the technology that will drive ContentMine, such as ScraperJSON and our Node.js scraping library thresher.

The software is under rapid development, so please be aware there may be bugs. If you find one, please report it on the issue tracker.

Installation

Prerequisites

You'll need Node.js (node), a platform for running standalone JavaScript applications. You'll also need the Node package manager (npm), which usually comes with Node.js. Installing Node.js via the operating system's package manager can lead to issues. If you already have Node.js installed and it requires sudo to install node packages, it was installed the wrong way. The easiest way to do it right on Unix systems (e.g. Linux, OS X) is to use NVM, the Node version manager.

First, install NVM:

curl https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash

or, if you don't have curl:

wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.24.1/install.sh | bash

NB: on OS X, you will need to have the developer tools installed (e.g. by installing Xcode).

Then install Node.js 0.10, which also installs npm, and set that version as the default:

source ~/.nvm/nvm.sh
nvm install 0.10
nvm alias default 0.10
nvm use default

Now you should have node and npm available. Check by running:

node -v
npm -v

If both of those printed version numbers, you're ready to move on to installing quickscrape.

Quickscrape

quickscrape is very easy to install. Simply:

npm install --global quickscrape

Documentation

Run quickscrape --help from the command line to get help:

Usage: quickscrape [options]

Options:

-h, --help               output usage information
-V, --version            output the version number
-u, --url <url>          URL to scrape
-r, --urllist <path>     path to file with list of URLs to scrape (one per line)
-s, --scraper <path>     path to scraper definition (in JSON format)
-d, --scraperdir <path>  path to directory containing scraper definitions (in JSON format)
-o, --output <path>      where to output results (directory will be created if it doesn't exist)
-r, --ratelimit <int>    maximum number of scrapes per minute (default 3)
-h --headless            render all pages in a headless browser
-l, --loglevel <level>   amount of information to log (silent, verbose, info*, data, warn, error, or debug)
-f, --outformat <name>   JSON format to transform results into (currently only bibjson)

You must provide scraper definitions in ScraperJSON format as used in the ContentMine journal-scrapers.
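
As a rough illustration of the --urllist option, the sketch below writes a list file with one URL per line and assembles a matching command line. It only uses flags from the help text above. The URLs, the scrapers and output directory names, and the temporary file location are placeholders, and the script prints the command rather than running it. It assumes a recent Node.js version.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Placeholder URLs; replace these with the pages you want to scrape.
const urls = [
  "https://example.org/articles/1",
  "https://example.org/articles/2"
];

// Write one URL per line, as --urllist expects.
const listPath = path.join(os.tmpdir(), "urls.txt");
fs.writeFileSync(listPath, urls.join("\n") + "\n");

// Assemble the argument list using only flags from the help text.
const args = [
  "--urllist", listPath,
  "--scraperdir", "scrapers", // hypothetical directory of scraper definitions
  "--output", "output",       // hypothetical output directory
  "--ratelimit", "3"          // the documented default, written out explicitly
];

// A limit of 3 scrapes per minute averages out to one scrape every 20 seconds.
const secondsPerScrape = 60 / 3;

console.log("quickscrape " + args.join(" "));
```

At the default rate limit, a long URL list takes a while to process, so it can be worth estimating the run time from secondsPerScrape before starting.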

Examples

1. Extract data from a single URL with a predefined scraper

First, you'll want to grab some pre-cooked definitions:

git clone https://github.com/ContentMine/journal-scrapers.git

Now just run quickscrape:

quickscrape \
  --url https://peerj.com/articles/384 \
  --scraper journal-scrapers/scrapers/peerj.json \
  --output peerj-384 \
  --outformat bibjson

You'll see log messages informing you how the scraping proceeds:

[Screenshot: single URL log output]

Then in the peerj-384 directory there are several files:

$ tree peerj-384
peerj-384/
  └── https_peerj.com_articles_384
    ├── bib.json
    ├── fig-1-full.png
    ├── fulltext.html
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json

  • fulltext.html is the fulltext HTML (duh!)
  • results.json is a JSON file containing all the captured data
  • bib.json is a JSON file containing the results in bibJSON format
  • fig-1-full.png is the downloaded image from the only figure in the paper

results.json looks like this (truncated):

{
  "publisher": {
    "value": [
      "PeerJ Inc."
    ]
  },
  "journal_name": {
    "value": [
      "PeerJ"
    ]
  },
  "journal_issn": {
    "value": [
      "2167-8359"
    ]
  },
  "title": {
    "value": [
      "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss"
    ]
  },
  "keywords": {
    "value": [
      "Pendred; MLPA; DFNB4; \n          SLC26A4\n        ; FOXI1 and KCNJ10; Genotyping; Genetics; SNHL"
    ]
  },
  "author_name": {
    "value": [
      "Lynn M. Pique",
      "Marie-Luise Brennan",
      "Colin J. Davidson",
      "Frederick Schaefer",
      "John Greinwald Jr",
      "Iris Schrijver"
    ]
  }
}
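
Each captured element in results.json holds its matches in a value array, and some values (such as keywords above) carry stray whitespace from the page. The following sketch shows one possible way to tidy and flatten such data for downstream use. The sample object is a shortened copy of the excerpt above, and the helper names clean and flatten are hypothetical, not part of quickscrape. It assumes a recent Node.js version.

```javascript
// Sample shaped like the results.json excerpt above (shortened).
const results = {
  journal_name: { value: ["PeerJ"] },
  journal_issn: { value: ["2167-8359"] },
  author_name: { value: ["Lynn M. Pique", "Marie-Luise Brennan"] },
  keywords: {
    value: ["Pendred; MLPA; DFNB4; \n          SLC26A4\n        ; FOXI1 and KCNJ10"]
  }
};

// Collapse runs of whitespace (including newlines) and remove the
// space left in front of a semicolon.
function clean(text) {
  return text.replace(/\s+/g, " ").replace(/ ;/g, ";").trim();
}

// Map each field to its cleaned values; single-value fields become strings.
function flatten(data) {
  const flat = {};
  for (const [key, field] of Object.entries(data)) {
    const values = field.value.map(clean);
    flat[key] = values.length === 1 ? values[0] : values;
  }
  return flat;
}

const flat = flatten(results);
console.log(flat.keywords); // Pendred; MLPA; DFNB4; SLC26A4; FOXI1 and KCNJ10
```

To use it on real output, the sample object could be replaced with the parsed contents of a results.json file, for example via JSON.parse(fs.readFileSync(path, "utf8")).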

bib.json looks like this (truncated):

{
  "title": "Mutation analysis of the SLC26A4, FOXI1 and KCNJ10 genes in individuals with congenital hearing loss",
  "link": [
    {
      "type": "fulltext_html",
      "url": "https://peerj.com/articles/384"
    },
    {
      "type": "fulltext_pdf",
      "url": "https://peerj.com/articles/384.pdf"
    },
    {
      "type": "fulltext_xml",
      "url": "/articles/384.xml"
    }
  ],
  "author": [
    {
      "name": "Lynn M. Pique",
      "institution": "Department of Pathology, Stanford University Medical Center, Stanford, CA, USA"
    },
    {
      "name": "Marie-Luise Brennan",
      "institution": "Department of Pediatrics, Stanford University Medical Center, Stanford, CA, USA"
    }
  ]
}
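
Note that the fulltext_xml link in the excerpt above is a relative path, while the other two links are absolute. If you need absolute URLs throughout, one option is to resolve each link against the article URL, as in this sketch. It uses the standard URL class available as a global in recent Node.js versions (not in the 0.10 release installed above), and the variable names are illustrative.

```javascript
// The URL that was scraped, used as the base for relative links.
const articleUrl = "https://peerj.com/articles/384";

// Links copied from the bib.json excerpt above.
const links = [
  { type: "fulltext_pdf", url: "https://peerj.com/articles/384.pdf" },
  { type: "fulltext_xml", url: "/articles/384.xml" }
];

// new URL(input, base) resolves relative paths against the base
// and leaves absolute URLs unchanged.
const resolved = links.map((link) => ({
  type: link.type,
  url: new URL(link.url, articleUrl).href
}));

console.log(resolved[1].url); // https://peerj.com/articles/384.xml
```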

Contributing

We are not yet accepting contributions. If you'd like to help, please drop me an email and I'll let you know when we're ready for that.

Release History

  • 0.1.0 - initial version with simple one-element scraping
  • 0.1.1 - multiple-member elements; clean exiting; massive speedup
  • 0.1.2 - ability to grab text or HTML content of a selected node via special attributes text and html
  • 0.1.3 - refactor into modules, full logging suite, much more robust downloading
  • 0.1.4 - multiple URL processing, bug fixes, reduce dependency list
  • 0.1.5 - fix bug in bubbling logs up from PhantomJS
  • 0.1.6 - add dependency checking option
  • 0.1.7 - fix bug where jsdom rendered external resources (#10)
  • 0.2.0 - core moved out to separate library: thresher. PhantomJS and CasperJS binaries now managed through npm to simplify installation.
  • 0.2.1 - fix messy metadata
  • 0.2.3 - automatic scraper selection
  • 0.2.4-5 - bump thresher dependency for bug fixes
  • 0.2.6-7 - new Thresher API
  • 0.2.8 - fix Thresher API use
  • 0.3.0 - use Thresher 0.1.0 and scraperJSON 0.1.0
  • 0.3.1 - update the reported version number left out of last release
  • 0.3.2 - fix dependencies
  • 0.3.3-6 - bug fixes
  • 0.3.7 - fix bug in bibJSON dates. Bump to thresher 0.1.3.
  • 0.4.0 - fix various bugs (with urllists, tokenized urls), print help when run with no args, update all dependencies.
  • 0.4.1 - fix version number reporting.

License

Copyright (c) 2014 Shuttleworth Foundation. Licensed under the MIT license.
