Extract structured data from the web using GraphQL.

graphql-scraper

GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?

graphql-scraper is a command-line tool and reusable GraphQL schema which lets you easily extract data from HTML.

Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.

The command-line tool

npx graphql-scraper <query-file>

or

npm install -g graphql-scraper
graphql-scraper <query-file>

Reads a GraphQL query from the file at query-file, executes it, and prints the result.

If query-file is not given, reads the query from stdin.

Command-line options

  • --json Returns the result in JSON format, for use in other tools.
  • --help Prints a help string.

Variables

Any other named option you pass to the CLI is used as a query variable of the same name.

For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):

query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}

...and execute the query like this:

graphql-scraper query.graphql --page="https://news.ycombinator.com/"
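To make the mapping from named options to query variables concrete, here is a minimal sketch of how arguments like the ones above could be turned into a variables object. The helper parseVariables and the RESERVED_FLAGS set are hypothetical names used for illustration; this is not graphql-scraper's actual implementation, and it only handles the --name=value form.

```typescript
// Hypothetical helper, not part of graphql-scraper's API.
// The two documented options are treated as flags, not as query variables.
const RESERVED_FLAGS = new Set(['json', 'help'])

function parseVariables(argv: string[]): Record<string, string> {
  const variables: Record<string, string> = {}
  for (const arg of argv) {
    // Positional arguments (such as the query file) are ignored.
    if (!arg.startsWith('--')) continue
    const eq = arg.indexOf('=')
    // Options without a value (such as --json) carry no variable.
    if (eq === -1) continue
    const name = arg.slice(2, eq)
    const value = arg.slice(eq + 1)
    if (RESERVED_FLAGS.has(name)) continue
    variables[name] = value
  }
  return variables
}

// The shell strips the quotes around the URL before the program sees it.
const variables = parseVariables(['query.graphql', '--page=https://news.ycombinator.com/'])
console.log(variables)
```

The resulting object has one key per named option, so --page fills the $page variable declared in the query.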

The schema

You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.

Re-using the schema in your own projects

The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of the graphql-js GraphQLSchema class, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.

Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.

Basic example with graphql

import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')


const query = `
{
  page(url: "http://news.ycombinator.com") {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
`

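// Object-argument form: required by graphql-js v16 and later, where the
// positional graphql(schema, query) signature was removed.
graphql({ schema, source: query }).then(response => {
  console.log(response)
})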
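The data field of the response mirrors the aliases used in the query. The following is a minimal sketch of consuming that result in a typed way. It assumes that every leaf value is a nullable string (a selector may match nothing); the Story and ScrapeData interfaces, the formatStories helper, and the sample values are all made up for illustration and are not exported by graphql-scraper.

```typescript
// Shape implied by the aliases in the query above.
// Assumption: leaves are nullable strings, because a selector may match nothing.
interface Story {
  rank: string | null
  title: string | null
  sitebit: string | null
  url: string | null
  attrs: { score: string | null; user: string | null; comments: string | null } | null
}

interface ScrapeData {
  page: { items: Story[] } | null
}

// Hypothetical helper: renders each story as one line of text.
function formatStories(data: ScrapeData): string[] {
  const items = data.page?.items ?? []
  return items.map(story => {
    const score = story.attrs?.score ?? 'no score'
    return `${story.rank ?? '?'} ${story.title ?? '(untitled)'} [${score}]`
  })
}

// Made-up sample data, not real scraper output.
const sample: ScrapeData = {
  page: {
    items: [
      {
        rank: '1.',
        title: 'Example story',
        sitebit: 'example.com',
        url: 'https://example.com/',
        attrs: { score: '42 points', user: 'alice', comments: '7 comments' },
      },
    ],
  },
}

console.log(formatStories(sample).join('\n'))
```

In real use you would pass response.data (after checking response.errors) instead of the sample object.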

Background

This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.

If you want to switch over from gdom, please note some schema changes:

  • query(selector: String!) now returns a single Element rather than a list, like document.querySelector. A new queryAll(selector: String!): [Element] field returns all matching elements, like document.querySelectorAll.
  • is(selector: String!) is renamed to has(selector: String!).
  • children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
  • parents is removed.
  • prev[All] is renamed to previous[All].

Maintainers

@lachenmayer

Contribute

PRs accepted.

License

MIT © 2018 harry lachenmayer
