graphql-scraper
GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?
graphql-scraper is a command-line tool and reusable GraphQL schema which lets you easily extract data from HTML.
Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.
The command-line tool
npx graphql-scraper <query-file>
or
npm install -g graphql-scraper
graphql-scraper <query-file>
Reads a GraphQL query from the path query-file and prints the result.
If query-file is not given, reads the query from stdin.
Command-line options
- --json: Returns the result in JSON format, for use in other tools.
- --help: Prints a help string.
Variables
Any other named option you pass to the CLI is used as a query variable.
For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):
query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
...and execute the query like this:
graphql-scraper query.graphql --page="https://news.ycombinator.com/"
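To make the mapping from named options to query variables concrete, here is a minimal sketch of how such a translation could work. It is not graphql-scraper's actual argument parser: the helper name optionsToVariables, the restriction to the --name=value form, and the treatment of value-less options as true are all assumptions made for illustration.

```javascript
// Hypothetical helper: NOT graphql-scraper's actual argument parser.
// Flags the README documents as CLI options rather than query variables.
const RESERVED_FLAGS = new Set(['json', 'help'])

function optionsToVariables(argv) {
  const variables = {}
  for (const arg of argv) {
    // Only named options count; positional arguments such as the query file are skipped.
    if (!arg.startsWith('--')) continue
    const body = arg.slice(2)
    // Split on the first '=' only, so values such as URLs with query strings stay intact.
    const eq = body.indexOf('=')
    const name = eq === -1 ? body : body.slice(0, eq)
    if (name === '' || RESERVED_FLAGS.has(name)) continue
    // Assumption: an option without a value is treated as boolean true.
    variables[name] = eq === -1 ? true : body.slice(eq + 1)
  }
  return variables
}

const variables = optionsToVariables([
  'query.graphql',
  '--json',
  '--page=https://news.ycombinator.com/',
])
// variables.page now holds the URL that fills $page in the query above.
console.log(variables)
```

Under these assumptions, --json is consumed as an option while --page ends up as the value of the $page variable.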
The schema
You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.
Re-using the schema in your own projects
The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of graphql-js GraphQLSchema, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.
Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.
Basic example with graphql
import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')
const query = `
  {
    page(url: "http://news.ycombinator.com") {
      items: queryAll(selector: "tr.athing") {
        rank: text(selector: "td span.rank")
        title: text(selector: "td.title a")
        sitebit: text(selector: "span.comhead a")
        url: attr(selector: "td.title a", name: "href")
        attrs: next {
          score: text(selector: "span.score")
          user: text(selector: "a:first-of-type")
          comments: text(selector: "a:nth-of-type(3)")
        }
      }
    }
  }
`
graphql(schema, query).then(response => {
console.log(response)
})
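The response logged above is a graphql-js execution result, which carries the query result under data and reports problems under errors rather than rejecting the promise. A small wrapper can turn that into ordinary error handling. The helper name unwrapResult and the sample response below are hypothetical and only illustrate the shape; real values depend on the page being scraped.

```javascript
// Hypothetical helper, not part of graphql-scraper or graphql-js.
function unwrapResult(response) {
  // graphql-js reports query problems in `errors` instead of rejecting the promise.
  if (response.errors && response.errors.length > 0) {
    const messages = response.errors.map(error => error.message).join('; ')
    throw new Error(`GraphQL query failed: ${messages}`)
  }
  return response.data
}

// Stand-in for a resolved response; the item values are made up for illustration.
const exampleResponse = {
  data: { page: { items: [{ rank: '1.', title: 'Example title' }] } },
}
const { page } = unwrapResult(exampleResponse)
console.log(page.items.map(item => item.title))
```

In the example above, this could be used as graphql(schema, query).then(unwrapResult), so that a failed query surfaces in a catch handler.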
Background
This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.
If you want to switch over from gdom, please note some schema changes:
- query(selector: String!) now only returns a single Element, rather than a list (like document.querySelector). Added a new queryAll(selector: String!): [Element] field, which behaves like document.querySelectorAll.
- is(selector: String!) is renamed to has(selector: String!).
- children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
- parents is removed.
- prev[All] is renamed to previous[All].
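The pure renames in the list above can be applied mechanically. The following is a rough sketch of a migration helper, assuming a simple text rewrite is good enough for your queries. The function name migrateGdomQuery is hypothetical, and because it works on raw text rather than a parsed query, it would also rewrite aliases named prev or matching text inside selector strings.

```javascript
// Hypothetical migration helper: a naive text rewrite, not a GraphQL parser.
function migrateGdomQuery(query) {
  return query
    // gdom's `query` returned a list, so `queryAll` keeps the old behaviour.
    .replace(/\bquery(\s*\(\s*selector\b)/g, 'queryAll$1')
    // `is` is now called `has`.
    .replace(/\bis(\s*\(\s*selector\b)/g, 'has$1')
    // `prev` and `prevAll` are now `previous` and `previousAll`.
    .replace(/\bprev(All)?\b/g, 'previous$1')
}

const gdomQuery = `{
  page(url: "https://news.ycombinator.com/") {
    rows: query(selector: "tr.athing") {
      isStory: is(selector: ".athing")
      before: prev {
        score: text(selector: "span.score")
      }
    }
  }
}`

console.log(migrateGdomQuery(gdomQuery))
```

The removed parents field and the dropped selector arguments on children, parent, siblings, next etc. have no one-to-one replacement, so those parts of a query still need to be rewritten by hand.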
Maintainers
Contribute
PRs accepted.
License
MIT © 2018 harry lachenmayer