Repository Details

Scrape websites for text by CSS selector.

Web Scraper

Web Scraper makes it effortless to scrape websites. You provide a URL and a CSS selector, and it returns JSON containing the text contents of the matching elements. You can also scrape HTML attribute values by specifying an optional attribute name.

Examples

Heading from example.com

web.scraper.workers.dev/?url=example.com&selector=h1

{"result":["Example Domain"]}

Profile details from github.com profile page

web.scraper.workers.dev/?url=github.com/adamschwartz&selector=.vcard-fullname,.d-md-block+[itemprop=worksFor],.d-md-block+[itemprop=homeLocation]&pretty=true

{
  "result": {
    ".vcard-fullname": [
      "Adam Schwartz"
    ],
    ".d-md-block [itemprop=worksFor]": [
      "@cloudflare"
    ],
    ".d-md-block [itemprop=homeLocation]": [
      "Boston, MA"
    ]
  }
}

Random quote/author from quotes.net

web.scraper.workers.dev/?url=quotes.net/random.php&selector=%23disp-quote-body,.author&pretty=true

{
  "result": {
    "#disp-quote-body": [
      "We are advertis'd by our loving friends."
    ],
    ".author": [
      "William Shakespeare"
    ]
  }
}
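In the request above, the `#` of `#disp-quote-body` is written as `%23`, since a raw `#` would begin the URL fragment and would not be sent to the server. The following is a minimal sketch of encoding a selector before placing it in the query string, using the standard `encodeURIComponent`. The helper name `encodeSelector` is hypothetical and is not part of Web Scraper.

```javascript
// Hypothetical helper: percent-encode a CSS selector for use as a query value.
// encodeURIComponent also encodes the comma as %2C, which decodes back to ",".
function encodeSelector(selector) {
  return encodeURIComponent(selector);
}

const encodedSelector = encodeSelector("#disp-quote-body,.author");
console.log(encodedSelector);
```

Decoding the value with `decodeURIComponent` gives back the original selector string.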

API

  • Requests are made as GET against https://web.scraper.workers.dev.
  • There are two required query params, url and selector.
  • There are three optional query params, attr, pretty and spaced.

https://web.scraper.workers.dev
  ?url=https://example.com
  &selector=p
  &attr=title
  &pretty=true
  &spaced=true
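A request URL like the one above can be assembled with the standard `URL` and `URLSearchParams` APIs, which handle the percent-encoding. This is only a sketch: the helper name `buildScraperUrl` and its options object are assumptions made for illustration, not part of Web Scraper.

```javascript
// Hypothetical helper: build a GET URL for the scraper endpoint.
// `url` and `selector` are required; `attr`, `pretty` and `spaced` are optional.
function buildScraperUrl({ url, selector, attr, pretty, spaced }) {
  if (!url || !selector) {
    throw new Error("url and selector are required");
  }
  const endpoint = new URL("https://web.scraper.workers.dev");
  endpoint.searchParams.set("url", url);
  endpoint.searchParams.set("selector", selector);
  if (attr) endpoint.searchParams.set("attr", attr);
  if (pretty) endpoint.searchParams.set("pretty", "true");
  if (spaced) endpoint.searchParams.set("spaced", "true");
  return endpoint.toString();
}

const requestUrl = buildScraperUrl({
  url: "https://example.com",
  selector: "p",
  attr: "title",
  pretty: true,
  spaced: true,
});
console.log(requestUrl);
```

The resulting string can be passed to any HTTP client that issues GET requests.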

How it works

If at least url and selector are set, the response value will always be JSON.

If only one node is found on the page matching the selector, the result will be a string. If more than one node is found, the result will be an array of strings.

If an attr is provided, the result will be a string matching only the first node found which has a non-empty value for that HTML attribute.
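Because the result may be a string, an array of strings, or (as in the multi-selector examples) an object keyed by selector, client code may want to normalize it before use. The sketch below shows one possible approach. The helper names `toArray` and `normalizeResult` are hypothetical.

```javascript
// Hypothetical helper: coerce a scraped value into an array.
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Normalize a `result` that may be a string, an array of strings, or an
// object keyed by selector, so callers can always iterate over arrays.
function normalizeResult(result) {
  if (result !== null && typeof result === "object" && !Array.isArray(result)) {
    return Object.fromEntries(
      Object.entries(result).map(([selector, value]) => [selector, toArray(value)])
    );
  }
  return toArray(result);
}

console.log(normalizeResult("Example Domain"));
console.log(normalizeResult({ ".author": "William Shakespeare" }));
```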

Query params

url (required)

  • Supports https:// and http:// protocols.
  • If a protocol isn’t found, http:// is prepended.
    • e.g. https://web.scraper.workers.dev/?url=example.com&selector=p
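The protocol rule above can be sketched as follows. This is an illustration of the described behavior rather than the project's actual code, and the function name `withProtocol` is hypothetical.

```javascript
// Hypothetical helper: keep http:// and https:// URLs as they are,
// and prepend http:// when no supported protocol is present.
function withProtocol(url) {
  return /^https?:\/\//i.test(url) ? url : "http://" + url;
}

console.log(withProtocol("example.com"));
console.log(withProtocol("https://example.com"));
```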

selector (required)

  • Supports the same set of CSS selectors as Cloudflare Workers' HTMLRewriter class
    • As of Oct 10, 2019, this includes:
      • * – any element
      • E – any element of type E
      • E:not(s) – an E element that does not match the compound selector s
      • E.warning – an E element belonging to the class warning
      • E#myid – an E element with ID equal to myid
      • E[foo] – an E element with a foo attribute
      • E[foo="bar"] – an E element whose foo attribute value is exactly equal to bar
      • E[foo="bar" i] – an E element whose foo attribute value is exactly equal to any (ASCII-range) case-permutation of bar
      • E[foo="bar" s] – an E element whose foo attribute value is exactly and case-sensitively equal to bar
      • E[foo~="bar"] – an E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar
      • E[foo^="bar"] – an E element whose foo attribute value begins exactly with the string bar
      • E[foo$="bar"] – an E element whose foo attribute value ends exactly with the string bar
      • E[foo*="bar"] – an E element whose foo attribute value contains the substring bar
      • E[foo|="en"] – an E element whose foo attribute value is a hyphen-separated list of values beginning with en
      • E F – an F element descendant of an E element
      • E > F – an F element child of an E element
  • Supports multiple selectors delimited with a comma.

attr (optional)

  • When attr is not set, the text contents of all matched nodes are returned.
  • When attr is set, that HTML attribute is scraped from the first matching node with a non-empty value for that attribute.
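The difference between the two modes can be sketched on a simplified model in which each matched node is a plain object with `text` and `attributes`. That model, the function name `scrape`, and the empty-string fallback when no node has the attribute are assumptions made for illustration. The real service works on streamed HTML.

```javascript
// Illustrative model only: nodes are plain objects, not real DOM elements.
function scrape(nodes, attr) {
  if (!attr) {
    // No attr: return the text contents of all matched nodes.
    return nodes.map((node) => node.text);
  }
  // attr set: use the first node with a non-empty value for that attribute.
  const match = nodes.find((node) => (node.attributes || {})[attr]);
  // Assumption: fall back to an empty string when nothing matches.
  return match ? match.attributes[attr] : "";
}

const nodes = [
  { text: "First", attributes: { title: "" } },
  { text: "Second", attributes: { title: "Hello" } },
];
console.log(scrape(nodes));
console.log(scrape(nodes, "title"));
```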

pretty (optional)

  • When false or not included, JSON is minified.
  • When true, JSON is formatted using JSON.stringify(json, null, 2).
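The effect of pretty can be reproduced locally with `JSON.stringify`. In this small sketch, the wrapper name `serialize` is hypothetical.

```javascript
// Hypothetical wrapper: minified output by default, two-space indentation
// when `pretty` is true.
function serialize(json, pretty) {
  return pretty ? JSON.stringify(json, null, 2) : JSON.stringify(json);
}

const payload = { result: ["Example Domain"] };
console.log(serialize(payload, false));
console.log(serialize(payload, true));
```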

spaced (optional)

  • When false or not included, the text nodes of children of the nodes matching selector will be concatenated raw.
  • When true, a single space character is added after the end tag of each child node found.

Examples

Consider the following DOM structure:

<div><p>This is the first paragraph.</p><p>This is another paragraph.</p></div>

If the selector is set to match div, by default the resulting text will be:

This is the first paragraph.This is another paragraph.

This is because there is no space character between </p> and <p>.

With spaced set to true, the result is:

This is the first paragraph. This is another paragraph.
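The behavior in this example can be approximated with the sketch below, which takes the text of each child node as an array of strings. The function name `collectText` and the final trim of the trailing space are assumptions, not the project's actual code.

```javascript
// Illustrative only: `childTexts` stands in for the text of each child node.
function collectText(childTexts, spaced) {
  // With spaced, add a single space after each child's text.
  const joined = childTexts.map((text) => (spaced ? text + " " : text)).join("");
  // Assumption: the trailing space after the last child is trimmed.
  return spaced ? joined.trim() : joined;
}

const paragraphs = ["This is the first paragraph.", "This is another paragraph."];
console.log(collectText(paragraphs, false));
console.log(collectText(paragraphs, true));
```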

Development

Web Scraper is powered by Cloudflare Workers, heavily utilizing HTMLRewriter for parsing the HTML and scraping the text.

To develop Web Scraper locally, pull down the repo, and follow these steps:

  1. Install the Workers CLI globally:
npm i @cloudflare/[email protected] -g
  2. Run the preview/watcher inside the repo:
wrangler preview --watch

This will open up the Workers preview experience, so you can test and debug the site. The main source can be found in index.js. As you make changes you’ll see them live in the previewer.

Deploying

Web Scraper is deployed automatically when changes are pushed to master using a GitHub Action and the Workers CLI.

Author

Web Scraper was created by Adam Schwartz.
