• Stars: 5,062
• Rank: 8,221 (Top 0.2%)
• Language: JavaScript
• License: Apache License 2.0
• Created about 8 years ago
• Updated about 1 year ago

Repository Details

📜 Extract meaningful content from the chaos of a web page

Postlight Parser - Extracting content from chaos


Postlight's Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Postlight Parser powers Postlight Reader, a browser extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site.

Postlight Parser allows you to easily create custom parsers using simple JavaScript and CSS selectors. This allows you to proactively manage parsing and migration edge cases. There are many examples available along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @postlight/parser

# If you're using npm
npm install @postlight/parser

Usage

import Parser from '@postlight/parser';

Parser.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Parser.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Parser is unable to find a field, that field will return null.
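
Because any field may come back as null, code that consumes a result usually needs fallbacks. The following is a minimal sketch of one way to handle that. The helper name `describeResult` and the fallback strings are assumptions made for illustration; they are not part of Postlight Parser. The sample object only mimics the shape of the result shown above.

```javascript
// Hypothetical helper (not part of Postlight Parser): builds a one-line
// summary from a parse result, substituting fallbacks for null fields.
function describeResult(result) {
  const title = result.title ?? '(untitled)';
  const author = result.author ?? 'unknown author';
  // date_published is an ISO 8601 string or null; keep only the date part.
  const date = result.date_published
    ? new Date(result.date_published).toISOString().slice(0, 10)
    : 'undated';
  const words = result.word_count ?? 0;
  return `${title} by ${author} (${date}, ${words} words)`;
}

// Sample object shaped like the result shown above.
const sample = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  date_published: '2016-09-16T20:56:00.000Z',
  lead_image_url: null,
  word_count: 4677,
};

console.log(describeResult(sample));
// prints: Thunder (mascot) by Wikipedia Contributors (2016-09-16, 4677 words)
```

With a real result, the same helper could be applied inside the `then` callback, for example `Parser.parse(url).then(result => console.log(describeResult(result)))`.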

parse() Options

Content Formats

By default, Postlight Parser returns the content field as HTML. You can override this by passing options to the parse function that specify whether to scrape all pages of an article and what type of output to return (valid values are 'html', 'markdown', and 'text'). For example:

Parser.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Parser.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
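
When cookies and other settings come from application data rather than string literals, the options object can be assembled programmatically. The sketch below is one possible approach, not an API provided by Postlight Parser: `cookieHeader` and `buildOptions` are hypothetical helper names, and the cookie values are joined as-is without any escaping. It only produces the `contentType` and `headers` options described above.

```javascript
// The three content types listed under Content Formats.
const CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical helper: serializes { name: value } pairs into a Cookie
// header value such as 'name=value; name2=value2'. Values are not escaped.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

// Hypothetical helper: assembles an options object for Parser.parse().
function buildOptions({ contentType = 'html', cookies = {}, userAgent } = {}) {
  if (!CONTENT_TYPES.includes(contentType)) {
    throw new RangeError(`Unsupported contentType: ${contentType}`);
  }
  const headers = {};
  if (Object.keys(cookies).length > 0) headers.Cookie = cookieHeader(cookies);
  if (userAgent) headers['User-Agent'] = userAgent;
  return { contentType, headers };
}

const options = buildOptions({
  contentType: 'markdown',
  cookies: { name: 'value', name2: 'value2', name3: 'value3' },
});
console.log(options.headers.Cookie);
// prints: name=value; name2=value2; name3=value3
```

The resulting object would then be passed as the second argument, as in `Parser.parse(url, options)`.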

Pre-fetched HTML

You can use Postlight Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Parser.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that the URL argument is still supplied so that Parser can identify the website and use its custom parser, if one exists. The URL is not used to fetch content.
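
A common reason to pass pre-fetched HTML is that the page already sits in a cache, a file, or the response of your own HTTP client. The sketch below shows that pattern under stated assumptions: `parseWithHtml` and `loadHtml` are hypothetical names, and the parser is passed in as an argument (in practice, the `Parser` import from the Usage section) so that the only Parser call involved is the documented `parse(url, { html })`.

```javascript
// Hypothetical helper: obtains HTML through a caller-supplied loader and
// hands it to the parser together with the original URL.
//   parser   - an object exposing parse(url, options), e.g. the Parser import
//   loadHtml - an async function returning the page's HTML as a string
async function parseWithHtml(parser, url, loadHtml) {
  const html = await loadHtml(url);
  if (typeof html !== 'string' || html.length === 0) {
    throw new TypeError(`No HTML available for ${url}`);
  }
  // The URL is passed along so a site-specific parser can still be selected.
  return parser.parse(url, { html });
}
```

A call might look like `parseWithHtml(Parser, url, readFromCache)`, where `readFromCache` stands in for whatever storage you use.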

The command-line parser

Postlight Parser also ships with a CLI, meaning you can use it from your command line like so:


# Install Postlight Parser globally
yarn global add @postlight/parser
#   or
npm -g install @postlight/parser

# Then
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"

# Pass optional --add-extractor argument to add a custom extractor at runtime.
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js
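
When the CLI is driven from a script, the flags above can be generated from data instead of being typed by hand. The following sketch builds an argument list that uses only the `--format` and `--header.name=value` flags shown above. The helper name `cliArgs` is hypothetical and is not shipped with Postlight Parser.

```javascript
// Hypothetical helper: builds the argument list for the postlight-parser CLI
// from a URL, an optional format, and optional request headers.
function cliArgs(url, { format, headers = {} } = {}) {
  const args = [url];
  if (format) args.push(`--format=${format}`);
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }
  return args;
}

const args = cliArgs(
  'https://postlight.com/trackchanges/mercury-goes-open-source',
  { format: 'markdown', headers: { Cookie: 'name=value; name2=value2' } }
);
console.log(args);
```

Assuming the CLI is installed globally, the array could be passed to `execFile('postlight-parser', args, callback)` from Node's `child_process` module. Because `execFile` does not go through a shell, header values containing spaces or semicolons need no extra quoting.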

License

Licensed under either of the following, at your preference: the Apache License, Version 2.0, or the MIT License.

Contributing

For details on how to contribute to Postlight Parser, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.


🔬 A Labs project from your friends at Postlight. Happy coding!
