metascraper


A library to easily get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

What is it

The metascraper library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.

It follows a few principles:

  • Have a high accuracy for online articles by default.
  • Make it simple to add new rules or override existing ones.
  • Don't restrict rules to CSS selectors or text accessors.

Getting started

Let's extract accurate information from a website.

First, metascraper expects you to provide the HTML markup behind the target URL.

There are multiple ways to get the HTML markup. In our case, we are going to run a programmatic headless browser to simulate real user navigation, so the data obtained will be close to a real-world example.

const getHTML = require('html-get')

/**
 * `browserless` will be passed to `html-get`
 * as driver for getting the rendered HTML.
 */
const browserless = require('browserless')()

const getContent = async url => {
  // create a browser context inside the main Chromium process
  const browserContext = browserless.createContext()
  const promise = getHTML(url, { getBrowserless: () => browserContext })
  // close browser resources before returning the result
  promise.then(() => browserContext).then(browser => browser.destroyContext())
  return promise
}

/**
 * `metascraper` is a collection of tiny packages,
 * so you can just use what you actually need.
 */
const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

/**
 * The main logic
 */
getContent('https://microlink.io')
  .then(metascraper)
  .then(metadata => console.log(metadata))
  .then(browserless.close)
  .then(process.exit)

The output will be something like:

{
  "author": "Microlink HQ",
  "date": "2022-07-10T22:53:04.856Z",
  "description": "Enter a URL, receive information. Normalize metadata. Get HTML markup. Take a screenshot. Identify tech stack. Generate a PDF. Automate web scraping. Run Lighthouse",
  "image": "https://cdn.microlink.io/logo/banner.jpeg",
  "logo": "https://cdn.microlink.io/logo/trim.png",
  "publisher": "Microlink",
  "title": "Turns websites into data — Microlink",
  "url": "https://microlink.io/"
}

What data it detects

Note: Custom metadata detection can be defined using a rule bundle.

Here is an example of the metadata that metascraper can detect:

  • audio — e.g. https://cf-media.sndcdn.com/U78RIfDPV6ok.128.mp3
    An audio URL that best represents the article.

  • author — e.g. Noah Kulwin
    A human-readable representation of the author's name.

  • date — e.g. 2016-05-27T00:00:00.000Z
    An ISO 8601 representation of the date the article was published.

  • description — e.g. Venture capitalists are raising money at the fastest rate...
    The publisher's chosen description of the article.

  • video — e.g. https://assets.entrepreneur.com/content/preview.mp4
    A video URL that best represents the article.

  • image — e.g. https://assets.entrepreneur.com/content/3x2/1300/20160504155601-GettyImages-174457162.jpeg
    An image URL that best represents the article.

  • lang — e.g. en
    An ISO 639-1 representation of the content language.

  • logo — e.g. https://entrepreneur.com/favicon180x180.png
    An image URL that best represents the publisher's brand.

  • publisher — e.g. Fast Company
    A human-readable representation of the publisher's name.

  • title — e.g. Meet Wall Street's New A.I. Sheriffs
    The publisher's chosen title of the article.

  • url — e.g. http://motherboard.vice.com/read/google-wins-trial-against-oracle-saves-9-billion
    The URL of the article.

How it works

metascraper is built out of rules bundles.

It was designed to be easy to adapt. You can compose your own transformation pipeline using existing rules or write your own.

Rules bundles are collections of HTML selectors targeting a specific property. When you load the library, the core rules are loaded implicitly.

Each rules bundle loads a set of selectors in order to resolve a specific value.

Rules are applied in priority order: the first rule that successfully resolves the value stops the remaining rules for that property. Rules are intentionally sorted from specific to generic.

Rules act as fallbacks for one another:

  • If the first rule fails, the second rule is tried.
  • If the second rule fails, the third rule is tried.
  • And so on.

metascraper continues until all the rules are exhausted or a rule resolves the value.
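The fallback behavior described above can be sketched as a first-match resolver. This is a simplified illustration, not metascraper's actual internals, and the regex-based rules are hypothetical stand-ins for real selector rules:

```javascript
// Try each rule in order; the first rule that resolves a non-empty value wins.
const resolveValue = (rules, html) => {
  for (const rule of rules) {
    const value = rule(html)
    if (value !== undefined && value !== null && value !== '') return value
  }
}

// Hypothetical rules for `title`, intentionally sorted from specific to generic.
const titleRules = [
  html => (html.match(/<meta property="og:title" content="([^"]*)"/) || [])[1],
  html => (html.match(/<title>([^<]*)<\/title>/) || [])[1]
]

// No Open Graph tag present, so the second (more generic) rule resolves.
resolveValue(titleRules, '<title>Fallback Title</title>') // → 'Fallback Title'
```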

Importing rules

metascraper exports a constructor that needs to be initialized with the collection of rules to load:

const metascraper = require('metascraper')([
  require('metascraper-author')(),
  require('metascraper-date')(),
  require('metascraper-description')(),
  require('metascraper-image')(),
  require('metascraper-logo')(),
  require('metascraper-clearbit')(),
  require('metascraper-publisher')(),
  require('metascraper-title')(),
  require('metascraper-url')()
])

Again, the order in which rules are loaded matters: only the first rule that resolves the value will be applied.

Use the first parameter to pass custom options to each rules bundle:

const metascraper = require('metascraper')([
  require('metascraper-clearbit')({
    size: 256,
    format: 'jpg'
  })
])

Rules bundles

Can't find the rules bundle you want? Open an issue to request it.

Official

Rules bundles maintained by metascraper maintainers.

Core essential

Vendor specific

Community

Rules bundles maintained by individual users.

See CONTRIBUTING for adding your own module!

API

constructor(rules)

Create a new metascraper instance, explicitly declaring the rules bundles to be used.

rules

Type: Array

The collection of rules bundles to be loaded.

metascraper(options)

Call the instance to extract content based on the rules bundles provided to the constructor.

options

url

Required
Type: String

The URL associated with the HTML markup.

It is used to resolve relative links that can be present in the HTML markup.

It can also be used as a fallback field for different rules.
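For instance, a relative path found in the markup can be resolved against the page URL using the WHATWG URL constructor. This is a minimal illustration of why the base URL matters, with a made-up path:

```javascript
// A relative image path from the markup, resolved against the page URL.
const resolved = new URL('/logo/trim.png', 'https://microlink.io/some/page')
console.log(resolved.href) // → 'https://microlink.io/logo/trim.png'
```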

html

Type: String

The HTML markup for extracting the content.

rules

Type: Array

You can pass additional rules to add at execution time.

These rules will be merged at the front of your loaded rules.

validateUrl

Type: boolean
Default: true

Ensure the provided URL is WHATWG URL API compliant.
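A sketch of what WHATWG URL validation amounts to (an assumption about the check, not metascraper's actual code): a string is compliant if the URL constructor can parse it.

```javascript
const isWhatwgUrl = value => {
  try {
    new URL(value) // throws a TypeError on non-compliant input
    return true
  } catch {
    return false
  }
}

isWhatwgUrl('https://microlink.io') // true
isWhatwgUrl('not a url') // false
```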

Benchmark

To give you an idea of how accurate metascraper is, here is a comparison of similar libraries:

Library     metascraper  html-metadata  node-metainspector  open-graph-scraper  unfluff
Correct     95.54%       74.56%         61.16%              66.52%              70.90%
Incorrect   1.79%        1.79%          0.89%               6.70%               10.27%
Missed      2.68%        23.67%         37.95%              26.34%              8.95%

A big part of the reason for metascraper's higher accuracy is that it relies on a series of fallbacks for each piece of metadata, instead of just looking for the most commonly-used, spec-compliant pieces of metadata, like Open Graph.

metascraper's default settings are targeted specifically at parsing online articles, which is why it can be tuned more precisely for that purpose than the other libraries.

If you're interested in the breakdown by individual pieces of metadata, check out the full comparison summary, or dive into the raw result data for each library.

License

metascraper © Microlink, released under the MIT License.
Authored and maintained by Microlink with help from contributors.

microlink.io · GitHub microlinkhq · Twitter @microlinkhq
