• Stars: 107
• Rank: 312,276 (Top 7 %)
• Language: JavaScript
• License: MIT License
• Created almost 8 years ago
• Updated almost 2 years ago

Automatically extracts structured information from webpages

Web Auto Extractor

Parse semantically structured information from any HTML webpage.

Supported formats:

  • Encodings that support Schema.org vocabularies:
    • Microdata
    • RDFa-lite
    • JSON-LD
  • Arbitrary meta tags

Many websites mark up their webpages with Schema.org vocabularies for better SEO. This library parses that information into JSON.

Demo it on tonicdev

Installation

npm install web-auto-extractor

Usage

// CommonJS
var WAE = require('web-auto-extractor').default
// Or, with ES modules (use one of the two, not both):
// import WAE from 'web-auto-extractor'

// sampleHTML is a string containing the page's HTML
var parsed = WAE().parse(sampleHTML)

Let's use the following text as the sampleHTML in our example. It uses Schema.org vocabularies to describe a Product and is encoded in the microdata format.

Input

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="brand">ACME</span>
  <span itemprop="name">Executive Anvil</span>
  <img itemprop="image" src="anvil_executive.jpg" alt="Executive Anvil logo" />
  <span itemprop="description">Sleeker than ACME's Classic Anvil, the
    Executive Anvil is perfect for the business traveler
    looking for something to drop from a height.
  </span>
  Product #: <span itemprop="mpn">925872</span>
  <span itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4.4</span> stars, based on <span itemprop="reviewCount">89
      </span> reviews
  </span>

  <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    Regular price: $179.99
    <meta itemprop="priceCurrency" content="USD" />
    $<span itemprop="price">119.99</span>
    (Sale ends <time itemprop="priceValidUntil" datetime="2020-11-05">
      5 November!</time>)
    Available from: <span itemprop="seller" itemscope itemtype="http://schema.org/Organization">
                      <span itemprop="name">Executive Objects</span>
                    </span>
    Condition: <link itemprop="itemCondition" href="http://schema.org/UsedCondition"/>Previously owned,
      in excellent condition
    <link itemprop="availability" href="http://schema.org/InStock"/>In stock! Order now!
  </span>
</div>

Output

The parsed object should look like this:

{
  "microdata": {
    "Product": [
      {
        "@context": "http://schema.org/",
        "@type": "Product",
        "brand": "ACME",
        "name": "Executive Anvil",
        "image": "anvil_executive.jpg",
        "description": "Sleeker than ACME's Classic Anvil, the\n    Executive Anvil is perfect for the business traveler\n    looking for something to drop from a height.",
        "mpn": "925872",
        "aggregateRating": {
          "@context": "http://schema.org/",
          "@type": "AggregateRating",
          "ratingValue": "4.4",
          "reviewCount": "89"
        },
        "offers": {
          "@context": "http://schema.org/",
          "@type": "Offer",
          "priceCurrency": "USD",
          "price": "119.99",
          "priceValidUntil": "5 November!",
          "seller": {
            "@context": "http://schema.org/",
            "@type": "Organization",
            "name": "Executive Objects"
          },
          "itemCondition": "http://schema.org/UsedCondition",
          "availability": "http://schema.org/InStock"
        }
      }
    ]
  },
  "rdfa": {},
  "jsonld": {},
  "metatags": {
    "priceCurrency": [
      "USD",
      "USD"
    ]
  }
}

The parsed object has four top-level keys: microdata, rdfa, jsonld and metatags. Since the HTML above has no information encoded in RDFa or JSON-LD, those two objects are empty.
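As a rough illustration of working with that structure, the sketch below walks a trimmed copy of the output shown above. The summarizeParsed helper is hypothetical and not part of web-auto-extractor; it only assumes the result has the microdata, rdfa and jsonld keys described here. Note that values such as price arrive as strings, so convert them yourself if you need numbers.

```javascript
// Hypothetical helper, not part of web-auto-extractor:
// lists the schema types found under each non-empty format key.
function summarizeParsed (parsed) {
  const summary = {}
  for (const format of ['microdata', 'rdfa', 'jsonld']) {
    const types = Object.keys(parsed[format] || {})
    if (types.length > 0) summary[format] = types
  }
  return summary
}

// Trimmed copy of the parsed output from the example above.
const parsed = {
  microdata: {
    Product: [
      {
        '@context': 'http://schema.org/',
        '@type': 'Product',
        name: 'Executive Anvil',
        offers: {
          '@context': 'http://schema.org/',
          '@type': 'Offer',
          priceCurrency: 'USD',
          price: '119.99'
        }
      }
    ]
  },
  rdfa: {},
  jsonld: {},
  metatags: { priceCurrency: ['USD', 'USD'] }
}

console.log(summarizeParsed(parsed)) // { microdata: [ 'Product' ] }

// Each type maps to an array of items; nested items are plain objects.
const product = parsed.microdata.Product[0]
const price = Number(product.offers.price) // values are strings in the output
console.log(product.name, price, product.offers.priceCurrency)
```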

Caveat

This is less a caveat than a design decision: the parser is strict. If the HTML isn't marked up correctly, the output may not be what you expect, which can make the parser look broken.

For example, take the following HTML snippet.

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Ghostbusters</h1>
  <div itemprop="productionCompany" itemscope itemtype="http://schema.org/Organization">Black Rhino</div>
  <div itemprop="countryOfOrigin" itemscope itemtype="http://schema.org/Country">
    Country: <span itemprop="name" content="USA">United States</span><p>
  </div>
</div>

The problem is that the productionCompany itemprop, whose itemtype is Organization, has no itemprop among its children (in this case, a name).

The parser assumes that every itemtype contains at least one itemprop (or, for RDFa, that every typeof contains a property), so the "Black Rhino" information is lost.

It would be nice to fix this with a non-strict parsing mode. PRs are welcome.
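In the meantime, one way to spot this situation before parsing is a quick pre-check of the markup. The following is only a minimal sketch: findEmptyItemscopes is a hypothetical helper, not part of web-auto-extractor, and it scans tags with a regular expression rather than a real HTML parser, so it will be fooled by comments, scripts, or unusual attribute quoting. It reports the itemtype of every itemscope element that has no itemprop descendant.

```javascript
// Elements that never have a closing tag.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr'])

// Hypothetical pre-check, not part of web-auto-extractor.
// Returns the itemtype of each itemscope element with no itemprop inside it.
function findEmptyItemscopes (html) {
  const tagPattern = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>/g
  const stack = [] // currently open elements
  const empty = []
  const closeElement = (el) => {
    if (el.isScope && !el.hasProp) empty.push(el.itemtype)
  }

  let match
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/'
    const name = match[2].toLowerCase()
    const attrs = match[3]

    if (isClosing) {
      // Pop back to the matching open tag, tolerating unclosed elements.
      const index = stack.map((el) => el.name).lastIndexOf(name)
      if (index === -1) continue
      while (stack.length > index) closeElement(stack.pop())
      continue
    }

    // An itemprop belongs to the nearest enclosing itemscope, not to itself.
    if (/\bitemprop\s*=/.test(attrs)) {
      for (let i = stack.length - 1; i >= 0; i--) {
        if (stack[i].isScope) { stack[i].hasProp = true; break }
      }
    }

    if (VOID_TAGS.has(name) || /\/\s*$/.test(attrs)) continue

    const typeMatch = /\bitemtype\s*=\s*["']([^"']*)["']/.exec(attrs)
    stack.push({
      name,
      isScope: /\bitemscope\b/.test(attrs),
      hasProp: false,
      itemtype: typeMatch ? typeMatch[1] : null
    })
  }
  while (stack.length > 0) closeElement(stack.pop())
  return empty
}

// The Movie snippet from the example above.
const movieHTML = `
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Ghostbusters</h1>
  <div itemprop="productionCompany" itemscope itemtype="http://schema.org/Organization">Black Rhino</div>
  <div itemprop="countryOfOrigin" itemscope itemtype="http://schema.org/Country">
    Country: <span itemprop="name" content="USA">United States</span><p>
  </div>
</div>`

console.log(findEmptyItemscopes(movieHTML)) // [ 'http://schema.org/Organization' ]
```

A non-empty result points at the scopes whose text the strict parser is likely to drop; wrapping that text in an element with an itemprop (for example a name) is the fix on the markup side.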

License

MIT
