  • Language: JavaScript
  • License: MIT License

broken-link-checker

Find broken links, missing images, etc within your HTML.

  • Complete: Unicode, redirects, compression, basic authentication, absolute/relative/local URLs.
  • ⚡️ Fast: Concurrent, streamed and cached.
  • 🍰 Easy: Convenient defaults and very configurable.

Other features:

  • Support for many HTML elements and attributes; not only <a href> and <img src>.
  • Support for relative URLs with <base href>.
  • WHATWG specifications-compliant HTML and URL parsing.
  • Honor robot exclusions (robots.txt, headers and rel), optionally.
  • Detailed information for reporting and maintenance.
  • URL keyword filtering with simple wildcards.
  • Pause/Resume at any time.

Installation

Node.js >= 14 is required. There are two ways to use it:

Command Line Usage

To install, type this at the command line:

npm install broken-link-checker -g

After that, check out the help for available options:

blc --help

A typical site-wide check might look like:

blc http://yoursite.com -ro
# or
blc path/to/index.html -ro

Note: HTTP proxies are not directly supported. If your network is configured incorrectly with no resolution in sight, you could try using a container with proxy settings.

Programmatic API

To install, type this at the command line:

npm install broken-link-checker

The remainder of this document will assist you in using the API.

Classes

While all classes have been exposed for custom use, the one that you need will most likely be SiteChecker.

HtmlChecker

Scans an HTML document to find broken links. All methods from EventEmitter are available.

const {HtmlChecker} = require('broken-link-checker');

const htmlChecker = new HtmlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots) => {})
  .on('queue', () => {})
  .on('junk', (result) => {})
  .on('link', (result) => {})
  .on('complete', () => {});

htmlChecker.scan(html, baseURL);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .isPaused returns true if the internal link queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the internal link queue, but will not pause any active requests.
  • .resume() will resume the internal link queue.
  • .scan(html, baseURL) parses & scans a single HTML document and returns a Promise. Calling this function while a previous scan is in progress will result in a thrown error. Arguments:
    • html must be either a Stream or a string.
    • baseURL must be a URL. Without this value, links to relative URLs will be given a BLC_INVALID reason for being broken (unless an absolute <base href> is found).

Events

  • 'complete' is emitted after the last result or zero results.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after the HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> robot exclusions.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not). Arguments:
    • result is a Link.
  • 'queue' is emitted when a link is internally queued, dequeued or made active.
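
As a rough sketch of how these events might be consumed, the helper below gathers broken links as they arrive. trackBroken is a hypothetical name, not part of the library; it only assumes that the checker behaves like an EventEmitter and that each result exposes get('isBroken'), as in the examples under "Handling Broken/Excluded Links".

```javascript
// Hypothetical helper (not part of broken-link-checker): records the broken
// links emitted by any checker that behaves like an EventEmitter.
function trackBroken(checker) {
  const report = {broken: [], checked: 0, done: false};

  checker.on('link', (result) => {
    report.checked += 1;
    if (result.get('isBroken')) {
      report.broken.push(result);
    }
  });

  // HtmlChecker signals the end of a scan with 'complete'.
  checker.on('complete', () => {
    report.done = true;
  });

  return report;
}
```

With the real library this could be wired up as trackBroken(htmlChecker) before calling htmlChecker.scan(html, baseURL); the report object fills in as events fire.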

HtmlUrlChecker

Scans the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker(options)
  .on('error', (error) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('end', () => {});

htmlUrlChecker.enqueue(pageURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a page from the queue. Returns true on success or false on failure.
  • .enqueue(pageURL, customData) adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • pageURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the page.
  • .has(id) returns true if the queue contains an active or queued page tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link or page) is queued, dequeued or made active.
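
Because customData is passed back with every event, it can serve as a key for grouping results. The sketch below assumes each page was enqueued with a string label as its customData; tallyByLabel and the shape of its tallies are illustrative, not part of the library.

```javascript
// Hypothetical helper: tallies results per customData label. It assumes each
// page was enqueued with a string label as its customData.
function tallyByLabel(checker) {
  const tallies = new Map();

  const entryFor = (label) => {
    if (!tallies.has(label)) {
      tallies.set(label, {checked: 0, broken: 0, skipped: 0, pageError: null});
    }
    return tallies.get(label);
  };

  checker
    .on('link', (result, customData) => {
      const entry = entryFor(customData);
      entry.checked += 1;
      if (result.get('isBroken')) entry.broken += 1;
    })
    .on('junk', (result, customData) => {
      entryFor(customData).skipped += 1;
    })
    .on('page', (error, pageURL, customData) => {
      // error is null when the page's HTML was retrieved successfully.
      entryFor(customData).pageError = error;
    });

  return tallies;
}
```

A page would then be queued with its label, for example htmlUrlChecker.enqueue(pageURL, 'homepage'), and the tallies read once 'end' has fired.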

SiteChecker

Recursively scans (crawls) the HTML content at each queued URL to find broken links. All methods from EventEmitter are available.

const {SiteChecker} = require('broken-link-checker');

const siteChecker = new SiteChecker(options)
  .on('error', (error) => {})
  .on('robots', (robots, customData) => {})
  .on('html', (tree, robots, response, pageURL, customData) => {})
  .on('queue', () => {})
  .on('junk', (result, customData) => {})
  .on('link', (result, customData) => {})
  .on('page', (error, pageURL, customData) => {})
  .on('site', (error, siteURL, customData) => {})
  .on('end', () => {});

siteChecker.enqueue(siteURL, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a site from the queue. Returns true on success or false on failure.
  • .enqueue(siteURL, customData) adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:
    • siteURL must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the site.
  • .has(id) returns true if the queue contains an active or queued site tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numPages returns the total number of pages in the queue.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .numSites returns the total number of sites in the queue.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'html' is emitted after a page's HTML document has been fully parsed. Arguments:
    • tree is supplied by parse5.
    • robots is an instance of robot-directives containing any <meta> and X-Robots-Tag robot exclusions.
    • response is the full HTTP response for the page, excluding the body.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'junk' is emitted on each skipped/unchecked link, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'page' is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • pageURL is the URL to the current page being scanned.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL (link, page or site) is queued, dequeued or made active.
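  • 'robots' is emitted after a site's robots.txt has been downloaded. Arguments:
    • robots contains the exclusions parsed from the site's robots.txt.
    • customData is whatever was queued.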
  • 'site' is emitted after a site's last result, on zero results, or if the initial HTML could not be retrieved. Arguments:
    • error will be an Error if such occurred or null if not.
    • siteURL is the URL to the current site being crawled.
    • customData is whatever was queued.

Note: the filterLevel option is used for determining which links are recursive.
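
A long crawl can be halted early with the documented .pause() method and .isPaused property. The following is only a sketch: pauseAfterBroken and its limit parameter are hypothetical, and the threshold is whatever suits your use case.

```javascript
// Hypothetical helper: pauses a checker's queue once `limit` broken links
// have been seen. Requests that are already active are not interrupted.
function pauseAfterBroken(checker, limit) {
  let brokenCount = 0;

  checker.on('link', (result) => {
    if (!result.get('isBroken')) return;
    brokenCount += 1;
    if (brokenCount >= limit && !checker.isPaused) {
      checker.pause();
    }
  });

  // Returns a function reporting how many broken links were seen so far.
  return () => brokenCount;
}
```

Calling siteChecker.resume() afterwards would let the crawl continue from where it stopped.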

UrlChecker

Requests each queued URL to determine whether it is broken. All methods from EventEmitter are available.

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker(options)
  .on('error', (error) => {})
  .on('queue', () => {})
  .on('link', (result, customData) => {})
  .on('end', () => {});

urlChecker.enqueue(url, customData);

Methods & Properties

  • .clearCache() will remove any cached URL responses.
  • .dequeue(id) removes a URL from the queue. Returns true on success or false on failure.
  • .enqueue(url, customData) adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success. Arguments:
    • url must be a URL.
    • customData is optional data (of any type) that is stored in the queue item for the URL.
  • .has(id) returns true if the queue contains an active or queued URL tagged with id and false if not.
  • .isPaused returns true if the queue is paused and false if not.
  • .numActiveLinks returns the number of links with active requests.
  • .numQueuedLinks returns the number of links that currently have no active requests.
  • .pause() will pause the queue, but will not pause any active requests.
  • .resume() will resume the queue.

Events

  • 'end' is emitted when the end of the queue has been reached.
  • 'error' is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:
    • error is the Error.
  • 'junk' is emitted for each skipped/unchecked result, as configured in options. Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'link' is emitted for each checked/unskipped result (broken or not). Arguments:
    • result is a Link.
    • customData is whatever was queued.
  • 'queue' is emitted when a URL is queued, dequeued or made active.
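
Since .enqueue() returns a queue ID, it can be handy to remember which ID belongs to which URL. The helper below is a minimal sketch under that assumption; enqueueAll and the {source: 'batch'} customData are made up for illustration.

```javascript
// Hypothetical helper: enqueues several URLs and remembers which queue ID
// belongs to which URL, so that entries can later be passed to .has(id) or
// .dequeue(id).
function enqueueAll(checker, urls) {
  const idToUrl = new Map();
  for (const url of urls) {
    // The checker expects a URL, so each string is parsed first.
    const id = checker.enqueue(new URL(url), {source: 'batch'});
    idToUrl.set(id, url);
  }
  return idToUrl;
}
```

Note that new URL() throws on an unparsable string, so invalid input fails before anything is queued for that entry.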

Options

cacheMaxAge

Type: Number
Default value: 3_600_000 (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the cacheResponses option is enabled.

cacheResponses

Type: Boolean
Default value: true
URL request results will be cached when true. This will ensure that each unique URL will only be checked once.

excludedKeywords

Type: Array<String>
Default value: []
Will not check links that match the keywords and glob patterns within this list. The only wildcards supported are * and !.

This option does not apply to UrlChecker.

excludeExternalLinks

Type: Boolean
Default value: false
Will not check external links (different protocol and/or host) when true; relative links with a remote <base href> included.

This option does not apply to UrlChecker.

excludeInternalLinks

Type: Boolean
Default value: false
Will not check internal links (same protocol and host) when true.

This option does not apply to UrlChecker or to SiteChecker's crawler.

excludeLinksToSamePage

Type: Boolean
Default value: false
Will not check links to the same page; relative and absolute fragments/hashes included. This is only relevant if the cacheResponses option is disabled.

This option does not apply to UrlChecker.

filterLevel

Type: Number
Default value: 1
The tags and attributes that are considered links for checking, split into the following levels:

  • 0: clickable links
  • 1: clickable links, media, frames, meta refreshes
  • 2: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms
  • 3: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms, metadata

Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the tag map. <base href> is not listed because it is not a link, though it is always parsed.

This option does not apply to UrlChecker.

honorRobotExclusions

Type: Boolean
Default value: true
Will not scan pages that search engine crawlers would not follow. Such exclusions are specified with any of the following:

  • <a rel="nofollow" href="…">
  • <area rel="nofollow" href="…">
  • <meta name="robots" content="noindex,nofollow,…">
  • <meta name="googlebot" content="noindex,nofollow,…">
  • <meta name="robots" content="unavailable_after: …">
  • X-Robots-Tag: noindex,nofollow,…
  • X-Robots-Tag: googlebot: noindex,nofollow,…
  • X-Robots-Tag: otherbot: noindex,nofollow,…
  • X-Robots-Tag: unavailable_after: …
  • robots.txt

This option does not apply to UrlChecker.

includedKeywords

Type: Array<String>
Default value: []
Will only check links that match the keywords and glob patterns within this list, if any. The only wildcard supported is *.

This option does not apply to UrlChecker.

includeLink

Type: Function
Default value: link => true
A synchronous callback that is called after all other filters have been performed. Return true to include link (a Link) in the list of links to be checked, or return false to have it skipped.

This option does not apply to UrlChecker.

includePage

Type: Function
Default value: url => true
A synchronous callback that is called after all other filters have been performed. Return true to include url (a URL) in the list of pages to be crawled, or return false to have it skipped.

This option does not apply to UrlChecker or HtmlUrlChecker.
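
As an illustration of includePage, the options object below keeps a crawl out of one section of a site. The '/archive/' prefix and the crawlOptions name are arbitrary examples; the callback relies only on url being a standard URL object.

```javascript
// A sketch of an options object using the includePage callback.
// The '/archive/' prefix is an arbitrary example path.
const crawlOptions = {
  // Return false to skip crawling a page, true to crawl it.
  includePage: (url) => !url.pathname.startsWith('/archive/')
};
```

Such an object would be passed to the constructor, as in new SiteChecker(crawlOptions).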

maxSockets

Type: Number
Default value: Infinity
The maximum number of links to check at any given time.

maxSocketsPerHost

Type: Number
Default value: 2
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.

rateLimit

Type: Number
Default value: 0
The number of milliseconds to wait before each request.
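
The three throttling options can be combined when a target server needs gentle treatment. The values below are arbitrary and only meant to show the shape of the configuration; politeOptions is a hypothetical name.

```javascript
// Example values only; tune them for the servers being checked.
const politeOptions = {
  maxSockets: 4,         // at most 4 links checked at once, across all hosts
  maxSocketsPerHost: 1,  // one link at a time per host/port
  rateLimit: 500         // wait 500 ms before each request
};
```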

requestMethod

Type: String
Default value: 'head'
The HTTP request method used in checking links. If you experience problems, try using 'get'; however, the retryHeadFail option should have you covered.

retryHeadCodes

Type: Array<Number>
Default value: [405]
The list of HTTP status codes for the retryHeadFail option to reference.

retryHeadFail

Type: Boolean
Default value: true
Some servers do not respond correctly to a 'head' request method. When true, a link resulting in an HTTP status code listed within the retryHeadCodes option will be re-requested using a 'get' method before deciding that it is broken. This is only relevant if the requestMethod option is set to 'head'.
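
The interplay of requestMethod, retryHeadFail and retryHeadCodes can be summarized as a single condition. The function below merely restates the rule described above; it is not the library's actual code, and shouldRetryWithGet is a hypothetical name.

```javascript
// Illustration of the documented rule; NOT the library's implementation.
function shouldRetryWithGet(options, statusCode) {
  return options.requestMethod === 'head'
    && options.retryHeadFail === true
    && options.retryHeadCodes.includes(statusCode);
}

// The documented default values for the three options involved.
const retryDefaults = {
  requestMethod: 'head',
  retryHeadFail: true,
  retryHeadCodes: [405]
};
```

With these defaults, a 405 response to a 'head' request would lead to a second attempt using 'get', while a 404 would be reported as broken straight away.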

userAgent

Type: String
Default value: 'broken-link-checker/0.8.0 Node.js/14.16.0 (OS X; x64)' (or similar)
The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.

Handling Broken/Excluded Links

A broken link will have an isBroken value of true and a reason code defined in brokenReason. A link that was not checked (emitted as 'junk') will have a wasExcluded value of true, a reason code defined in excludedReason and an isBroken value of null.

if (link.get('isBroken')) {
  console.log(link.get('brokenReason'));
  //-> HTTP_406
} else if (link.get('wasExcluded')) {
  console.log(link.get('excludedReason'));
  //-> BLC_ROBOTS
}

Additionally, more descriptive messages are available for each reason code:

const {reasons} = require('broken-link-checker');

console.log(reasons.BLC_ROBOTS);       //-> Robots exclusion
console.log(reasons.ERRNO_ECONNRESET); //-> connection reset by peer (ECONNRESET)
console.log(reasons.HTTP_404);         //-> Not Found (404)

// List all
console.log(reasons);

Putting it all together:

if (link.get('isBroken')) {
  console.log(reasons[link.get('brokenReason')]);
} else if (link.get('wasExcluded')) {
  console.log(reasons[link.get('excludedReason')]);
}

Finally, it is important to analyze links excluded with the BLC_UNSUPPORTED reason as it's possible for them to be broken.
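
The checks above can be folded into one reporting helper. This is a sketch only: describeLink and its output format are hypothetical, and it assumes nothing beyond the get() keys and reason codes shown in this section. Links excluded as BLC_UNSUPPORTED are flagged for manual review.

```javascript
// Hypothetical helper: turns a link result into a one-line status string.
// `reasons` is expected to map reason codes to descriptive messages.
function describeLink(link, reasons) {
  if (link.get('isBroken')) {
    const code = link.get('brokenReason');
    // Fall back to the raw code if no message is known for it.
    return `BROKEN: ${reasons[code] || code}`;
  }
  if (link.get('wasExcluded')) {
    const code = link.get('excludedReason');
    const note = code === 'BLC_UNSUPPORTED' ? ' (verify manually)' : '';
    return `SKIPPED: ${reasons[code] || code}${note}`;
  }
  return 'OK';
}
```

It could be called from both the 'link' and 'junk' handlers, passing the reasons object exported by the library.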

Roadmap Features

  • 'info' event with messaging such as 'Site does not support HTTP HEAD method' (regarding retryHeadFail option)
  • add cheerio support by using parse5's htmlparser2 tree adaptor?
  • load sitemap.xml at start of each SiteChecker site (since cache can expire) to possibly check pages that were not linked to, removing from list as discovered links are checked
  • change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200
  • abort download of body when options.retryHeadFail===true
  • option to retry broken links a number of times (default=0)
  • option to scrape response.body for erroneous sounding text (using fathom?), since an error page could be presented but still have code 200
  • option to detect parked domain (302 with no redirect?)
  • option to check broken link on archive.org for archived version (using this lib)
  • option to run HtmlUrlChecker checks on page load (using jsdom) to include links added with JavaScript?
  • option to check if hashes exist in target URL document?
  • option to parse Markdown in HtmlChecker for links
  • option to check plain text URLs
  • add throttle profiles (0–9, -1 for "custom") for easy configuring
  • check ftp:, sftp: (for downloadable files)
  • check mailto:, news:, nntp:, telnet:?
  • check that data URLs are valid (with valid-data-url)?
  • supply CORS error for file:// links on sites with a different protocol
  • create an example with http://astexplorer.net
  • use debug
  • use bunyan with JSON output for CLI
  • store request object/headers (or just auth) in Link?
  • supply basic auth for "page" events?
  • add option for URLCache normalization profiles
