  • Stars: 2,250
  • Rank 19,752 (Top 0.4 %)
  • Language: Ruby
  • License: MIT License
  • Created almost 12 years ago
  • Updated 9 months ago

Repository Details

HTML processing filters and utilities

HTML-Pipeline

Note: This README refers to the behavior in the new 3.0.0.pre gem.

HTML processing filters and utilities. This module is a small framework for defining CSS-based content filters and applying them to user provided content.

Although this project was started at GitHub, they no longer use it. This gem must be considered standalone and independent from GitHub.

Installation

Add this line to your application's Gemfile:

gem 'html-pipeline'

And then execute:

$ bundle

Or install it by yourself as:

$ gem install html-pipeline

Usage

This library provides a handful of chainable HTML filters to transform user content into HTML markup. Each filter does some work, and then hands off the results to the next filter. A pipeline has several kinds of filters available to use:

  • Multiple TextFilters, which operate on a UTF-8 string
  • A ConvertFilter filter, which turns text into HTML (e.g., CommonMark/AsciiDoc -> HTML)
  • A SanitizationFilter, which removes dangerous/unwanted HTML elements and attributes
  • Multiple NodeFilters, which operate on a UTF-8 HTML document

You can assemble each sequence into a single pipeline, or choose to call each filter individually.

As an example, suppose we want to transform CommonMark source text into HTML. With the content, we also want to:

  • change every instance of $NAME to "Johnny"
  • strip undesired HTML
  • linkify @mentions

We can construct a pipeline to do all that like this:

require 'html_pipeline'

# A custom text filter: works on the raw text and returns the modified string.
class HelloJohnnyFilter < HTMLPipeline::TextFilter
  def call
    text.gsub("$NAME", "Johnny")
  end
end

pipeline = HTMLPipeline.new(
  text_filters: [HelloJohnnyFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  # note: next line is not needed as sanitization occurs by default;
  # see below for more info
  sanitization_config: HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text) # recommended: can call pipeline over and over

Filters can be custom ones you create (like HelloJohnnyFilter), and HTMLPipeline additionally provides several helpful ones (detailed below). If you only need a single filter, you can call one individually, too:

filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new(text)
filter.call

Filters combine into a sequential pipeline, and each filter hands its output to the next filter's input. Text filters are processed first, then the convert filter, sanitization filter, and finally, the node filters.
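
The stage ordering described above can be sketched in plain Ruby. This is only a toy model: the lambdas below are hypothetical stand-ins, not html-pipeline classes, and the regex-based "sanitizer" exists purely to make the order of operations visible (real sanitization should never be done with regexes).

```ruby
# Toy stand-ins for the four stages; none of these are html-pipeline classes.
TEXT_FILTERS = [
  ->(text) { text.gsub("$NAME", "Johnny") }
].freeze

CONVERT_FILTER = ->(text) { "<p>#{text}</p>" }

# Illustration only: a real sanitizer must not rely on regexes.
SANITIZER = ->(html) { html.gsub(%r{<script.*?</script>}m, "") }

NODE_FILTERS = [
  lambda do |html|
    html.gsub(/@(\w+)/) do
      name = Regexp.last_match(1)
      %(<a href="/#{name}">@#{name}</a>)
    end
  end
].freeze

# Text filters run first, then conversion, sanitization, and node filters.
def run_stages(input)
  text = TEXT_FILTERS.reduce(input) { |acc, filter| filter.call(acc) }
  html = CONVERT_FILTER.call(text)
  html = SANITIZER.call(html)
  NODE_FILTERS.reduce(html) { |acc, filter| filter.call(acc) }
end

puts run_stages("Hi $NAME, ping @kit<script>alert(1)</script>")
```

Because sanitization runs before the node filters, the link added by the last stage is not subject to the sanitizer in this model.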

Some filters take optional context and/or result hash(es). These are used to pass around arguments and metadata between filters in a pipeline. For example, if you want to disable footnotes in the MarkdownFilter, you can pass an option in the context hash:

context = { markdown: { extensions: { footnotes: false } } }
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new("Hi **world**!", context: context)
filter.call

Please refer to the documentation for each filter to understand what configuration options are available.
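
If you write a filter of your own, one way to read nested options out of a context hash is to merge them over a set of defaults. The sketch below is plain Ruby; DEFAULT_OPTIONS and extension_options are hypothetical names chosen for illustration and do not describe how MarkdownFilter itself reads its context.

```ruby
# Hypothetical defaults for an imaginary custom filter; not taken from html-pipeline.
DEFAULT_OPTIONS = { footnotes: true, tables: true }.freeze

# Merge caller-supplied extension options over the defaults.
def extension_options(context)
  overrides = context.dig(:markdown, :extensions) || {}
  DEFAULT_OPTIONS.merge(overrides)
end

context = { markdown: { extensions: { footnotes: false } } }
p extension_options(context) # footnotes disabled, tables left at the default
p extension_options({})      # no overrides, so the defaults apply
```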

More Examples

Different pipelines can be defined for different parts of an app. Here are a few paraphrased snippets to get you started:

# The context hash is how you pass options between different filters.
# See individual filter source for explanation of options.
context = {
  asset_root: "http://your-domain.com/where/your/images/live/icons",
  base_url: "http://your-domain.com"
}

# Pipeline used for user provided content on the web.
# ImageMaxWidthFilter is listed among the NodeFilters, so it goes in node_filters.
MarkdownPipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [
    HTMLPipeline::NodeFilter::ImageMaxWidthFilter.new,
    HTMLPipeline::NodeFilter::HttpsFilter.new,
    HTMLPipeline::NodeFilter::MentionFilter.new
  ],
  context: context
)

# Pipelines aren't limited to the web. You can use them for email
# processing also.
HtmlEmailPipeline = HTMLPipeline.new(
  text_filters: [HTMLPipeline::TextFilter::PlainTextInputFilter.new],
  node_filters: [HTMLPipeline::NodeFilter::ImageMaxWidthFilter.new],
  context: {}
)

Filters

TextFilters

TextFilters must define a method named call which is called on the text. @text, @config, and @result are available to use, and any changes made to these ivars are passed on to the next filter.

  • ImageFilter - converts an image URL into an <img> tag
  • PlainTextInputFilter - HTML-escapes text and wraps the result in a <div>
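
To make the "escape, then wrap" idea concrete, here is a minimal plain-Ruby sketch. It only approximates the behavior described for PlainTextInputFilter and is not the gem's implementation; wrap_plain_text is a hypothetical helper name.

```ruby
require "erb"

# Approximates the described behavior: HTML-escape the text, then wrap it in a <div>.
def wrap_plain_text(text)
  "<div>#{ERB::Util.html_escape(text)}</div>"
end

puts wrap_plain_text("1 < 2 & <b>bold</b>")
```

Escaping first means any markup in the input is displayed as text, and the wrapping div gives later filters a root element to work with.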

ConvertFilter

The ConvertFilter takes text and turns it into HTML. @text, @config, and @result are available to use. ConvertFilter must define a method named call, taking one argument, text. call must return a string representing the new HTML document.

  • MarkdownFilter - creates HTML from text using Commonmarker

Sanitization

Because the web can be a scary place, HTML is automatically sanitized after the ConvertFilter runs and before the NodeFilters are processed. This is to prevent malicious or unexpected input from entering the pipeline.

The sanitization process takes a hash configuration of settings. See the Selma documentation for more information on how to configure these settings.

A default sanitization config is provided by this library (HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG). A sample custom sanitization allowlist might look like this:

ALLOWLIST = {
  elements: ["p", "pre", "code"]
}

pipeline = HTMLPipeline.new \
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: ALLOWLIST

result = pipeline.call <<-CODE
This is *great*:

    some_code(:first)

CODE
result[:output].to_s

This would print:

<p>This is great:</p>
<pre><code>some_code(:first)
</code></pre>

Sanitization can be disabled if and only if nil is explicitly passed as the config:

pipeline = HTMLPipeline.new \
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: nil

For more examples of customizing the sanitization process to include the tags you want, check out the tests and the FAQ.

NodeFilters

NodeFilters can operate either on HTML elements or text nodes using CSS selectors. Each NodeFilter must define a method named selector which provides an instance of Selma::Selector. If elements are being manipulated, handle_element must be defined, taking one argument, element; if text nodes are being manipulated, handle_text_chunk must be defined, taking one argument, text_chunk. @config and @result are available to use, and any changes made to these ivars are passed on to the next filter.

NodeFilter also has an optional method, after_initialize, which is run after the filter initializes. This can be useful in setting up a custom state for result to take advantage of.

Here's an example NodeFilter that adds a base url to images that are root relative:

require 'uri'

class RootRelativeFilter < HTMLPipeline::NodeFilter

  SELECTOR = Selma::Selector.new(match_element: "img")

  def selector
    SELECTOR
  end

  def handle_element(img)
    return if img['src'].nil? # `next` is not valid in a method body
    src = img['src'].strip
    if src.start_with? '/'
      img["src"] = URI.join(context[:base_url], src).to_s
    end
  end
end
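
The URL handling in that filter can be tried on its own with the standard library. The sketch below isolates the same steps in a hypothetical helper, absolutize, which is not part of html-pipeline; cdn.example.com is a made-up host used only for illustration.

```ruby
require "uri"

# Returns an absolute URL for root-relative paths and leaves everything else alone.
def absolutize(src, base_url)
  return src if src.nil?

  stripped = src.strip
  return stripped unless stripped.start_with?("/")

  URI.join(base_url, stripped).to_s
end

puts absolutize(" /images/logo.png ", "http://your-domain.com")
puts absolutize("https://cdn.example.com/a.png", "http://your-domain.com")
```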

For more information on how to write effective NodeFilters, refer to the provided filters listed here and to the documentation for the underlying library, Selma.

  • AbsoluteSourceFilter: replace relative image urls with fully qualified versions
  • EmojiFilter: converts :<emoji>: to emoji
    • (Note: the included MarkdownFilter will already convert emoji)
  • HttpsFilter: replaces http URLs with https versions
  • ImageMaxWidthFilter: link to full size image for large images
  • MentionFilter: replace @user mentions with links
  • SanitizationFilter: sanitizes user markup against an allowlist
  • SyntaxHighlightFilter: applies syntax highlighting to pre blocks
    • (Note: the included MarkdownFilter will already apply highlighting)
  • TableOfContentsFilter: anchor headings with name attributes and generate Table of Contents html unordered list linking headings
  • TeamMentionFilter: replace @org/team mentions with links

Dependencies

Since filters can be customized to your heart's content, gem dependencies are not bundled; this project doesn't know which of the default filters you might use, and as such, you must bundle each filter's gem dependencies yourself.

For example, SyntaxHighlightFilter uses rouge to detect and highlight languages; to use the SyntaxHighlightFilter, you must add the following to your Gemfile:

gem "rouge"

Note: See the Gemfile :test group for any version requirements.

When developing a custom filter, call HTMLPipeline.require_dependency at the start to ensure that the local machine has the necessary dependency. You can also use HTMLPipeline.require_dependencies to provide a list of dependencies to check.
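
The general pattern behind such a dependency check can be sketched in plain Ruby. The helper below, require_optional, is a hypothetical stand-in written for illustration; it is not the gem's implementation, and the actual signature of HTMLPipeline.require_dependency should be taken from the reference documentation.

```ruby
# Hypothetical helper, not the gem's implementation: require a library and
# raise a more helpful error if it is missing.
def require_optional(name, requirer)
  require name
  true
rescue LoadError => e
  raise LoadError, "#{requirer} needs '#{name}'; add `gem \"#{name}\"` to your Gemfile (#{e.message})"
end

require_optional("json", "MyJsonFilter") # ships with Ruby, so this succeeds

begin
  require_optional("surely_not_installed_gem", "MyFilter")
rescue LoadError => e
  puts e.message
end
```

Failing at load time with the name of the filter that needs the gem is usually easier to debug than a NameError raised later, in the middle of a pipeline run.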

Documentation

Full reference documentation can be found here.

Instrumenting

Filters and Pipelines can be set up to be instrumented when called. The pipeline must be set up with an ActiveSupport::Notifications-compatible service object and a name. New pipeline objects will default to the HTMLPipeline.default_instrumentation_service object.

# the AS::Notifications-compatible service object
service = ActiveSupport::Notifications

# instrument a specific pipeline
pipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  context: context
)
pipeline.setup_instrumentation "MarkdownPipeline", service

# or set default instrumentation service for all new pipelines
HTMLPipeline.default_instrumentation_service = service
pipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  context: context
)
pipeline.setup_instrumentation "MarkdownPipeline"

Filters are instrumented when they are run through the pipeline. A call_filter.html_pipeline event is published once any filter finishes; call_text_filters and call_node_filters are published when all of the text and node filters are finished, respectively. The payload should include the filter name. Each filter will trigger its own instrumentation call.

service.subscribe "call_filter.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filter] #=> "MarkdownFilter"
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end

The full pipeline is also instrumented:

service.subscribe "call_text_filters.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filters] #=> ["MarkdownFilter"]
  payload[:doc] #=> HTML String
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end
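
If ActiveSupport is not available, a small stand-in can be handy for experimenting with the subscriber signature shown above. The class below is a hypothetical sketch, not part of html-pipeline or ActiveSupport. It assumes a compatible service only needs an instrument method that takes an event name, a payload, and a block; subscribe is included so the example produces observable output.

```ruby
# A minimal stand-in for ActiveSupport::Notifications, for experiments only.
class TinyInstrumentationService
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Registers a block to be called after each matching event.
  def subscribe(event_name, &block)
    @subscribers[event_name] << block
  end

  # Runs the block, then notifies subscribers with timing and payload.
  def instrument(event_name, payload = {})
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = block_given? ? yield(payload) : nil
    finished = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @subscribers[event_name].each do |subscriber|
      subscriber.call(event_name, started, finished, object_id.to_s, payload)
    end
    result
  end
end

service = TinyInstrumentationService.new
service.subscribe("call_filter.html_pipeline") do |_event, start, ending, _id, payload|
  puts "#{payload[:filter]} took #{((ending - start) * 1000).round(3)}ms"
end

service.instrument("call_filter.html_pipeline", filter: "MarkdownFilter") do |payload|
  payload[:result] = { output: "<p>Hi</p>" }
end
```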

Third Party Extensions

If you have an idea for a filter, propose it as an issue first. This allows us to discuss whether the filter is a common enough use case to belong in this gem, or should be built as an external gem.

Here are some extensions people have built:

FAQ

1. Why doesn't my pipeline work when there's no root element in the document?

To make a pipeline work on a plain text document, put the PlainTextInputFilter at the end of your text_filters config. This will wrap the content in a div so the filters have a root element to work with. If you're passing in an HTML fragment, but it doesn't have a root element, you can wrap the content in a div yourself.

2. How do I customize an allowlist for SanitizationFilters?

HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG is the default allowlist used if no sanitization_config argument is given. The default is a good starting template for you to add additional elements. You can either modify the constant's value, or define your own config and pass that in, such as:

config = HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG.dup
# build a new array so the default's nested list is left untouched
config[:elements] = config[:elements] + ["iframe"] # sure, whatever you want
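
One thing to keep in mind when copying a config: dup copies only the top-level hash, so nested arrays are still shared with the original (and cannot be appended to if they are frozen). The sketch below uses a hypothetical stand-in constant, STAND_IN_DEFAULT, rather than the real DEFAULT_CONFIG, whose contents and frozen state may differ.

```ruby
# Stand-in for a frozen default config; the real DEFAULT_CONFIG may differ.
STAND_IN_DEFAULT = {
  elements: ["p", "pre", "code"].freeze
}.freeze

# Build a new hash with a new elements array instead of appending in place.
custom = STAND_IN_DEFAULT.merge(
  elements: STAND_IN_DEFAULT[:elements] + ["iframe"]
)

p custom[:elements]
p STAND_IN_DEFAULT[:elements]
```

The resulting hash can then be passed as sanitization_config without touching the default.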

Contributors

Thanks to all of these contributors.

This project is a member of the OSS Manifesto.

More Repositories

1

html-proofer

Test your rendered HTML files to make sure they're accurate.
Ruby
1,550
star
2

markdowntutorial.com

Lessons to help guide new writers into Markdown!
HTML
515
star
3

commonmarker

Ruby wrapper for the comrak (CommonMark parser) Rust crate
Ruby
404
star
4

Earthbound-Battle-Backgrounds-JS

A JavaScript project that generates all the Earthbound/Mother 2 backgrounds.
JavaScript
338
star
5

jekyll-last-modified-at

A Jekyll plugin to show the last_modified_at time of a post.
Ruby
219
star
6

isBinaryFile

Detects if a file is binary in Node.js. Similar to Perl's -B
TypeScript
161
star
7

mathematical

Convert mathematical equations to SVGs, PNGs, or MathML. A general wrapper to Lasem and mtex2MML.
Ruby
155
star
8

Shelves

An Android application that manages your collection of apparel, board games, books, comics, gadgets, movies, music, software, tools, toys, and video games.
Java
110
star
9

biscotto

UNMAINTAINED. CoffeeScript API documentation tool that uses TomDoc notation.
CoffeeScript
105
star
10

addalicense.com

DEPRECATED: Add a license to your public GitHub repositories
JavaScript
85
star
11

jekyll-time-to-read

A liquid tag for Jekyll to indicate the time it takes to read an article.
Ruby
72
star
12

nak

ack and ag inspired tool written in Node. Designed to be fast.
JavaScript
69
star
13

tailwind_merge

Utility function to efficiently merge Tailwind CSS classes without style conflicts.
Ruby
56
star
14

extended-markdown-filter

Some additional Markdown formatting, for use in HTML::Pipeline
Ruby
54
star
15

jekyll-html-pipeline

Use GitHub's HTML::Pipeline, in Jekyll!
Ruby
51
star
16

repository-sync

Simple Sinatra server to keep two repositories in sync (not like the band)
Ruby
45
star
17

selma

Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html parser.
Rust
43
star
18

ooo-maker

Modify your GitHub avatar to let people know you're away!
JavaScript
41
star
19

Earthbound-Battle-Backgrounds

This is a live wallpaper for Android 2.1+ that shows the battle background animations from Earthbound (Mother 2).
Java
38
star
20

roaster

Turns a raw and crunchy Markdown file into nice and smooth output
CoffeeScript
30
star
21

panda-docs

Pretty Awesome (and Necessary) Documentation Assembly--A total documentation build system for technical writers, and those who want to be like them.
JavaScript
29
star
22

no-more-masters

Rename your default Git branch from master to production
JavaScript
26
star
23

graphql-idl-parser

A parser for the GraphQL IDL format.
Rust
24
star
24

panino-docs

API documentation generation tool with an emphasis on JSDoc-style comment parsing
JavaScript
22
star
25

mtex2MML

A Bison grammar to convert TeX math into MathML.
C
21
star
26

publisher

Publishes your non-Jekyll content in `master` directly to `gh-pages`.
Ruby
19
star
27

ColoredLogcatPlusPlus

An extension of Jeff Sharkey's excellent ColoredLogcat terminal hack for Android development. Supports colors and filtering!
Python
17
star
28

jekyll-config-variables

A Jekyll monkey-patch to allow you to use variables within your _config.yml file.
Ruby
15
star
29

jekyll-conrefifier

Allows you to use Liquid variables in various places in Jekyll
Ruby
14
star
30

what_you_say

Natural language detection library. Written in Rust, wrapped in Ruby.
Ruby
13
star
31

namp

A fork of chjj's marked that adds some additional features
JavaScript
12
star
32

heroku_chat_sample

Go
11
star
33

NotoColorEmoji-png

A conversion of Android KitKat's NotoColorEmoji.ttf into PNG images
11
star
34

nothingherebut.me

A vanishing story.
JavaScript
10
star
35

color-proximity

Match the threshold of a color against a collection of colors.
Ruby
9
star
36

mathematical-node

A Node.js port of Mathematical
HTML
8
star
37

pointillist

A Ruby library to convert Atom's stylesheets into Pygments-compatible HTML
C
8
star
38

robotstxt-parser

Another fork of the robotstxt gem
Ruby
8
star
39

functional-docs

A documentation test suite for HTML files.
JavaScript
7
star
40

jekyll-geo-pattern

A liquid tag for Jekyll to generate an SVG/Base64 geo pattern
Ruby
6
star
41

function-extractor

Extracts all the functions from a Javascript file into an array of objects.
JavaScript
6
star
42

graphql-idl-parser-ruby

A parser for the GraphQL IDL format.
Ruby
6
star
43

branta

Search, for GitHub Pages.
Ruby
6
star
44

markdown_conrefs_js

Support for content references (conrefs) in Markdown (for Javascript).
JavaScript
6
star
45

notext.news

The news without the words
JavaScript
5
star
46

destroy-all-monuments

This is data taken from the SPLC report titled "Whose Heritage? Public Symbols of the Confederacy" from April 21, 2016
Ruby
5
star
47

wireless-snes

Arduino sketches to intercept controller data from one SNES console and send it to another.
C++
5
star
48

jekyll-jsminify

A very simple way to minify your JavaScript and CoffeeScript content in Jekyll.
Ruby
4
star
49

task-lists-js

An implementation of the basic task list logic (in CoffeeScript)
HTML
3
star
50

jekyll-toc-helpers

Some helper tags for generating TOCs.
Ruby
3
star
51

nanoc-redirector

A redirection extension for Nanoc
Ruby
3
star
52

past.codes

Remember the repositories you starred on GitHub.
Ruby
3
star
53

nanoc-conref-fs

A Nanoc filesystem to permit using conrefs/reusables in your content.
Ruby
3
star
54

dotfiles-old

My dotfiles.
Shell
3
star
55

heroicons_helper

Heroicons port for Ruby
Ruby
3
star
56

sffsoccer2ical

Converts the SFF soccer schedule to an ics file
Ruby
3
star
57

documentation-renderer

Documentation tools for Atom and Chrome
CoffeeScript
2
star
58

Shmup-for-Android

An Android port of an iOS Shmup.
D
2
star
59

ecma-re-validator

Validate a regular expression against what ECMA-262 (JavaScript) can actually do.
Ruby
2
star
60

page-toc-filter

A filter for the HTML::Pipeline to generate a page's table of contents.
Ruby
2
star
61

mathematical-rs

Convert MathML into SVG.
Rust
2
star
62

april-24-2015

Ruby
2
star
63

helpmewith.money

HTML
2
star
64

shale

JavaScript
2
star
65

math-to-itex

Parse a string and convert math equations to the itex notation.
Ruby
2
star
66

slack-pokemon-emoji

This is a tool to help you generate Pokémon for your Slack team.
Ruby
2
star
67

changecase_rb

Demo showcasing wrapping Rust in Ruby (RubyConf 2023)
Ruby
2
star
68

kraken

Earthbound's battle backgrounds (as a screensaver)
C#
2
star
69

ADC-Zipcode-Sorter

Takes a CSV file, then sorts the zip codes in that file according to the USPS ADC rules
PHP
2
star
70

sfdc_grpc_api

JavaScript
1
star
71

who-owes-what

TypeScript
1
star
72

colorpicker

A native colorpicker, for Atom.
CSS
1
star
73

jekyll-collection-multiplier

Ruby
1
star
74

fake-pages-site

1
star
75

crud-test

A repository for people to ogle over.
1
star
76

rss-to-tweet

Ruby
1
star
77

tabulalingua

A table. For languages.
JavaScript
1
star
78

atweetforeveryoccasion.com

Tweets are considered official presidential statements; let's contrast 45's communication against his hypocritical actions.
CSS
1
star
79

ace

Ace (Ajax.org Cloud9 Editor)
JavaScript
1
star
80

escapist

Extremely minimal HTML/`href` escaping/unescaping. Emphasis on minimal.
Rust
1
star
81

tweetstorm

CLI tool to make a tweetstorm, with replies to yourself.
Ruby
1
star
82

graffito

JavaScript
1
star
83

YALC

Yet another link checker. This one only checks local files.
Perl
1
star
84

render-html-from-ast

Internal API renderer used by panino and cannolo
JavaScript
1
star
85

jekyll-github-markup

Ruby
1
star
86

jekyll-mathematical

A Jekyll plugin that wraps around Mathematical
Ruby
1
star
87

bartleby

JavaScript
1
star
88

focus_concentrate

Get close. Stay centered. Don't move.
HTML
1
star
89

heroku-buildpack-pango

Heroku buildpack with Pango
Shell
1
star
90

metrocarddump

This program will dump all of your EasyPay MTA rides into a JSON file.
Go
1
star
91

token_checksum

Generate a 30 character long random token, with a prefix and a 32-bit checksum in the last 6 digits.
Ruby
1
star