• Stars
    46
  • Rank 613,923 (Top 13 %)
  • Language
    Rust
  • License
    MIT License
  • Created about 2 years ago
  • Updated 6 months ago

Repository Details

Selma selects and matches HTML nodes using CSS rules. Backed by Rust's lol_html parser.

Selma

Selma selects and matches HTML nodes using CSS rules. (It can also reject/delete nodes, but then the name isn't as cool.) It's mostly an idiomatic wrapper around Cloudflare's lol-html project.

Principal Skinner asking Selma after their date: 'Isn't it nice we hate the same things?'

Selma's strength (aside from being backed by Rust) is that HTML content is parsed once and can be manipulated multiple times.

Installation

Add this line to your application's Gemfile:

gem 'selma'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install selma

Usage

Selma can perform two different actions, either independently or together:

  • Sanitize HTML, through a Sanitize-like allowlist syntax; and
  • Select HTML using CSS rules, and manipulate elements and text nodes along the way.

It does this through two kwargs: sanitizer and handlers. The basic API for Selma looks like this:

sanitizer_config = {
   elements: ["b", "em", "i", "strong", "u"],
}
sanitizer = Selma::Sanitizer.new(sanitizer_config)
rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
# removes any element that is not  ["b", "em", "i", "strong", "u"];
# then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
rewriter.rewrite(html)

Here's a look at each individual part.

Sanitization config

Selma sanitizes by default. That is, even if the sanitizer kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass nil:

Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised

The sanitization process is configured through the following key-value allowlist hash:

# Whether or not to allow HTML comments.
allow_comments: false,

# Whether or not to allow well-formed HTML doctype declarations such as
# "<!DOCTYPE html>" when sanitizing a document.
allow_doctype: false,

# HTML elements to allow. By default, no elements are allowed (which means
# that all HTML will be stripped).
elements: ["a", "b", "img", ],

# HTML attributes to allow in specific elements. The key is the name of the element,
# and the value is an array of allowed attributes. By default, no attributes
# are allowed.
attributes: {
    "a" => ["href"],
    "img" => ["src"],
},

# URL handling protocols to allow in specific attributes. By default, no
# protocols are allowed. Use :relative in place of a protocol if you want
# to allow relative URLs sans protocol.
protocols: {
    "a" => { "href" => ["http", "https", "mailto", :relative] },
    "img" => { "src" => ["http", "https"] },
},

# An Array of element names whose contents will be removed. The contents
# of all other filtered elements will be left behind.
remove_contents: ["iframe", "math", "noembed", "noframes", "noscript"],

# Elements which, when removed, should have their contents surrounded by
# whitespace.
whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
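Put together, the options above form a single hash. The following is a minimal sketch rather than a recommended policy: the trimmed-down lists are illustrative, and the final consistency check is a hypothetical convenience, not something Selma provides. It keys the img protocol rule by src, on the assumption that protocols are matched against the attribute that actually carries the URL.

```ruby
# An allowlist assembled from the options described above.
sanitizer_config = {
  allow_comments: false,
  allow_doctype: false,
  elements: ["a", "b", "img"],
  attributes: {
    "a" => ["href"],
    "img" => ["src"],
  },
  protocols: {
    "a" => { "href" => ["http", "https", "mailto", :relative] },
    "img" => { "src" => ["http", "https"] },
  },
  remove_contents: ["iframe", "noscript"],
  whitespace_elements: ["blockquote", "h1", "h2"],
}

# Hypothetical sanity check (not part of Selma): attribute rules for an
# element that is not in :elements can never take effect.
unlisted = sanitizer_config[:attributes].keys - sanitizer_config[:elements]
warn "attribute rules for disallowed elements: #{unlisted}" unless unlisted.empty?

# With the gem loaded (require "selma"), the hash is used as shown earlier:
#   sanitizer = Selma::Sanitizer.new(sanitizer_config)
#   Selma::Rewriter.new(sanitizer: sanitizer).rewrite(html)
```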

Defining handlers

The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:

  • selector, a method which MUST return an instance of Selma::Selector that defines the CSS rules to match
  • handle_element, a method that's called on each matched element
  • handle_text_chunk, a method that's called on each matched text node

Here's an example which rewrites the href attribute on a and the src attribute on img to be https rather than http.

class MatchAttribute
  SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))

  # Every handler must expose its selector.
  def selector
    SELECTOR
  end

  def handle_element(element)
    if element.tag_name == "a"
      element["href"] = rename_http(element["href"])
    elsif element.tag_name == "img"
      element["src"] = rename_http(element["src"])
    end
  end

  # Rewrites only the leading scheme; "http" elsewhere in the URL is untouched.
  private def rename_http(link)
    link.sub(/\Ahttp:/, "https:")
  end
end

rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])

The Selma::Selector object has three possible kwargs:

  • match_element: any element which matches this CSS rule will be passed on to handle_element
  • match_text_within: any text_chunk which matches this CSS rule will be passed on to handle_text_chunk
  • ignore_text_within: this is an array of element names whose text contents will be ignored

Here's an example of handle_text_chunk that rewrites text inside any element other than pre and code:

class MatchText
  SELECTOR = Selma::Selector.new(match_text_within: "*", ignore_text_within: ["pre", "code"])

  def selector
    SELECTOR
  end

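  def handle_text_chunk(text)
    # Link each @mention, then write the result back as HTML.
    linked = text.to_s.gsub(/@(\w+)/) { %(<a href="www.yetto.app/#{Regexp.last_match(1)}">@#{Regexp.last_match(1)}</a>) }
    text.replace(linked, as: :html)
  end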
end

rewriter = Selma::Rewriter.new(handlers: [MatchText.new])

element methods

The element argument in handle_element has the following methods:

  • tag_name: Gets the element's name
  • tag_name=: Sets the element's name
  • self_closing?: A bool which identifies whether or not the element is self-closing
  • []: Get an attribute
  • []=: Set an attribute
  • remove_attribute: Remove an attribute
  • has_attribute?: A bool which identifies whether or not the element has an attribute
  • attributes: List all the attributes
  • ancestors: List all of an element's ancestors as an array of strings
  • before(content, as: content_type): Inserts content before the element. content_type is either :text or :html and determines how the content will be applied.
  • after(content, as: content_type): Inserts content after the element. content_type is either :text or :html and determines how the content will be applied.
  • prepend(content, as: content_type): prepends content to the element's inner content, i.e. inserts content right after the element's start tag. content_type is either :text or :html and determines how the content will be applied.
  • append(content, as: content_type): appends content to the element's inner content, i.e. inserts content right before the element's end tag. content_type is either :text or :html and determines how the content will be applied.
  • set_inner_content(content, as: content_type): Replaces the inner content of the element with content. content_type is either :text or :html and determines how the content will be applied.
  • remove: Removes the element and its inner content.
  • remove_and_keep_content: Removes the element, but keeps its content. I.e. remove start and end tags of the element.
  • removed?: A bool which identifies if the element has been removed or replaced with some content.
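As a rough illustration of how several of these methods might combine, here is a sketch of a handler that tidies up links. The class name HardenLinks, the chosen attributes, and the " (external)" suffix are all hypothetical; only the element methods themselves come from the list above.

```ruby
# Hypothetical handler; the selma gem must be loaded before a rewriter
# calls #selector.
class HardenLinks
  def selector
    Selma::Selector.new(match_element: "a[href]")
  end

  def handle_element(element)
    # Drop an inline event handler if one is present.
    element.remove_attribute("onclick") if element.has_attribute?("onclick")

    # Only absolute http(s) links are treated as external.
    return unless element["href"].to_s.start_with?("http://", "https://")

    element["rel"] = "noopener noreferrer"
    # as: :text asks for the suffix to be inserted as text, not markup.
    element.after(" (external)", as: :text)
  end
end
```

It would be registered like any other handler, for example Selma::Rewriter.new(handlers: [HardenLinks.new]).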

text_chunk methods

  • to_s / .content: Gets the text node's content
  • text_type: identifies the type of text in the text node
  • before(content, as: content_type): Inserts content before the text. content_type is either :text or :html and determines how the content will be applied.
  • after(content, as: content_type): Inserts content after the text. content_type is either :text or :html and determines how the content will be applied.
  • replace(content, as: content_type): Replaces the text node with content. content_type is either :text or :html and determines how the content will be applied.
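The text chunk methods can be combined in a similar way. The sketch below wraps every "TODO" in a mark element; the class name HighlightTodo and the marker string are hypothetical. It assumes the chunk's remaining text is safe to emit as HTML unchanged, which may not hold for arbitrary input.

```ruby
# Hypothetical handler; the selma gem must be loaded before a rewriter
# calls #selector.
class HighlightTodo
  def selector
    Selma::Selector.new(match_text_within: "p", ignore_text_within: ["pre", "code"])
  end

  def handle_text_chunk(text)
    content = text.to_s
    # Leave chunks without the marker exactly as they are.
    return unless content.include?("TODO")

    # as: :html because the replacement contains markup.
    text.replace(content.gsub("TODO", "<mark>TODO</mark>"), as: :html)
  end
end
```

Chunks without the marker are never replaced, so they pass through unchanged.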

Benchmarks

ruby test/benchmark.rb
Warming up --------------------------------------
sanitize-document-huge
                         1.000  i/100ms
 selma-document-huge     1.000  i/100ms
Calculating -------------------------------------
sanitize-document-huge
                          0.257  (± 0.0%) i/s -      2.000  in   7.783398s
 selma-document-huge      4.602  (± 0.0%) i/s -     23.000  in   5.002870s
Warming up --------------------------------------
sanitize-document-medium
                         2.000  i/100ms
selma-document-medium
                        22.000  i/100ms
Calculating -------------------------------------
sanitize-document-medium
                         28.676  (± 3.5%) i/s -    144.000  in   5.024669s
selma-document-medium
                        121.500  (±22.2%) i/s -    594.000  in   5.135410s
Warming up --------------------------------------
sanitize-document-small
                        10.000  i/100ms
selma-document-small    20.000  i/100ms
Calculating -------------------------------------
sanitize-document-small
                        107.280  (± 0.9%) i/s -    540.000  in   5.033850s
selma-document-small    118.867  (±31.1%) i/s -    540.000  in   5.080726s
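To read the output above as relative throughput, the iterations-per-second figures can be divided pairwise. This is only arithmetic on the numbers shown; the medium and small Selma runs report wide error margins (±22.2% and ±31.1%), so the ratios are rough.

```ruby
# Iterations per second, copied from the benchmark output above.
results = {
  "huge"   => { sanitize: 0.257,   selma: 4.602 },
  "medium" => { sanitize: 28.676,  selma: 121.500 },
  "small"  => { sanitize: 107.280, selma: 118.867 },
}

# Ratio of Selma's throughput to Sanitize's, rounded to one decimal place.
speedups = results.transform_values { |r| (r[:selma] / r[:sanitize]).round(1) }

speedups.each do |size, ratio|
  puts "#{size}: selma runs at about #{ratio}x the throughput of sanitize"
end
```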

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/gjtorikian/selma. This project is a safe, welcoming space for collaboration.

Acknowledgements

License

The gem is available as open source under the terms of the MIT License.
