Selma selects and matches HTML nodes using CSS rules. (It can also reject/delete nodes, but then the name isn't as cool.) It's mostly an idiomatic wrapper around Cloudflare's lol-html project.
Selma's strength (aside from being backed by Rust) is that HTML content is parsed once and can be manipulated multiple times.
Add this line to your application's Gemfile:
gem 'selma'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install selma
Selma can perform two different actions, either independently or together:
- Sanitize HTML, through a Sanitize-like allowlist syntax; and
- Select HTML using CSS rules, and manipulate elements and text nodes along the way.
It does this through two kwargs: sanitizer and handlers. The basic API for Selma looks like this:
sanitizer_config = {
elements: ["b", "em", "i", "strong", "u"],
}
sanitizer = Selma::Sanitizer.new(sanitizer_config)
rewriter = Selma::Rewriter.new(sanitizer: sanitizer, handlers: [MatchElementRewrite.new, MatchTextRewrite.new])
# removes any element that is not ["b", "em", "i", "strong", "u"];
# then calls `MatchElementRewrite` and `MatchTextRewrite` on matching HTML elements
rewriter.rewrite(html)
Here's a look at each individual part.
Selma sanitizes by default. That is, even if the sanitizer kwarg is not passed in, sanitization occurs. If you truly want to disable HTML sanitization (for some reason), pass nil:
Selma::Rewriter.new(sanitizer: nil) # dangerous and ill-advised
The configuration for the sanitization process is based on the following key-value hash allowlist:
# Whether or not to allow HTML comments.
allow_comments: false,
# Whether or not to allow well-formed HTML doctype declarations such as
# "<!DOCTYPE html>" when sanitizing a document.
allow_doctype: false,
# HTML elements to allow. By default, no elements are allowed (which means
# that all HTML will be stripped).
elements: ["a", "b", "img", ],
# HTML attributes to allow in specific elements. The key is the name of the element,
# and the value is an array of allowed attributes. By default, no attributes
# are allowed.
attributes: {
"a" => ["href"],
"img" => ["src"],
},
# URL handling protocols to allow in specific attributes. By default, no
# protocols are allowed. Use :relative in place of a protocol if you want
# to allow relative URLs sans protocol.
protocols: {
"a" => { "href" => ["http", "https", "mailto", :relative] },
"img" => { "src" => ["http", "https"] },
},
# An Array of element names whose contents will be removed. The contents
# of all other filtered elements will be left behind.
remove_contents: ["iframe", "math", "noembed", "noframes", "noscript"],
# Elements which, when removed, should have their contents surrounded by
# whitespace.
whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6", ]
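As a rough sketch of how the options above fit together, the hash below assembles them into a single config and runs a small consistency check in plain Ruby. The `unlisted_protocol_attributes` helper is hypothetical and not part of Selma; it assumes that a protocol rule for an attribute missing from the `attributes` allowlist has nothing to act on and is probably a mistake.

```ruby
# A sanitizer config assembled from the options described above.
SANITIZER_CONFIG = {
  allow_comments: false,
  allow_doctype: false,
  elements: ["a", "b", "img"],
  attributes: {
    "a" => ["href"],
    "img" => ["src"],
  },
  protocols: {
    "a" => { "href" => ["http", "https", "mailto", :relative] },
    "img" => { "src" => ["http", "https"] },
  },
  remove_contents: ["iframe", "math", "noembed", "noframes", "noscript"],
  whitespace_elements: ["blockquote", "h1", "h2", "h3", "h4", "h5", "h6"],
}.freeze

# Hypothetical helper (not part of Selma): returns [element, attribute] pairs
# that have a protocol rule but are missing from the attributes allowlist.
def unlisted_protocol_attributes(config)
  allowed = config.fetch(:attributes, {})
  config.fetch(:protocols, {}).flat_map do |element, rules|
    rules.keys
         .reject { |attribute| allowed.fetch(element, []).include?(attribute) }
         .map { |attribute| [element, attribute] }
  end
end

p unlisted_protocol_attributes(SANITIZER_CONFIG) # prints []
```

A config whose img protocol rule targets href while only src is allowed would come back as [["img", "href"]], which makes that kind of mismatch easy to spot before the config reaches Selma::Sanitizer.new.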
The real power in Selma comes in its use of handlers. A handler is simply an object with various methods defined:
- selector, a method which MUST return an instance of Selma::Selector, which defines the CSS rules to match
- handle_element, a method that's called on each matched element
- handle_text_chunk, a method that's called on each matched text node
Here's an example which rewrites the href attribute on a and the src attribute on img to be https rather than http.
class MatchAttribute
  SELECTOR = Selma::Selector.new(match_element: %(a[href^="http:"], img[src^="http:"]))

  # Tells the rewriter which elements to pass to handle_element
  def selector
    SELECTOR
  end

  def handle_element(element)
    if element.tag_name == "a"
      element["href"] = rename_http(element["href"])
    elsif element.tag_name == "img"
      element["src"] = rename_http(element["src"])
    end
  end

  private def rename_http(link)
    link.sub("http", "https")
  end
end
rewriter = Selma::Rewriter.new(handlers: [MatchAttribute.new])
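Because a handler is a plain Ruby object, its handle_element logic can be checked without running the rewriter. The sketch below assumes a hypothetical `FakeElement` stand-in that mimics only `tag_name`, `[]`, and `[]=`; it is not a Selma class. The `UpgradeScheme` handler is likewise illustrative and leaves out `selector`, since no rewriter is involved.

```ruby
# Hypothetical stand-in for the element Selma passes to handle_element.
class FakeElement
  attr_reader :tag_name

  def initialize(tag_name, attributes)
    @tag_name = tag_name
    @attributes = attributes
  end

  def [](name)
    @attributes[name]
  end

  def []=(name, value)
    @attributes[name] = value
  end
end

# Mirrors the scheme-upgrade logic of the handler above, minus the selector.
class UpgradeScheme
  def handle_element(element)
    attribute = element.tag_name == "img" ? "src" : "href"
    # Anchor on the leading scheme so an "http" later in the URL is left alone.
    element[attribute] = element[attribute].sub(/\Ahttp:/, "https:")
  end
end

link = FakeElement.new("a", { "href" => "http://example.com/http-guide" })
UpgradeScheme.new.handle_element(link)
puts link["href"] # prints https://example.com/http-guide
```

Keeping the logic independent of the rewriter like this is a design convenience, not something Selma requires.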
The Selma::Selector object has three possible kwargs:
- match_element: any element which matches this CSS rule will be passed on to handle_element
- match_text_within: any text_chunk which matches this CSS rule will be passed on to handle_text_chunk
- ignore_text_within: this is an array of element names whose text contents will be ignored
Here's an example for handle_text_chunk which changes strings in various elements which are not pre or code:
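class MatchText
  SELECTOR = Selma::Selector.new(match_text_within: "*", ignore_text_within: ["pre", "code"])

  def selector
    SELECTOR
  end

  def handle_text_chunk(text)
    # Block form, so Regexp.last_match refers to the match being replaced
    linked = text.to_s.sub(/@.+/) { %(<a href="www.yetto.app/#{Regexp.last_match}">) }
    # Apply the change to the text node, inserting the result as HTML
    text.replace(linked, as: :html)
  end
end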
rewriter = Selma::Rewriter.new(handlers: [MatchText.new])
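When replacement text is built from a match, the block form of sub or gsub is the safer choice: a double-quoted replacement string is interpolated before the substitution runs, so a `Regexp.last_match` inside it does not refer to the current match. Here is a minimal sketch in plain Ruby, using a hypothetical `link_mentions` helper and assuming that a mention is an `@` followed by word characters:

```ruby
# Hypothetical helper: wraps each @mention in a link.
def link_mentions(string)
  string.gsub(/@(\w+)/) do
    # Inside the block, Regexp.last_match describes the match being replaced.
    handle = Regexp.last_match(1)
    %(<a href="www.yetto.app/#{handle}">@#{handle}</a>)
  end
end

puts link_mentions("thanks @alice and @bob")
```

Inside a handler, a string produced this way would presumably be handed to the text node's replace method with as: :html, so that the markup is inserted as HTML, not as literal text.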
The element argument in handle_element has the following methods:
- tag_name: Gets the element's name
- tag_name=: Sets the element's name
- self_closing?: A bool which identifies whether or not the element is self-closing
- []: Get an attribute
- []=: Set an attribute
- remove_attribute: Remove an attribute
- has_attribute?: A bool which identifies whether or not the element has an attribute
- attributes: List all the attributes
- ancestors: List all of an element's ancestors as an array of strings
- before(content, as: content_type): Inserts content before the element. content_type is either :text or :html and determines how the content will be applied.
- after(content, as: content_type): Inserts content after the element. content_type is either :text or :html and determines how the content will be applied.
- prepend(content, as: content_type): Prepends content to the element's inner content, i.e. inserts content right after the element's start tag. content_type is either :text or :html and determines how the content will be applied.
- append(content, as: content_type): Appends content to the element's inner content, i.e. inserts content right before the element's end tag. content_type is either :text or :html and determines how the content will be applied.
- set_inner_content: Replaces inner content of the element with content. content_type is either :text or :html and determines how the content will be applied.
- remove: Removes the element and its inner content.
- remove_and_keep_content: Removes the element, but keeps its content, i.e. removes the start and end tags of the element.
- removed?: A bool which identifies if the element has been removed or replaced with some content.
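To illustrate how a handler might combine a few of the element methods listed above, here is a small sketch that runs without Selma. `StubElement` is a hypothetical stand-in that implements only `has_attribute?`, `[]=`, and `append`, and records calls instead of editing HTML; `MarkNewTabLinks` is likewise an illustrative handler, not something shipped with the gem.

```ruby
# Hypothetical stand-in implementing a few of the element methods listed above.
class StubElement
  attr_reader :tag_name, :attributes, :appended

  def initialize(tag_name, attributes = {})
    @tag_name = tag_name
    @attributes = attributes
    @appended = []
  end

  def has_attribute?(name)
    @attributes.key?(name)
  end

  def []=(name, value)
    @attributes[name] = value
  end

  # Records the call instead of editing real HTML.
  def append(content, as:)
    @appended << [content, as]
  end
end

# Illustrative handler: annotates links that carry a target attribute.
class MarkNewTabLinks
  def handle_element(element)
    return unless element.has_attribute?("target")

    element["rel"] = "noopener"
    element.append(" (opens in a new tab)", as: :text)
  end
end

link = StubElement.new("a", { "href" => "/docs", "target" => "_blank" })
MarkNewTabLinks.new.handle_element(link)
p link.attributes
p link.appended
```

With a real rewriter, such a handler would also need a selector method, for example one matching a[target].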
The text argument in handle_text_chunk has the following methods:
- to_s / .content: Gets the text node's content
- text_type: Identifies the type of text in the text node
- before(content, as: content_type): Inserts content before the text. content_type is either :text or :html and determines how the content will be applied.
- after(content, as: content_type): Inserts content after the text. content_type is either :text or :html and determines how the content will be applied.
- replace(content, as: content_type): Replaces the text node with content. content_type is either :text or :html and determines how the content will be applied.
ruby test/benchmark.rb
Warming up --------------------------------------
sanitize-document-huge       1.000  i/100ms
selma-document-huge          1.000  i/100ms
Calculating -------------------------------------
sanitize-document-huge       0.257  (± 0.0%) i/s -   2.000  in 7.783398s
selma-document-huge          4.602  (± 0.0%) i/s -  23.000  in 5.002870s
Warming up --------------------------------------
sanitize-document-medium     2.000  i/100ms
selma-document-medium       22.000  i/100ms
Calculating -------------------------------------
sanitize-document-medium    28.676  (± 3.5%) i/s - 144.000  in 5.024669s
selma-document-medium      121.500  (±22.2%) i/s - 594.000  in 5.135410s
Warming up --------------------------------------
sanitize-document-small     10.000  i/100ms
selma-document-small        20.000  i/100ms
Calculating -------------------------------------
sanitize-document-small    107.280  (± 0.9%) i/s - 540.000  in 5.033850s
selma-document-small       118.867  (±31.1%) i/s - 540.000  in 5.080726s
Bug reports and pull requests are welcome on GitHub at https://github.com/gjtorikian/selma. This project is a safe, welcoming space for collaboration.
- https://github.com/flavorjones/ruby-c-extensions-explained#strategy-3-precompiled and Nokogiri for hints on how to ship precompiled cross-platform gems
- @vmg for his work at GitHub on goomba, from which some design patterns were learned
- sanitize for a comprehensive configuration API and test suite
The gem is available as open source under the terms of the MIT License.