  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 7 years ago
  • Updated 5 months ago

Repository Details

Allowlist-based HTML cleaner

HTML sanitizer

This is an allowlist-based and very opinionated HTML sanitizer that can be used for both untrusted and trusted sources. It attempts to clean up the mess made by various rich text editors and/or copy-pasting to make styling of webpages simpler and more consistent. It builds on the excellent HTML cleaner in lxml to make the result both valid and safe.

HTML sanitizer goes further than e.g. bleach in that it not only ensures that content is safe and tags and attributes conform to a given allowlist, but also applies additional transforms to HTML fragments.

Goals

  • Clean up HTML fragments using a very restricted set of allowed tags and attributes.
  • Convert some tags (such as <span style="...">, <b> and <i>) into either <strong> or <em> (but never both).
  • Absolutely disallow all inline styles.
  • Normalize whitespace by removing repeated line breaks, empty paragraphs and other empty elements.
  • Merge adjacent tags of the same type (such as several <strong> or <h3> directly after each other).
  • Automatically remove redundant list markers inside <li> tags.
  • Clean up some ugliness such as paragraphs inside paragraphs or list elements etc.
  • Normalize unicode.
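
As a rough illustration of what Unicode normalization does to pasted text, the sketch below uses the standard library's unicodedata module. The choice of NFKC is an assumption made for this example; this README does not state which normalization form the sanitizer applies.

```python
import unicodedata

# Assumption: NFKC is used purely for illustration here.
samples = {
    "ligature": "\ufb01le",      # U+FB01 LATIN SMALL LIGATURE FI + "le"
    "nbsp": "10\u00a0km",        # U+00A0 NO-BREAK SPACE between the words
    "combining": "e\u0301",      # "e" followed by U+0301 COMBINING ACUTE ACCENT
}

# Compatibility characters are replaced and combining sequences are composed.
normalized = {
    name: unicodedata.normalize("NFKC", text) for name, text in samples.items()
}

for name, text in normalized.items():
    print(name, repr(text))
```

Note that compatibility normalization turns a non-breaking space into a plain space, which is the kind of typographic whitespace the keep_typographic_whitespace setting is concerned with.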

Usage

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer()  # default configuration
>>> sanitizer.sanitize('<span style="font-weight:bold">some text</span>')
'<strong>some text</strong>'

Settings

  • Bold spans and b tags are converted into strong tags, italic spans and i tags into em tags (if strong and em are allowed at all).
  • Inline styles and scripts will always be dropped.
  • A div element is used to wrap the HTML fragment for the parser, therefore div tags are not allowed.

The default settings are:

DEFAULT_SETTINGS = {
    "tags": {
        "a", "h1", "h2", "h3", "strong", "em", "p", "ul", "ol",
        "li", "br", "sub", "sup", "hr",
    },
    "attributes": {"a": ("href", "name", "target", "title", "id", "rel")},
    "empty": {"hr", "a", "br"},
    "separate": {"a", "p", "li"},
    "whitespace": {"br"},
    "keep_typographic_whitespace": False,
    "add_nofollow": False,
    "autolink": False,
    "sanitize_href": sanitize_href,
    "element_preprocessors": [
        # convert span elements into em/strong if a matching style rule
        # has been found. strong has precedence, strong & em at the same
        # time is not supported
        bold_span_to_strong,
        italic_span_to_em,
        tag_replacer("b", "strong"),
        tag_replacer("i", "em"),
        tag_replacer("form", "p"),
        target_blank_noopener,
    ],
    "element_postprocessors": [],
    "is_mergeable": lambda e1, e2: True,
}

The keys' meaning is as follows:

  • tags: A set() of allowed tags.
  • attributes: A dict() mapping tags to their allowed attributes.
  • empty: Tags which are allowed to be empty. By default, empty tags (containing no text or only whitespace) are dropped.
  • separate: Tags which are not merged if they appear as siblings. By default, tags of the same type are merged.
  • whitespace: Tags which are treated as whitespace and removed from the beginning or end of other tags' content.
  • keep_typographic_whitespace: Keep space characters that are used typographically, such as non-breaking spaces.
  • add_nofollow: Whether to add rel="nofollow" to all links.
  • autolink: Enable lxml's autolinker. May be either a boolean or a dictionary; a dictionary is passed as keyword arguments to autolink.
  • sanitize_href: A callable that receives an anchor's href value and returns a sanitized version. The default implementation checks whether links start with one of a few allowed prefixes and, if not, returns a single hash (#).
  • element_preprocessors and element_postprocessors: Additional filters that are called on all elements in the tree. The tree is processed in reverse depth-first order. Under certain circumstances elements are processed more than once (search the code for backlog.append). Preprocessors are run before whitespace normalization, postprocessors afterwards.
  • is_mergeable: Adjacent elements which aren't kept separate are merged by default. This callable can be used to prevent merging of adjacent elements, e.g. when their classes do not match (lambda e1, e2: e1.get('class') == e2.get('class')).
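
The callable-valued settings above (sanitize_href, element_preprocessors and is_mergeable) can be sketched roughly as follows. This is a minimal illustration, not the library's own code: the names strict_sanitize_href, replace_tag and same_class, the prefix list, and the convention that a preprocessor returns the element it was given are all assumptions. The standard library's ElementTree stands in for lxml elements so that the sketch runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix list, chosen for this example only.
ALLOWED_PREFIXES = ("/", "#", "https://", "mailto:")


def strict_sanitize_href(href):
    """Return href unchanged if it starts with an allowed prefix, else "#"."""
    if href.startswith(ALLOWED_PREFIXES):
        return href
    return "#"


def replace_tag(from_tag, to_tag):
    """Build a preprocessor that renames from_tag elements to to_tag."""

    def preprocessor(element):
        if element.tag == from_tag:
            element.tag = to_tag
        # Assumption: processors hand the (possibly modified) element back.
        return element

    return preprocessor


def same_class(e1, e2):
    """Allow merging only when both elements carry the same class attribute."""
    return e1.get("class") == e2.get("class")


# Exercise the callables on standalone elements.
bold = replace_tag("b", "strong")(ET.fromstring("<b>text</b>"))
left = ET.fromstring('<p class="lead">a</p>')
right = ET.fromstring('<p class="note">b</p>')

print(bold.tag)
print(strict_sanitize_href("javascript:alert(1)"))
print(same_class(left, right))
```

Since settings can be given partially, callables like these would presumably be passed under the sanitize_href, element_preprocessors and is_mergeable keys when constructing a Sanitizer.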

Settings can be specified partially when initializing a sanitizer instance, but are still checked for consistency. For example, it is not allowed to have tags in empty that are not in tags, that is, tags that are allowed to be empty but at the same time not allowed at all. The Sanitizer constructor raises TypeError exceptions when it detects inconsistencies.
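
The kind of consistency check described above might look like the following sketch. Only the rule about empty is stated in this README. The checks on separate and attributes, the non-empty tags rule and the div rule are assumptions made for illustration, and check_settings is a hypothetical helper, not part of the library.

```python
def check_settings(settings):
    """Raise TypeError if the settings contradict each other."""
    tags = set(settings.get("tags", ()))
    problems = []

    if not tags:
        problems.append("no tags are allowed")
    if "div" in tags:
        # div wraps the fragment for the parser, so it cannot be allowed.
        problems.append("div is reserved for wrapping the fragment")

    # Tags named in these settings must also be allowed at all.
    for key in ("empty", "separate"):
        unknown = set(settings.get(key, ())) - tags
        if unknown:
            problems.append(f"{key} names tags that are not allowed: {sorted(unknown)}")

    unknown = set(settings.get("attributes", {})) - tags
    if unknown:
        problems.append(f"attributes names tags that are not allowed: {sorted(unknown)}")

    if problems:
        raise TypeError("; ".join(problems))


# "hr" may be empty but is not an allowed tag, so this is rejected.
try:
    check_settings({"tags": {"p"}, "empty": {"hr"}})
except TypeError as exc:
    print("rejected:", exc)
```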

An example of an even more restricted configuration might be:

>>> from html_sanitizer import Sanitizer
>>> sanitizer = Sanitizer({
...     'tags': ('h1', 'h2', 'p'),
...     'attributes': {},
...     'empty': set(),
...     'separate': set(),
... })

The rationale for such a restricted set of allowed tags (e.g. no images) is documented in the design decisions section of django-content-editor's documentation.

Django

HTML sanitizer does not depend on Django, but ships with a module which makes configuring sanitizers using Django settings easier. Usage is as follows:

>>> from html_sanitizer.django import get_sanitizer
>>> sanitizer = get_sanitizer([name=...])

Different sanitizers can be configured. The default configuration is aptly named 'default'. Example settings follow:

HTML_SANITIZERS = {
    'default': {
      'tags': ...,
    },
    ...
}

The 'default' configuration is special: If it isn't explicitly defined, the default configuration above is used instead. Non-existing configurations will lead to ImproperlyConfigured exceptions.

The get_sanitizer function caches sanitizer instances, so feel free to call it as often as you want to.
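
A per-name cache of this kind can be sketched as below. This is not the module's actual implementation: StandInSanitizer, get_cached_sanitizer and the module-level HTML_SANITIZERS dict are stand-ins so that the example runs without Django, and LookupError takes the place of ImproperlyConfigured.

```python
# Stand-in for the Django setting of the same name.
HTML_SANITIZERS = {
    "strict": {"tags": ("h1", "h2", "p")},
}


class StandInSanitizer:
    """Placeholder for a real sanitizer; it only remembers its settings."""

    def __init__(self, settings=None):
        self.settings = settings or {}


_instances = {}


def get_cached_sanitizer(name="default"):
    """Return one shared sanitizer instance per configuration name."""
    if name not in _instances:
        if name in HTML_SANITIZERS:
            config = HTML_SANITIZERS[name]
        elif name == "default":
            # An undefined "default" falls back to the built-in defaults.
            config = {}
        else:
            raise LookupError(f"Unknown sanitizer configuration {name!r}")
        _instances[name] = StandInSanitizer(config)
    return _instances[name]


# Repeated calls return the same object, so calling often is cheap.
print(get_cached_sanitizer("strict") is get_cached_sanitizer("strict"))
```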

Security issues

Please report security issues to me directly at [email protected].
