• Stars
    959
  • Rank 47,306 (Top 1.0 %)
  • Language
    Python
  • License
    MIT License
  • Created about 12 years ago
  • Updated 2 months ago

Repository Details

Convert HTML to Markdown

Installation

pip install markdownify

Usage

Convert some HTML to Markdown:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'

Specify tags to exclude:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a'])  # > '**Yay** GitHub'

...or specify the tags you want to include:

from markdownify import markdownify as md
md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b'])  # > '**Yay** GitHub'
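The two filters are alternatives: strip names the tags whose markup is dropped, convert names the only tags that get converted, and the Options section notes that they cannot be combined. A minimal sketch of that selection rule follows. should_convert is a hypothetical helper written for illustration, not markdownify's own code, and raising ValueError for the combined case is an assumption.

```python
def should_convert(tag, strip=None, convert=None):
    """Decide whether `tag` gets Markdown markup under the strip/convert filters."""
    if strip is not None and convert is not None:
        # The two options are documented as mutually exclusive (assumed error type).
        raise ValueError("use either strip or convert, not both")
    if strip is not None:
        return tag not in strip   # everything except the stripped tags
    if convert is not None:
        return tag in convert     # only the listed tags
    return True                   # no filter given: convert every supported tag


# Mirrors the two examples above: <b> keeps its markup, <a> loses it.
print(should_convert("b", strip=["a"]), should_convert("a", strip=["a"]))
print(should_convert("b", convert=["b"]), should_convert("a", convert=["b"]))
```

In both examples the link text survives because only the markup of a filtered-out tag is dropped, not its contents.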

Options

Markdownify supports the following options:

strip
A list of tags to strip. This option can't be used with the convert option.
convert
A list of tags to convert. This option can't be used with the strip option.
autolinks
A boolean indicating whether the "automatic link" style should be used when an <a> tag's contents match its href. Defaults to True.
default_title
A boolean to enable setting the title of a link to its href, if no title is given. Defaults to False.
heading_style
Defines how headings should be converted. Accepted values are ATX, ATX_CLOSED, SETEXT, and UNDERLINED (which is an alias for SETEXT). Defaults to UNDERLINED.
bullets
An iterable (string, list, or tuple) of bullet styles to be used. If the iterable only contains one item, it will be used regardless of how deeply lists are nested. Otherwise, the bullet will alternate based on nesting level. Defaults to '*+-'.
strong_em_symbol
In markdown, both * and _ are used to encode strong or emphasized text. Either symbol can be chosen with the options ASTERISK (default) or UNDERSCORE, respectively.
sub_symbol, sup_symbol
Define the chars that surround <sub> and <sup> text. Defaults to an empty string, because this is non-standard behavior. Could be something like ~ and ^ to result in ~sub~ and ^sup^.
newline_style
Defines how line breaks (<br>) are marked in markdown. The default value, SPACES, uses the usual two trailing spaces followed by a newline, while BACKSLASH converts a line break to \\n (a backslash and a newline). Although the latter convention is non-standard, it is commonly preferred and supported by many interpreters.
code_language
Defines the language that should be assumed for all <pre> sections. Useful if all code on a page is in the same programming language and should be annotated with ```python or similar. Defaults to '' (empty string) and can be any string.
code_language_callback

When the HTML contains pre tags that indicate the code language in some way, for example as a class, this callback can be used to extract the language from the tag and prefix it to the converted pre tag. The callback receives a single argument, a BeautifulSoup object, and returns a string containing the code language, or None. An example that uses the class name as the code language could be:

def callback(el):
    return el['class'][0] if el.has_attr('class') else None

Defaults to None.
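Syntax highlighters often mark the language with a prefixed class such as language-python rather than a bare name. A variant of the callback for that case might look like the sketch below. The function name and the language- and lang- prefixes are assumptions made for illustration, and the function only inspects classes on the element it is given.

```python
def language_from_class(el):
    """Return the language named by a `language-*` or `lang-*` class, else None."""
    # The class attribute is expected as a list of names; .get() yields None
    # when the attribute is absent, which the `or []` turns into an empty list.
    for name in el.get("class") or []:
        for prefix in ("language-", "lang-"):
            if name.startswith(prefix) and len(name) > len(prefix):
                return name[len(prefix):]
    return None


# A plain dict stands in for the parsed tag here, since both offer .get().
print(language_from_class({"class": ["highlight", "language-python"]}))
print(language_from_class({}))
```

It would be passed in the same way as the callback above, for example md(html, code_language_callback=language_from_class).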

escape_asterisks
If set to False, do not escape * to \* in text. Defaults to True.
escape_underscores
If set to False, do not escape _ to \_ in text. Defaults to True.
keep_inline_images_in
Images are converted to their alt-text when the images are located inside headlines or table cells. If some inline images should be converted to markdown images instead, this option can be set to a list of parent tags that should be allowed to contain inline images, for example ['td']. Defaults to an empty list.
wrap, wrap_width
If wrap is set to True, all text paragraphs are wrapped at wrap_width characters. The defaults are False for wrap and 80 for wrap_width. Use with newline_style=BACKSLASH to keep line breaks in paragraphs.
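To make the bullets rule concrete: with more than one bullet style, the marker depends on how deeply a list is nested. The sketch below assumes the styles are cycled by nesting depth and wrap around once they are exhausted. bullet_for_depth is a hypothetical helper, not a markdownify function.

```python
def bullet_for_depth(bullets, depth):
    """Pick the list marker for a list nested `depth` levels deep (0 = top level)."""
    # Indexing works for a string, list, or tuple; the modulo wraps around
    # when lists nest deeper than the number of styles supplied.
    return bullets[depth % len(bullets)]


# Print one item per nesting level, indented by two spaces per level.
for depth in range(4):
    print("  " * depth + bullet_for_depth("*+-", depth) + " item")
```

With a single style such as bullets='-', the same marker comes back at every depth, which matches the single-item case described for the option.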

Options may be specified as kwargs to the markdownify function, or as a nested Options class in MarkdownConverter subclasses.

Converting BeautifulSoup objects

from markdownify import MarkdownConverter

# Create shorthand method for conversion
def md(soup, **options):
    return MarkdownConverter(**options).convert_soup(soup)

Creating Custom Converters

If you have a special use case that calls for a special conversion, you can always inherit from MarkdownConverter and override the method you want to change:

from markdownify import MarkdownConverter

class ImageBlockConverter(MarkdownConverter):
    """
    Create a custom MarkdownConverter that adds two newlines after an image
    """
    def convert_img(self, el, text, convert_as_inline):
        return super().convert_img(el, text, convert_as_inline) + '\n\n'

# Create shorthand method for conversion
def md(html, **options):
    return ImageBlockConverter(**options).convert(html)

Command Line Interface

Use markdownify example.html > example.md or pipe input from stdin (cat example.html | markdownify > example.md). Call markdownify -h to see all available options. They are the same as listed above and take the same arguments.

Development

To run the tests and the linter, run pip install tox once, then tox.

More Repositories

1. django-imagekit (Python, 2,259 stars): Automated image processing for Django. Currently v4.0
2. react-controllables (JavaScript, 256 stars): Easily create controllable components
3. pilkit (Python, 195 stars): Utilities and processors built for, and on top of PIL
4. monorouter (JavaScript, 141 stars): An isomorphic JS router
5. react-mediaswitch (CoffeeScript, 59 stars): Choose your DOM based on media queries
6. react-frozenhead (JavaScript, 57 stars): Make your whole page a React component and render it on the server
7. django-classbasedsettings (Python, 52 stars): Class based settings. You couldn't get that from the title?
8. httpplease.js (JavaScript, 31 stars): The polite HTTP request library for node and the browser
9. django-html5boilerplate (Python, 20 stars): A packaging of Paul Irish's HTML5 Boilerplate for Django projects.
10. markdown-with-front-matter-loader (JavaScript, 19 stars): A webpack loader for markdown with yaml front matter (think Jekyll)
11. jquery-icbiacontrol (JavaScript, 12 stars): Style browser controls without losing their native behaviors.
12. assert-this.js (JavaScript, 7 stars): A clear assertion style that uses virtual methods instead of wrappers
13. grunt-jinja (CoffeeScript, 6 stars): A grunt plugin for compiling Jinja2 templates with James Long's nunjucks templating system
14. jquery.picplus (JavaScript, 6 stars): Take control of your images.
15. django-throttleandcache (Python, 5 stars): Cache the results of arbitrary function calls!
16. gulp-heroku-deploy-slug (JavaScript, 4 stars): Deploy slug archives to Heroku
17. react-template (JavaScript, 3 stars): Organized rendering.
18. wrapperator.js (JavaScript, 2 stars): Create a function that can be used as a wrapper or method decorator
19. jquery.picplus.queueup (JavaScript, 2 stars): Use a load queue to load your picplus pictures. Load the important stuff first.
20. jquery.picplus.lazyload (CoffeeScript, 2 stars): Add lazyloading to your picplus pictures. Load them when they're scrolled into your viewport.
21. httpplease-promises (JavaScript, 2 stars): A plugin that adds promise support to httpplease
22. monorouter-react (JavaScript, 2 stars): ReactJS tools for monorouter
23. django-maid (Python, 1 star): Django maid cleans up your orphaned media files.
24. Google-Closure.tmbundle (JavaScript, 1 star): The fastest way to get started with Google Closure.
25. jquery.parallaxin (JavaScript, 1 star): Create an element that scrolls at a different speed than the document but always stays within a container.
26. django-scspostgis (Python, 1 star): A PostGIS db backend that works with standard_conforming_strings.