  • Stars: 3,747
  • Rank: 11,766 (Top 0.3%)
  • Language: Python
  • License: Other
  • Created: about 12 years ago
  • Updated: 3 months ago

Repository Details

Fixes mojibake and other glitches in Unicode text, after the fact.

ftfy: fixes text for you

>>> from ftfy import fix_encoding
>>> print(fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง

The full documentation of ftfy is available at ftfy.readthedocs.org. It covers a lot more than this README.

Testimonials

  • “My life is livable again!” — @planarrowspace
  • “A handy piece of magic” — @simonw
  • “Saved me a large amount of frustrating dev work” — @iancal
  • “ftfy did the right thing right away, with no faffing about. Excellent work, solving a very tricky real-world (whole-world!) problem.” — Brennan Young
  • “I have no idea when I’m gonna need this, but I’m definitely bookmarking it.” — /u/ocrow
  • “9.2/10” — pylint

What it does

Here are some examples (found in the real world) of what ftfy can do:

ftfy can fix mojibake (encoding mix-ups), by detecting patterns of characters that were clearly meant to be UTF-8 but were decoded as something else:

>>> import ftfy
>>> ftfy.fix_text('âœ” No problems')
'✔ No problems'

Does this sound impossible? It's really not. UTF-8 is a well-designed encoding that makes it obvious when it's being misused, and a string of mojibake usually contains all the information we need to recover the original string.
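
To see why the information survives, here is a minimal sketch that uses only Python's built-in codecs, not ftfy itself. It assumes the wrong decoder was Windows-1252, which is only one of several mix-ups ftfy handles; the variable names are illustrative.

```python
# Simulate the mix-up: UTF-8 bytes are read back with the wrong codec.
original = "✔ No problems"
garbled = original.encode("utf-8").decode("windows-1252")

# Reverse it: recover the original bytes, then decode them as UTF-8.
recovered = garbled.encode("windows-1252").decode("utf-8")

print(garbled)    # âœ” No problems
print(recovered)  # ✔ No problems
```

Every byte of the check mark is still present in the garbled string, just displayed as three unrelated characters, which is why the round trip is lossless in this case.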

ftfy can fix multiple layers of mojibake simultaneously:

>>> ftfy.fix_text('The Mona Lisa doesnÃƒÂ¢Ã¢â€šÂ¬Ã¢â€žÂ¢t have eyebrows.')
"The Mona Lisa doesn't have eyebrows."

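Layered mojibake can be illustrated with the standard library alone. The sketch below is not ftfy's algorithm: `add_layer` and `peel_layers` are hypothetical helpers, Windows-1252 is assumed to be the wrong codec at every layer, and there is no heuristic to guard against false positives.

```python
def add_layer(text: str) -> str:
    """Mis-decode the UTF-8 bytes of `text` as Windows-1252 (one layer)."""
    return text.encode("utf-8").decode("windows-1252")


def peel_layers(text: str, max_layers: int = 10) -> str:
    """Undo Windows-1252/UTF-8 layers until the text stops changing."""
    for _ in range(max_layers):
        try:
            candidate = text.encode("windows-1252").decode("utf-8")
        except UnicodeError:
            break  # not reinterpretable as UTF-8, so stop here
        if candidate == text:
            break  # pure ASCII round-trips unchanged
        text = candidate
    return text


clean = "doesn\u2019t"  # "doesn’t" with a curly apostrophe
garbled = add_layer(add_layer(add_layer(clean)))
print(garbled)
print(peel_layers(garbled) == clean)  # True
```

Unlike `ftfy.fix_text`, this loop leaves the curly apostrophe in place, and it would happily "fix" text that was never broken, so treat it as an illustration only.
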
It can fix mojibake that has had "curly quotes" applied on top of it, which cannot be consistently decoded until the quotes are uncurled:

>>> ftfy.fix_text("l’humanitÃ©")
"l'humanité"

ftfy can fix mojibake that would have included the character U+A0 (non-breaking space), but the U+A0 was turned into an ASCII space and then combined with another following space:

>>> ftfy.fix_text('Ã\xa0 perturber la rÃ©flexion')
'à perturber la réflexion'
>>> ftfy.fix_text('Ã perturber la rÃ©flexion')
'à perturber la réflexion'
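
The second case can be sketched at the byte level. This is a simplified illustration under stated assumptions, not ftfy's implementation: `restore_a_grave` is a hypothetical helper, it assumes Windows-1252 was the wrong codec, and it blindly treats every 0xC3 byte followed by a space as a damaged "à".

```python
def restore_a_grave(text: str) -> str:
    """Put back a 0xA0 byte that was flattened into an ASCII space."""
    raw = text.encode("windows-1252")
    # 0xC3 followed by a plain space is not valid UTF-8. Assume the space
    # was once 0xA0, and re-add the word space that was merged into it.
    raw = raw.replace(b"\xc3 ", b"\xc3\xa0 ")
    return raw.decode("utf-8")


print(restore_a_grave("Ã perturber la rÃ©flexion"))  # à perturber la réflexion
```

A real fixer needs more context before making this substitution, since "Ã" followed by a space can also occur in text that was decoded correctly.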

ftfy can also decode HTML entities that appear outside of HTML, even in cases where the entity has been incorrectly capitalized:

>>> # by the HTML 5 standard, only 'P&Eacute;REZ' is acceptable
>>> ftfy.fix_text('P&EACUTE;REZ')
'PÉREZ'
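
For comparison, Python's own `html.unescape` follows the standard strictly and leaves the all-caps entity alone. The sketch below shows one way a lenient fallback could work; `unescape_loose` is a hypothetical helper written for this example, and ftfy's actual rules may differ.

```python
import html
import re
from html.entities import html5

# Named entities only, e.g. "&Eacute;"; numeric references are out of scope.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def unescape_loose(text: str) -> str:
    """Decode named entities, tolerating names written in all capitals."""
    def replace(match):
        name = match.group(1)
        if name in html5:
            return html5[name]
        # Fallback: look up the lowercase name and uppercase its result.
        if name.isupper() and name.lower() in html5:
            return html5[name.lower()].upper()
        return match.group(0)  # unknown entity: leave it untouched

    return ENTITY_RE.sub(replace, text)


print(html.unescape("P&EACUTE;REZ"))   # P&EACUTE;REZ
print(unescape_loose("P&EACUTE;REZ"))  # PÉREZ
```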

These fixes are not applied in all cases, because ftfy has a strongly-held goal of avoiding false positives -- it should never change correctly-decoded text to something else.

The following text could be encoded in Windows-1252 and decoded in UTF-8, and it would decode as 'MARQUɅ'. However, the original text is already sensible, so it is unchanged.

>>> ftfy.fix_text('IL Y MARQUÉ…')
'IL Y MARQUÉ…'
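
The reinterpretation described above can be reproduced with Python's standard codecs. This sketch only shows that the alternative decoding exists; how ftfy decides to reject it is part of its heuristic and is not shown here.

```python
text = "IL Y MARQUÉ…"

# "É…" is the byte pair C9 85 in Windows-1252, which happens to be a
# valid UTF-8 sequence for U+0245 (LATIN CAPITAL LETTER TURNED V).
reinterpreted = text.encode("windows-1252").decode("utf-8")

print(reinterpreted)  # IL Y MARQUɅ
```

A fixer that accepted every decoding that succeeds would corrupt this text, which is why a successful decode alone is not treated as evidence of mojibake.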

Installing

ftfy is a Python 3 package that can be installed using pip:

pip install ftfy

(Or use pip3 install ftfy on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)

Local development

ftfy is developed using poetry. Its setup.py is vestigial and is not the recommended way to install it.

Install Poetry, check out this repository, and run poetry install to install ftfy for local development, such as experimenting with the heuristic or running tests.

Who maintains ftfy?

I'm Robyn Speer, also known as Elia Robyn Lake. You can find me on GitHub or Twitter.

Citing ftfy

ftfy has been used as a crucial data processing step in major NLP research.

It's important to give credit appropriately to everyone whose work you build on in research. This includes software, not just high-status contributions such as mathematical models. All I ask when you use ftfy for research is that you cite it.

ftfy has a citable record on Zenodo. A citation of ftfy may look like this:

Robyn Speer. (2019). ftfy (Version 5.5). Zenodo.
http://doi.org/10.5281/zenodo.2591652

In BibTeX format, the citation is:

@misc{speer-2019-ftfy,
  author       = {Robyn Speer},
  title        = {ftfy},
  note         = {Version 5.5},
  year         = 2019,
  howpublished = {Zenodo},
  doi          = {10.5281/zenodo.2591652},
  url          = {https://doi.org/10.5281/zenodo.2591652}
}

More Repositories

1. wordfreq (Python, 698 stars): Access a database of word frequencies, in various natural languages.
2. langcodes (Python, 340 stars): A Python library for working with and comparing language codes.
3. ordered-set (Python, 210 stars): A mutable set that remembers the order of its entries. One of Python's missing data types.
4. wiki2text (Nim, 132 stars): Extract a plain text corpus from MediaWiki XML dumps, such as Wikipedia.
5. dominiate (JavaScript, 120 stars): A simulator for Dominion card game strategies
6. text-as-data (Python, 50 stars): A PyData 2013 talk on straightforward, data-driven ways to handle natural language text in Python.
7. wikiparsec (Haskell, 48 stars): An LL parser for extracting information from Wiki text, particularly Wiktionary.
8. solvertools (JavaScript, 32 stars): Mystery Hunt solving tools for Metropolitan Rage Warehouse. Or anyone really.
9. scholar.hasfailed.us (HTML, 22 stars): Google Scholar is a trans-exclusionary site. Don't use it. Help us demand change.
10. dominiate-python (Python, 15 stars): A Python implementation of the card game Dominion
11. openmind-commons (JavaScript, 11 stars): The dynamic Web site that lets people browse and contribute to Open Mind Common Sense and ConceptNet.
12. dominionstats (JavaScript, 11 stars): The code behind councilroom.com.
13. csc-pysparse (C, 10 stars): A fast sparse matrix library for Python (Commonsense Computing version)
14. music-decomp (Python, 10 stars): Associating music/sound and semantics
15. mixmaster (Python, 9 stars): Smarter than the average anagrammer.
16. language_data (Python, 7 stars): An optional supplement to `langcodes` that stores names and statistics of languages.
17. scorepile (JavaScript, 6 stars): A repository of Innovation games played on Isotropic
18. solvertools-2014 (Julia, 4 stars)
19. adventure (Python, 4 stars): Common sense experiments for working with text adventures.
20. charcol (Python, 4 stars): An experiment to collect unusual characters from Twitter.
21. verb-aspect-learning (3 stars): A hierarchical Bayesian model of biases in how people learn novel verbs
22. dominion-rank (Python, 3 stars): Calculate ranks based on people's play on dominion.isomorphic.org.
23. countmerge (Rust, 3 stars): A command-line tool that adds counts for sorted keys.
24. svdview (Java, 3 stars): A Processing viewer for the results of dimensionality reduction.
25. spacious_corpus (Python, 3 stars): A corpus build process for use with SpaCy projects
26. colorizer (JavaScript, 2 stars)
27. analogy_farm (Python, 2 stars): A Web-based puzzle from MIT Mystery Hunt 2013.
28. irepad (JavaScript, 2 stars): An IRE PROOF collaborative editor, built on FirePad.
29. rust-nlp-tools (Rust, 2 stars)
30. rspeer-web (JavaScript, 2 stars): My personal Web site.
31. rspeer.github.io (TeX, 1 star): rspeer's Octopress site