  • Stars: 909
  • Rank: 50,251 (Top 1.0 %)
  • Language: Python
  • License: Other
  • Created: almost 6 years ago
  • Updated: over 1 year ago


Repository Details

🧹 Python package for text cleaning


User-generated content on the Web and in social media is often dirty. Preprocess your scraped data with clean-text to create a normalized text representation. For instance, turn this corrupted input:

A bunch of \u2018new\u2019 references, including [Moana](https://en.wikipedia.org/wiki/Moana_%282016_film%29).


»Yóù àré     rïght <3!«

into this clean output:

A bunch of 'new' references, including [moana](<URL>).

"you are right <3!"

clean-text uses ftfy, unidecode, and numerous hand-crafted rules, i.e., regular expressions.

Installation

To install the package together with the GPL-licensed unidecode:

pip install clean-text[gpl]

If you prefer to avoid GPL-licensed code:

pip install clean-text

NB: This package is named clean-text and not cleantext.

If unidecode is not available, clean-text falls back to Python's unicodedata.normalize for transliteration. Transliteration to the closest ASCII symbols relies on manual mappings, e.g., ê to e. unidecode's mappings are superior, but unicodedata's are sufficient. However, you may want to disable this feature altogether, depending on your data and use case.
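The stdlib fallback can be sketched roughly as follows (a minimal illustration of the technique, not clean-text's actual implementation; the function name is hypothetical):

```python
import unicodedata

def to_ascii_fallback(text: str) -> str:
    # Decompose accented characters (NFKD), then drop the combining
    # marks that cannot be encoded as ASCII, e.g. "ê" -> "e" + U+0302 -> "e".
    return (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
```

Note that characters without a decomposition (e.g. "ß") are simply dropped by this approach, whereas unidecode maps "ß" to "ss" -- one source of the inconsistencies mentioned below.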

To be clear: processing text with and without unidecode produces inconsistent results.

Usage

from cleantext import clean

clean("some input",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=False,                  # replace all URLs with a special token
    no_emails=False,                # replace all email addresses with a special token
    no_phone_numbers=False,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=False,                # replace all digits with a special token
    no_currency_symbols=False,      # replace all currency symbols with a special token
    no_punct=False,                 # remove punctuations
    replace_with_punct="",          # instead of removing punctuations you may replace them
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    lang="en"                       # set to 'de' for German special handling
)

Carefully choose the arguments that fit your task. The default parameters are listed above.

You may also use individual cleaning functions on their own. For this, take a look at the source code.
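Conceptually, the token-replacement options (no_urls, replace_with_url, and friends) boil down to regex substitutions. A simplified sketch of the idea -- the pattern below is illustrative only and far less robust than the ones clean-text ships:

```python
import re

# Deliberately simple URL pattern; clean-text's real pattern is more thorough.
URL_RE = re.compile(r"https?://\S+")

def replace_urls(text: str, token: str = "<URL>") -> str:
    # Substitute every URL match with the replacement token.
    return URL_RE.sub(token, text)

replace_urls("see https://example.com for details")  # "see <URL> for details"
```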

Supported languages

So far, only English and German are fully supported, but clean-text should work for the majority of Western languages. If your language needs special handling, feel free to contribute. 🙃

Using clean-text with scikit-learn

There is also a scikit-learn-compatible API that you can use in your pipelines. All of the parameters above work here as well.

pip install clean-text[gpl,sklearn]   # with unidecode
pip install clean-text[sklearn]       # without unidecode
from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=False, lower=False)

cleaner.transform(['Happily clean your text!', 'Another Input'])
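The transformer follows scikit-learn's standard fit/transform contract. A minimal stand-in showing the pattern (this SimpleCleanTransformer is a hypothetical sketch, not the shipped class):

```python
class SimpleCleanTransformer:
    """Duck-typed scikit-learn-style transformer wrapping a cleaning function."""

    def __init__(self, clean_func=str.lower):
        self.clean_func = clean_func

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data.
        return self

    def transform(self, X):
        # Apply the cleaning function to every document.
        return [self.clean_func(text) for text in X]
```

Because it implements fit and transform, such an object can be placed in a sklearn Pipeline ahead of, e.g., a vectorizer.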

Development

Use poetry.

Contributing

If you have a question, have found a bug, or want to propose a new feature, take a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

If you don't like the output of clean-text, consider adding a test with your specific input and desired output.

Related Work

Generic text cleaning packages

Full-blown NLP libraries with some text cleaning

Remove or replace strings

Detect dates

Clean massive Common Crawl data

Acknowledgements

Built upon the work by Burton DeWilde for Textacy.

License

Apache
