
Python port of Moses tokenizer, truecaser and normalizer

Sacremoses

License

MIT License.

Install

pip install -U sacremoses

NOTE: Sacremoses supports only Python 3 as of sacremoses>=0.0.41. If you're using Python 2, the last compatible version is sacremoses==0.0.40.

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True


>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = mt.tokenize(sent)
>>> tokens == expected_tokens
True
>>> md.detokenize(tokens) == expected_detokens
True
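
The token list above contains entities such as `&apos;` and `&#124;` because special characters are XML-escaped during tokenization. The following is a minimal stand-alone sketch of that escape table, not sacremoses' own code; the names `MOSES_ESCAPES`, `escape_token` and `unescape_token` are hypothetical and exist only for this illustration.

```python
# Illustrative only: a stand-alone table of the Moses-style XML escapes that
# appear in the token list above. This is not code taken from sacremoses.
MOSES_ESCAPES = [
    ("&", "&amp;"),   # escaped first so the '&' of later entities is left alone
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_token(token):
    """Replace each special character with its entity form."""
    for char, entity in MOSES_ESCAPES:
        token = token.replace(char, entity)
    return token


def unescape_token(token):
    """Invert escape_token; '&amp;' is restored last to avoid double-unescaping."""
    for char, entity in reversed(MOSES_ESCAPES):
        token = token.replace(entity, char)
    return token


tokens = ["ain", "&apos;t", "&#124;", "&#91;", "&#93;", "&lt;", "&gt;", "&amp;"]
print([unescape_token(t) for t in tokens])
```

Applying the table in a fixed order matters: escaping `&` first and restoring it last keeps the round trip lossless even when the input already contains text that looks like an entity.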

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`)
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Load a model and truecase a string using the trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the ADVENTURES OF SHERLOCK HOLMES'
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True, use_known=True)
'the adventures of Sherlock Holmes'
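
To make the idea behind truecasing concrete, here is a heavily simplified toy, assuming that the most frequent casing of a word away from the start of a sentence is its "true" case. It is a sketch of the general technique, not the algorithm sacremoses implements; `train_casing` and `apply_casing` are hypothetical names, and the three training sentences are made up for the example.

```python
from collections import Counter, defaultdict


def train_casing(tokenized_docs):
    """Count surface forms per lowercased word, skipping sentence-initial tokens."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        for token in tokens[1:]:  # a sentence-initial capital is not informative
            counts[token.lower()][token] += 1
    # Keep only the most frequent surface form of each word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def apply_casing(tokens, model):
    """Map each token to its most frequent form; unknown tokens are left as-is."""
    return [model.get(token.lower(), token) for token in tokens]


docs = [
    "Mr. Sherlock Holmes was usually very late in the mornings".split(),
    "The adventures of Sherlock Holmes".split(),
    "He spoke of the adventures".split(),
]
model = train_casing(docs)
print(apply_casing("THE ADVENTURES OF SHERLOCK HOLMES".split(), model))
# → ['the', 'adventures', 'of', 'Sherlock', 'Holmes']
```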

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU “AS-IS.”')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
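
Typographic quotes are a typical target of punctuation normalization. The sketch below shows the general idea with a small, hypothetical character map; it assumes only a handful of replacements, whereas the real normalizer applies many more rules, and `PUNCT_MAP` and `normalize_punct` are names invented for this example.

```python
# Hypothetical, heavily reduced mapping for illustration only.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
})


def normalize_punct(text):
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(PUNCT_MAP)


print(normalize_punct("PROVIDED TO YOU \u201cAS-IS.\u201d"))
# → PROVIDED TO YOU "AS-IS."
```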

Usage (CLI)

Version 0.0.42 introduced a pipeline feature for the CLI. As a result, the following global options must be set before the commands are called:

  • language
  • processes
  • encoding
  • quiet

$ pip install -U "sacremoses>=0.0.42"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

Pipeline

Example to chain the following commands:

  • normalize with -c option to remove control characters.
  • tokenize with -a option for aggressive dash split rules.
  • truecase with -a option to indicate that model is for ASR
    • if big.truemodel exists, load the model with -m option,
    • otherwise train a model and save it with -m option to big.truemodel file.
  • redirect the output to the big.txt.norm.tok.true file.

$ cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
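
The same chain can be driven from Python by calling the CLI as a subprocess. This is a minimal sketch that assumes the `sacremoses` executable is on the PATH; the helper names `build_pipeline_command` and `run_pipeline` are hypothetical, and only flags shown in this README are used.

```python
import subprocess


def build_pipeline_command(lang="en", processes=4, model="big.truemodel"):
    """Return the argument list for the normalize -> tokenize -> truecase chain."""
    return [
        "sacremoses", "-l", lang, "-j", str(processes),  # global options first
        "normalize", "-c",             # remove control characters
        "tokenize", "-a",              # aggressive dash splits
        "truecase", "-a", "-m", model,  # ASR model, load or save at `model`
    ]


def run_pipeline(src_path, dst_path, **options):
    """Run the chain with src_path on stdin and dst_path on stdout."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        subprocess.run(
            build_pipeline_command(**options), stdin=src, stdout=dst, check=True
        )


print(" ".join(build_pipeline_command()))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` raises an exception if the command exits with a non-zero status. A call such as `run_pipeline("big.txt", "big.txt.norm.tok.true")` would then mirror the shell command above.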

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.


$ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
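
Protected patterns are regular expressions whose matches should pass through tokenization untouched. The toy below illustrates one way such protection can work: stash every match behind a placeholder, tokenize, then restore. It is a sketch under stated assumptions, not sacremoses' implementation; the function name, the `PROTECTED` placeholder, the naive tokenizer and the tag pattern are all hypothetical.

```python
import re


def tokenize_with_protection(text, patterns):
    """Split punctuation off words, leaving spans that match `patterns` intact."""
    protected = []

    def stash(match):
        # Remember the matched span and leave a placeholder in its place.
        protected.append(match.group(0))
        return f" PROTECTED{len(protected) - 1} "

    for pattern in patterns:
        text = re.sub(pattern, stash, text)

    # Naive stand-in tokenizer: put spaces around every punctuation character.
    text = re.sub(r"([^\w\s])", r" \1 ", text)

    # Swap each placeholder back for the span it replaced.
    return [
        protected[int(tok[len("PROTECTED"):])]
        if re.fullmatch(r"PROTECTED\d+", tok) else tok
        for tok in text.split()
    ]


print(tokenize_with_protection("See <b>this</b>, ok?", [r"</?\S+?>"]))
# → ['See', '<b>', 'this', '</b>', ',', 'ok', '?']
```

Without the pattern, the naive tokenizer would break `<b>` into `<`, `b` and `>`.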

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

$ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecase

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecase

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the file are headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalize

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize number.
  -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]
