  • Stars: 147
  • Rank 249,820 (Top 5 %)
  • Language: Python
  • License: MIT License
  • Created over 11 years ago
  • Updated over 4 years ago

Repository Details

A toolbox for working with the Chinese language in Python

Mafan - Toolkit for working with Chinese in Python

Mafan is a collection of Python tools for making your life working with Chinese so much less 麻烦 (mafan, i.e. troublesome).

Contained in here is an ever-growing collection of loosely related tools, broken down into several files. Installation and each of the main files are covered below.

installation

Install through pip:

pip install mafan

encodings

The encoding module (imported as mafan.encoding) contains functions for converting files from any number of 麻烦 character encodings to something more sane (utf-8, by default). For example:

from mafan import encoding

filename = 'ugly_big5.txt' # name or path of file as string
encoding.convert(filename) # creates a file with name 'ugly_big5_utf-8.txt' in glorious utf-8 encoding
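
To make the idea concrete, here is a minimal standard-library sketch of the same kind of conversion. It is not mafan's implementation: the function name convert_to_utf8 is hypothetical, and unlike encoding.convert above it expects you to state the source encoding rather than working it out for you. It reuses the '_utf-8' output naming from the example above.

```python
import io
import os


def convert_to_utf8(path, source_encoding):
    """Re-encode a text file as UTF-8 and return the path of the new file.

    Hypothetical helper for illustration only; it is not part of mafan.
    """
    root, ext = os.path.splitext(path)
    target = root + "_utf-8" + ext
    # newline="" keeps the original line endings untouched.
    with io.open(path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with io.open(target, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return target
```

Calling convert_to_utf8('ugly_big5.txt', 'big5') would write 'ugly_big5_utf-8.txt' next to the original file.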

text

text contains some functions for working with strings, such as detecting English in a string or checking whether a string has Chinese punctuation. Check out text.py for all the latest goodness. It also contains a handy wrapper for the jianfan package for converting between simplified and traditional characters:

>>> from mafan import simplify, tradify
>>> string = u'这是麻烦啦'
>>> print tradify(string) # convert string to traditional
這是麻煩啦
>>> print simplify(tradify(string)) # convert back to simplified
这是麻烦啦
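
If you are curious what such a conversion boils down to, the sketch below maps characters through a lookup table. It is a toy under loud assumptions: the two-entry table and the names tradify_sketch and simplify_sketch are made up for this example, and it says nothing about how jianfan or mafan store their data. Real conversion is also not always one-to-one (simplified 发 corresponds to both 發 and 髮), which a plain table like this cannot resolve.

```python
# Tiny illustrative table; a real converter needs thousands of entries.
SIMPLIFIED_TO_TRADITIONAL = {u"这": u"這", u"烦": u"煩"}
TRADITIONAL_TO_SIMPLIFIED = {
    trad: simp for simp, trad in SIMPLIFIED_TO_TRADITIONAL.items()
}


def convert_chars(text, table):
    """Replace each character found in the table; leave the rest alone."""
    return u"".join(table.get(ch, ch) for ch in text)


def tradify_sketch(text):
    # Hypothetical stand-in, not mafan's tradify.
    return convert_chars(text, SIMPLIFIED_TO_TRADITIONAL)


def simplify_sketch(text):
    # Hypothetical stand-in, not mafan's simplify.
    return convert_chars(text, TRADITIONAL_TO_SIMPLIFIED)
```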

The has_punctuation and contains_latin functions are useful for checking whether a string is really Chinese text, or just contains some Chinese characters alongside other content:

>>> from mafan import text
>>> text.has_punctuation(u'这是麻烦啦') # check for any Chinese punctuation (full-stops, commas, quotation marks, etc)
False
>>> text.has_punctuation(u'这是麻烦啦.')
False
>>> text.has_punctuation(u'这是麻烦啦。')
True
>>> text.contains_latin(u'这是麻烦啦。')
False
>>> text.contains_latin(u'You are麻烦啦。')
True
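
For a rough idea of what checks like these involve, here is a small self-contained sketch. The function names and the punctuation list are assumptions chosen for this example; mafan's own list in text.py may well differ. Note that an ASCII full stop is deliberately not counted, which matches the behaviour shown above.

```python
import re

# A hand-picked set of CJK and fullwidth punctuation marks (an assumption,
# not mafan's actual list).
CHINESE_PUNCTUATION = u"。，、；：？！（）《》【】「」『』"
LATIN_LETTER = re.compile(u"[A-Za-z]")


def has_chinese_punctuation(text):
    """True if any character is in the Chinese punctuation set above."""
    return any(ch in CHINESE_PUNCTUATION for ch in text)


def has_latin_letters(text):
    """True if the text contains at least one unaccented Latin letter."""
    return LATIN_LETTER.search(text) is not None
```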

You can also test whether sentences or documents use simplified characters, traditional characters, both or neither:

>>> import mafan
>>> from mafan import text
>>> text.is_simplified(u'这是麻烦啦')
True
>>> text.is_traditional(u'Hello,這是麻煩啦') # ignores non-chinese characters
True

# Or done another way:
>>> text.identify(u'这是麻烦啦') is mafan.SIMPLIFIED
True
>>> text.identify(u'這是麻煩啦') is mafan.TRADITIONAL
True
>>> text.identify(u'这是麻烦啦! 這是麻煩啦') is mafan.BOTH
True
>>> text.identify(u'This is so mafan.') is mafan.NEITHER # or None
True
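
The logic behind this kind of identification can be sketched with two sets of characters that exist in only one of the two scripts. Everything here is an assumption for illustration: the constants are placeholder strings rather than mafan's values, the character sets hold just two entries each, and how the real library treats Chinese text made up only of characters shared by both scripts is not covered.

```python
# Placeholder constants; mafan defines its own.
SIMPLIFIED, TRADITIONAL, BOTH, NEITHER = (
    "simplified", "traditional", "both", "neither")

# Characters that appear in only one script (tiny sample for illustration).
SIMPLIFIED_ONLY = set(u"这烦")
TRADITIONAL_ONLY = set(u"這煩")


def identify_sketch(text):
    """Classify text by which script-specific characters it contains."""
    chars = set(text)
    has_simplified = bool(chars & SIMPLIFIED_ONLY)
    has_traditional = bool(chars & TRADITIONAL_ONLY)
    if has_simplified and has_traditional:
        return BOTH
    if has_simplified:
        return SIMPLIFIED
    if has_traditional:
        return TRADITIONAL
    return NEITHER
```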

The identification functionality is a very thin wrapper around Thomas Roten's hanzidentifier, which is included as part of mafan.

Another function that comes pre-built into Mafan is split_text, which tokenizes Chinese sentences into words:

>>> from mafan import split_text
>>> split_text(u"這是麻煩啦")
[u'\u9019', u'\u662f', u'\u9ebb\u7169', u'\u5566']
>>> print ' '.join(split_text(u"這是麻煩啦"))
這 是 麻煩 啦

You can also optionally pass the boolean include_part_of_speech parameter to get tagged words back:

>>> split_text(u"這是麻煩啦", include_part_of_speech=True)
[(u'\u9019', 'r'), (u'\u662f', 'v'), (u'\u9ebb\u7169', 'x'), (u'\u5566', 'y')]
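
To show what dictionary-based word splitting looks like in its simplest form, here is a sketch of forward maximum matching. This is a swapped-in textbook technique, not a description of what split_text does internally, and the function name, the one-word dictionary and the max_word_length parameter are all assumptions for this example.

```python
def forward_max_match(text, dictionary, max_word_length=4):
    """Greedily take the longest dictionary word at each position.

    Characters that start no known word are emitted on their own.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_length, len(text) - i)
        # Try the longest candidate first and shrink until something fits.
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing only 麻煩, forward_max_match(u"這是麻煩啦", {u"麻煩"}) splits the sentence into 這, 是, 麻煩 and 啦. Greedy matching is easy to follow but can pick the wrong split when words overlap, which is why practical segmenters also use word frequencies.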

pinyin

pinyin contains functions for working with or converting between pinyin formats. At the moment, the only function in there converts numbered pinyin to pinyin with the correct tone marks. For example:

>>> from mafan import pinyin
>>> print pinyin.decode("ni3hao3")
nǐhǎo
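
The conversion follows the usual tone-mark placement rules, which can be sketched as below. This is an independent illustration, not mafan's pinyin.decode: the names mark_syllable and decode_numbered are hypothetical, only lowercase input is handled, and tone 5 is assumed to mean the neutral tone (no mark). The rules used are: 'a' or 'e' takes the mark if present, 'ou' marks the 'o', otherwise the last vowel is marked; 'v' and 'u:' are read as 'ü'.

```python
import re

TONE_MARKS = {
    u"a": u"āáǎà", u"e": u"ēéěè", u"i": u"īíǐì",
    u"o": u"ōóǒò", u"u": u"ūúǔù", u"ü": u"ǖǘǚǜ",
}


def mark_syllable(syllable, tone):
    """Put the tone mark for tones 1-4 on the right vowel of one syllable."""
    syllable = syllable.replace(u"u:", u"ü").replace(u"v", u"ü")
    if tone == 5:  # neutral tone: no mark
        return syllable
    if u"a" in syllable:
        target = syllable.index(u"a")
    elif u"e" in syllable:
        target = syllable.index(u"e")
    elif u"ou" in syllable:
        target = syllable.index(u"ou")
    else:
        vowels = [i for i, ch in enumerate(syllable) if ch in TONE_MARKS]
        if not vowels:
            return syllable
        target = vowels[-1]
    marked = TONE_MARKS[syllable[target]][tone - 1]
    return syllable[:target] + marked + syllable[target + 1:]


def decode_numbered(text):
    """Convert every 'letters + digit' run, e.g. 'ni3hao3', in the text."""
    return re.sub(
        u"([a-zü:]+)([1-5])",
        lambda m: mark_syllable(m.group(1), int(m.group(2))),
        text,
    )
```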

traditional characters

If you want to be able to use split_text on traditional characters, you can make use of one of two options:

  • Either set an environment variable, MAFAN_DICTIONARY_PATH, to the absolute path to a local copy of this dictionary file,
  • or install the mafan_traditional convenience package: pip install mafan_traditional. If this package is installed and available, mafan will default to use this extended dictionary file.
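
For the first option, the variable can also be set from Python itself. The snippet below is only a sketch: the path is a hypothetical placeholder, and since it is not stated here when mafan reads the variable, setting it before importing mafan is the cautious order.

```python
import os

# Hypothetical location; point this at wherever your local copy actually lives.
dictionary_path = os.path.abspath(os.path.join("data", "traditional_dict.txt"))

# MAFAN_DICTIONARY_PATH must hold an absolute path, as described above.
os.environ["MAFAN_DICTIONARY_PATH"] = dictionary_path

# Import mafan only after this point (assumption: the variable may be read
# at import time), e.g.: from mafan import split_text
```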

Contributors:

Any contributions are very welcome!
