  • Stars
    431
  • Rank 100,866 (Top 2 %)
  • Language
    Python
  • License
    MIT License
  • Created about 10 years ago
  • Updated about 3 years ago

Repository Details

Simple conversion and localization between simplified and traditional Chinese using tables from MediaWiki.

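Simple Simplified/Traditional Chinese Conversion

Documentation

zhconv provides Simplified/Traditional Chinese conversion by maximum forward matching, based on the MediaWiki and OpenCC vocabulary tables, with support for regional vocabulary conversion. Supported locales: zh-cn, zh-tw, zh-hk, zh-sg, zh-hans, zh-hant. It works on both Python 2 and Python 3.

If you need higher accuracy, see OpenCC and opencc-python.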
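
To make "maximum forward matching" concrete, here is a minimal sketch of the general technique. It is not zhconv's actual implementation: the function name `convert_mfm` and the tiny `TOY_TABLE` are assumptions made up for illustration, and the real MediaWiki/OpenCC tables are far larger.

```python
def convert_mfm(text, table):
    """Convert text by maximum forward matching against a phrase table."""
    max_len = max(map(len, table)) if table else 1
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate at position i first, then shrink it.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # Nothing in the table starts here: copy the character as-is.
            out.append(text[i])
            i += 1
    return ''.join(out)


# Hypothetical toy table, NOT the real conversion data.
TOY_TABLE = {
    '幹': '干',
    '什麼': '什么',
    '麼': '么',
}

print(convert_mfm('我幹什麼不干你事。', TOY_TABLE))
```

Because longer entries are tried before shorter ones, a phrase entry such as 什麼 takes precedence over the single-character entry 麼 wherever both could apply.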

>>> print(convert(u'我幹什麼不干你事。', 'zh-cn'))
我干什么不干你事。
>>> print(convert(u'人体内存在很多微生物', 'zh-tw'))
人體內存在很多微生物

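Note that zh-hans and zh-hant convert only between Simplified and Traditional characters; they do not convert regional vocabulary.

MediaWiki's manual conversion syntax is fully supported: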

>>> print(convert_for_mw(u'在现代,机械计算-{}-机的应用已经完全被电子计算-{}-机所取代', 'zh-hk'))
在現代,機械計算機的應用已經完全被電子計算機所取代
>>> print(convert_for_mw(u'-{zh-hant:資訊工程;zh-hans:计算机工程学;}-是电子工程的一个分支,主要研究计算机软硬件和二者间的彼此联系。', 'zh-tw'))
資訊工程是電子工程的一個分支,主要研究計算機軟硬體和二者間的彼此聯繫。
>>> print(convert_for_mw(u'張國榮曾在英國-{zh:利兹;zh-hans:利兹;zh-hk:列斯;zh-tw:里茲}-大学學習。', 'zh-sg'))
张国荣曾在英国利兹大学学习。
>>> print(convert_for_mw('毫米(毫公分),符號mm,是長度單位和降雨量單位,-{zh-hans:台湾作-{公釐}-或-{公厘}-;zh-hant:港澳和大陸稱為-{毫米}-(台灣亦有使用,但較常使用名稱為毫公分);zh-mo:台灣作-{公釐}-或-{公厘}-;zh-hk:台灣作-{公釐}-或-{公厘}-;}-。', 'zh-cn'))
毫米(毫公分),符号mm,是长度单位和降雨量单位,台湾作公釐或公厘。

Other advanced word-conversion syntax is supported as well.
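
The following sketch shows how the simplest forms of the `-{...}-` markup in the examples above could be resolved for a target locale. It is an illustration only, not zhconv's `convert_for_mw`: the `FALLBACK` order and the name `resolve_markup` are assumptions, nested markup (as in the last example above) and flag prefixes are not handled, and the text outside the markup is left unconverted.

```python
import re

# Hypothetical fallback order per locale; an assumption for this sketch.
FALLBACK = {
    'zh-cn': ('zh-cn', 'zh-hans', 'zh'),
    'zh-sg': ('zh-sg', 'zh-hans', 'zh'),
    'zh-tw': ('zh-tw', 'zh-hant', 'zh'),
    'zh-hk': ('zh-hk', 'zh-hant', 'zh'),
    'zh-hans': ('zh-hans', 'zh'),
    'zh-hant': ('zh-hant', 'zh'),
}

MARKUP = re.compile(r'-\{(.*?)\}-', re.S)              # one flat -{...}- span
RULE = re.compile(r'\s*(zh(?:-[a-z]+)?)\s*:\s*(.*)', re.S)  # "zh-xx:text"


def resolve_markup(text, locale):
    """Replace each flat -{...}- span with the variant chosen for locale."""
    def pick(match):
        body = match.group(1)
        variants = {}
        for part in body.split(';'):
            rule = RULE.fullmatch(part)
            if rule:
                variants[rule.group(1)] = rule.group(2).strip()
        if not variants:
            # -{}- (empty separator) or -{literal text}-: emit the body as-is.
            return body
        for code in FALLBACK[locale]:
            if code in variants:
                return variants[code]
        # No variant on the fallback path: use the first one listed.
        return next(iter(variants.values()))
    return MARKUP.sub(pick, text)


print(resolve_markup('-{zh-hant:資訊工程;zh-hans:计算机工程学;}-是电子工程的一个分支', 'zh-tw'))
print(resolve_markup('英國-{zh:利兹;zh-hans:利兹;zh-hk:列斯;zh-tw:里茲}-大学', 'zh-hk'))
```

A full converter would run the ordinary table-based conversion on the text outside the markup and protect the resolved spans from being converted again.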

The conversion dictionary can be obtained from includes/ZhConversion.php in the MediaWiki source package; convmwdict.py converts it to JSON format.
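
As a rough idea of what such a conversion involves, the sketch below extracts `'key' => 'value'` pairs from PHP array literals and dumps them as JSON. It is not the code of convmwdict.py: the function name `php_tables_to_dict` is hypothetical, `SAMPLE` is a made-up fragment that only mimics the general shape of a PHP table file, and escaped quotes inside PHP strings are not handled.

```python
import json
import re

# A PHP array entry such as  '幹' => '干',  (no escaped quotes supported).
ENTRY = re.compile(r"'([^']*)'\s*=>\s*'([^']*)'")
# The start of a table such as  $name = [   or  $name = array(
TABLE = re.compile(r"\$(\w+)\s*=\s*(?:\[|array\s*\()")


def php_tables_to_dict(source):
    """Collect every PHP array in source as {table_name: {key: value}}."""
    tables = {}
    current = None
    for line in source.splitlines():
        start = TABLE.search(line)
        if start:
            current = tables.setdefault(start.group(1), {})
        if current is not None:
            for key, value in ENTRY.findall(line):
                current[key] = value
    return tables


# Made-up sample input for illustration, NOT the real file contents.
SAMPLE = """<?php
class ZhConversion {
    public static $zh2Hant = [
        '么' => '麼',
        '干什么' => '幹什麼',
    ];
    public static $zh2Hans = [
        '幹' => '干',
    ];
}
"""

tables = php_tables_to_dict(SAMPLE)
print(json.dumps(tables, ensure_ascii=False, sort_keys=True, indent=2))
```

Passing `ensure_ascii=False` keeps the Chinese characters readable in the JSON output instead of escaping them as `\uXXXX` sequences.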

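The code is licensed under the MIT license; the conversion tables come from MediaWiki and are therefore licensed under GPLv2+.

Using this project on a Spark cluster

On a distributed cluster, environment restrictions may make it inconvenient to install this project on every machine. In that case you can upload the project's egg file on its own from the driver machine; it does not depend on any other project.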

# python setup.py bdist_egg

# ls dist
zhconv-1.2.2-py2.7.egg

When running locally, you can use it directly after executing sys.path.append('PATH_TO_ZHCONV/zhconv-1.2.2-py2.7.egg').
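
The reason appending an egg to sys.path works is that an egg is a zip archive, and Python can import pure-Python packages directly from zip files. The sketch below demonstrates this with a throwaway stand-in archive; the package name `toyconv` and its trivial `convert` function are hypothetical and exist only for the demonstration. On a Spark cluster, the usual way to ship such an archive to the executors is PySpark's `SparkContext.addPyFile`, which is not exercised here.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a stand-in archive; "toyconv" is a hypothetical package name.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, 'toyconv-0.1-py3.egg')
with zipfile.ZipFile(archive, 'w') as egg:
    egg.writestr(
        'toyconv/__init__.py',
        "def convert(s, locale):\n    return s\n",
    )

# Same idea as sys.path.append('PATH_TO_ZHCONV/zhconv-1.2.2-py2.7.egg').
sys.path.append(archive)
importlib.invalidate_caches()

toyconv = importlib.import_module('toyconv')
print(toyconv.convert('abc', 'zh-cn'))
print(toyconv.__file__)  # the module path points inside the archive
```

This only works for archives that contain pure-Python code and whose data files are read in a zip-aware way.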

Tools

EPUB e-book Simplified/Traditional conversion: python3 epubzhconv.py input.epub output.epub zh-{cn,tw}
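
For orientation, an EPUB file is a zip archive of XHTML and metadata files, so a converter of this kind can rewrite the text entries and copy everything else. The sketch below shows that idea under stated assumptions; it is not the code of epubzhconv.py. The name `convert_epub`, the `TEXT_SUFFIXES` list, and the toy replacement function are all hypothetical. With the real library, one would presumably pass something like `lambda s: zhconv.convert(s, 'zh-tw')` as the `convert` argument.

```python
import io
import zipfile

# File types treated as convertible text; an assumption for this sketch.
TEXT_SUFFIXES = ('.xhtml', '.html', '.htm', '.opf', '.ncx')


def convert_epub(src, dst, convert):
    """Copy the EPUB at src to dst, passing text entries through convert."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, 'w') as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.lower().endswith(TEXT_SUFFIXES):
                data = convert(data.decode('utf-8')).encode('utf-8')
            # Reusing the ZipInfo keeps entry order and compression type;
            # EPUB expects "mimetype" first and stored uncompressed.
            zout.writestr(info, data)


# Build a tiny in-memory stand-in for an EPUB file.
src = io.BytesIO()
with zipfile.ZipFile(src, 'w') as z:
    z.writestr('mimetype', 'application/epub+zip', zipfile.ZIP_STORED)
    z.writestr('OEBPS/ch1.xhtml', '<p>我幹什麼</p>', zipfile.ZIP_DEFLATED)

dst = io.BytesIO()
# Toy stand-in for a real conversion function.
convert_epub(src, dst, lambda s: s.replace('幹', '干').replace('麼', '么'))

with zipfile.ZipFile(dst) as z:
    print(z.namelist())
    print(z.read('OEBPS/ch1.xhtml').decode('utf-8'))
```

The sketch assumes every text entry is UTF-8 encoded, which is common for EPUB content but not guaranteed.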


Simple Chinese Conversion Library

zhconv converts between Simplified and Traditional Chinese using maximum forward matching. The conversion tables are based on those of MediaWiki and OpenCC. It accepts the locales zh-cn, zh-tw, zh-hk, zh-sg, zh-hans, and zh-hant, converts regional vocabulary for the region-specific ones, and runs on both Python 2 and Python 3.

Example:

>>> print(convert(u'我幹什麼不干你事。', 'zh-cn'))
我干什么不干你事。
>>> print(convert(u'人体内存在很多微生物', 'zh-tw'))
人體內存在很多微生物

If zh-hans or zh-hant is used, then regional vocabulary conversion will be disabled.
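
One way to picture the difference between script-only and regional conversion is as a script-level table with an optional regional overlay on top. The sketch below illustrates that idea only; the table contents, the `PARENT` mapping, and the name `table_for` are hypothetical and do not reflect how zhconv stores or combines its data.

```python
# Hypothetical toy tables, NOT the real conversion data.
SCRIPT = {
    'zh-hant': {'软件': '軟件'},   # script conversion only
    'zh-hans': {'軟體': '软体'},
}
REGION = {
    'zh-tw': {'软件': '軟體'},     # Taiwan regional wording for "software"
    'zh-cn': {'軟體': '软件'},
}
# Which script-level table each regional locale builds on (an assumption).
PARENT = {
    'zh-tw': 'zh-hant', 'zh-hk': 'zh-hant',
    'zh-cn': 'zh-hans', 'zh-sg': 'zh-hans',
}


def table_for(locale):
    """Return the script table, overlaid with regional entries if any."""
    script = PARENT.get(locale, locale)
    table = dict(SCRIPT.get(script, {}))
    table.update(REGION.get(locale, {}))
    return table


print(table_for('zh-hant')['软件'])  # script-only result
print(table_for('zh-tw')['软件'])    # regional result
```

With zh-hant the lookup only changes the characters, while zh-tw also swaps in the regional term.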

Documentation is available in Chinese.

The code is licensed under MIT, while the conversion table is licensed under GPLv2+.

More Repositories

1

ptproxy

Turn any pluggable transport for Tor into an obfuscating TCP tunnel.
Python
160
star
2

wqxt_pdf

WQXT PDF Downloader
Python
96
star
3

tms2geotiff

Download tiles from Tile Map Server (online maps) and make a large geo-referenced image
Python
82
star
4

tg-export

Export Telegram messages.
Python
57
star
5

htmllisting-parser

Python parser for Apache/nginx-style HTML directory listing
Python
32
star
6

tessdata_chi

Retrained Tesseract OCR model for Chinese
Python
30
star
7

cntms

Tile Map Server reverse proxy with coordinates regularization
Python
27
star
8

tg-irc-relay

Relay between Telegram groups and IRC channels.
Python
23
star
9

orizonhub

Connect groups across protocols with logging and command support.
Python
14
star
10

NiceChordNotes

Notes for NiceChord.com tutorials
Makefile
12
star
11

tg-chatdig

Dig into long and boring Telegram group chat logs.
Python
10
star
12

ffplaylist

Stream a dynamic playlist simply using FFmpeg
Python
10
star
13

chinesename

Generate Chinese names according to a statistical model.
Python
9
star
14

trustedsleepbot

This bot records one's sleep time via online status on Telegram.
Python
8
star
15

pdfreduce

Reduce scanned PDF size by converting images to grey/black&white.
Python
8
star
16

pywebapps

Various web apps mainly written in Python.
Python
7
star
17

fossil-tools

Some tools for use with fossil scm.
Shell
7
star
18

stamico

Stamico, a hand-written font.
Python
6
star
19

tjqh

国家统计局统计用区划代码 / China administrative division code
Python
6
star
20

fossilpy

Simple python library for reading Fossil repositories
Python
6
star
21

maxpacker

A flexible backup tool, which can filter and pack files into independent partitions.
Python
4
star
22

leaflet.ChineseCRS

Leaflet plugin for using Chinese tile providers without offset.
JavaScript
4
star
23

quackalike

A script that attempts to generate text that looks like the training material.
Python
2
star
24

coinpricebot

Telegram bot to check cryptocoin prices
Python
2
star
25

googletest

Test available Google IPs.
Python
2
star
26

calibre-extract-isbn

Python 3 fork of the Extract ISBN plugin for Calibre
Python
2
star
27

base1k

Binary to text codecs using the most common 1k or 4k unambiguous Chinese characters.
Python
2
star
28

smtpinyin

Pinyin IME using Statistical Machine Translation
Python
2
star
29

whisper_vad

Whisper.cpp Speech-to-text with Voice Activity Detection
C
2
star
30

pyimpsort

Sort the imports in Python scripts by length.
Python
1
star
31

webhookbot

Simple Telegram bot to accept Webhook requests.
Python
1
star
32

wikihistory

Generate a history book from Wikipedia dumps.
Python
1
star
33

fundanalyze

Analyze and make a portfolio of funds.
Python
1
star
34

mocorpus

A multilingual corpus collected from gettext .mo files in Debian.
Python
1
star
35

fxcalc

A simple calculator.
Python
1
star
36

stickerindexbot

Sticker index bot for Telegram
Python
1
star
37

sendtext

AutoHotKey script for pasting into any app using simulated keystrokes.
AutoHotkey
1
star