Han character library for CJKV languages

===========================
Installing and Using Cjklib
===========================

.. contents::

Introduction
============
Cjklib provides language routines for Han characters, the characters derived
from Chinese script and known as Hanzi, Kanji, Hanja and chu Han in Chinese,
Japanese, Korean and Vietnamese respectively. They are used to write Chinese
and Japanese, infrequently Korean, and formerly Vietnamese. The library covers
character pronunciations, radicals, glyph components, stroke decomposition and
variant information.

Dependencies
============
- Python_ 2.4 or above (Python 3 is currently not supported)
- SQLite_ 3+
- SQLAlchemy_ 0.5+
- pysqlite2_ (already ships with Python 2.5 and above)

Alternatively for MySQL as backend:

- MySQL_ 5+
- MySQL-Python_

.. _Python: http://www.python.org/download/
.. _SQLite: http://www.sqlite.org/download.html
.. _MySQL: http://www.mysql.com/downloads/mysql/
.. _SQLAlchemy: http://www.sqlalchemy.org/download.html
.. _pysqlite2: http://code.google.com/p/pysqlite/downloads/list
.. _MySQL-Python: http://sourceforge.net/projects/mysql-python/

Installing
==========

Windows
-------
Install cjklib using the provided ``.exe`` installer. Make sure the
dependencies listed above are installed.

Three scripts ``cjknife.exe``, ``buildcjkdb.exe``, and ``installcjkdict.exe``
will be added to the Python ``Scripts`` sub-directory. Make sure this directory
is included in your ``PATH`` environment variable to access these programs from
the command line.
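
As a quick check that the ``Scripts`` directory is on your ``PATH``, the
following minimal sketch looks up the three script names mentioned above. It
is a standalone helper written for Python 3 and is not part of cjklib; the
function name ``find_scripts`` is made up for illustration.

```python
import shutil

# Script names taken from the paragraph above. On Windows, shutil.which()
# also tries the extensions listed in PATHEXT, so ".exe" can be omitted.
CJKLIB_SCRIPTS = ("cjknife", "buildcjkdb", "installcjkdict")


def find_scripts(names=CJKLIB_SCRIPTS, path=None):
    """Map each script name to its full path, or None if it is not found.

    `path` defaults to the PATH environment variable of the current process.
    """
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in find_scripts().items():
        print("%-16s %s" % (name, location or "not found on PATH"))
```

Any script reported as not found means the ``Scripts`` directory still has to
be added to ``PATH``.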

CJK dictionaries are not included by default. To install one, run the
following (an Internet connection is required) from the root directory of the
source package::

    $ installcjkdict CEDICT

This downloads CEDICT, creates a SQLite database file and installs it under
the directory given by the ``APPDATA`` environment variable, e.g.
``C:\windows\profiles\MY_USER\Application Data\cjklib``. To install a different
dictionary, replace ``CEDICT`` with any other supported name (EDICT, HanDeDict,
CFDICT or CEDICTGR).

Unix
----
If you are installing from the source package, deploy the library on your
system::

    $ sudo python setup.py install

Also make sure the dependencies listed above are installed. CJK dictionaries
are not included by default. To install one, run the following (an Internet
connection is required)::

    $ sudo installcjkdict CEDICT

This downloads CEDICT, creates a SQLite database file and installs it to
``/usr/local/share/cjklib``. To install a different dictionary, replace
``CEDICT`` with any other supported name (EDICT, HanDeDict, CFDICT or CEDICTGR).


Documentation & Usage
=====================
Documentation_ is available online. Also see the `project page`_ and its wiki.
There is a small command line tool ``cjknife`` that offers some of the library's
functions. See ``cjknife --help`` for an overview.

.. _Documentation: http://cjklib.org/
.. _project page: http://code.google.com/p/cjklib/

Examples
--------

- Get stroke order of characters::

    >>> from cjklib import characterlookup
    >>> cjk = characterlookup.CharacterLookup('C')
    >>> cjk.getStrokeOrder(u'说')
    [u'㇔', u'㇊', u'㇔', u'㇒', u'㇑', u'㇕', u'㇐', u'㇓', u'㇟']

- Access a dictionary (here using Jim Breen's EDICT)::

    >>> from cjklib.dictionary import EDICT
    >>> d = EDICT()
    >>> d.getForTranslation('Tokyo')
    [EntryTuple(Headword=u'東京', Reading=u'とうきょう',
    Translation=u'/(n) Tokyo (current capital of Japan)/(P)/')]


Database
========
Packaged versions of the library ship with a pre-built SQLite database file.
You can, however, easily rebuild the database yourself.

First download the newest Unihan file::

    $ wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip

Then start the build process::

    $ sudo buildcjkdb -r build cjklibData

SQLite
------
By default, SQLite has no Unicode support for string operations. Optionally,
the ICU library can be compiled in to handle alphabetic non-ASCII characters.
Cjklib can register its own Unicode functions if ICU support is missing;
queries with ``LIKE`` then go through the function ``lower()``. This
compatibility mode slows queries down, and because dictionaries like EDICT or
CEDICT do not need it, it is disabled by default. See ``cjklib.conf`` to enable
it.
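
The general technique behind such a compatibility mode can be sketched with
Python's standard ``sqlite3`` module. This is an illustration only and not
cjklib's actual implementation: the helper names and the ``Words`` table are
hypothetical, and the code is written for Python 3.

```python
import sqlite3


def unicode_lower(value):
    """Lowercase text with Python's Unicode-aware str.lower().

    Non-text values (including NULL) are passed through unchanged.
    """
    return value.lower() if isinstance(value, str) else value


def connect_with_unicode_lower(database=":memory:"):
    """Open a connection whose one-argument lower() handles non-ASCII text."""
    conn = sqlite3.connect(database)
    # Application-defined functions override built-ins of the same name
    # and argument count, for this connection only.
    conn.create_function("lower", 1, unicode_lower)
    return conn


conn = connect_with_unicode_lower()
conn.execute("CREATE TABLE Words (Headword TEXT)")  # hypothetical table
conn.execute("INSERT INTO Words VALUES (?)", ("Ärger",))

# Lowercase both sides so that the match ignores case for non-ASCII
# letters as well.
rows = conn.execute(
    "SELECT Headword FROM Words WHERE lower(Headword) LIKE lower(?)",
    ("ÄRGER",),
).fetchall()
print(rows)
```

Every compared value now passes through a Python callback, which illustrates
why a mode of this kind costs performance.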

MySQL
-----
With MySQL 5, the following ``CREATE`` command creates a database with ``utf8``
as character set and the binary collation ``utf8_bin``
(from 5.5.3 on, MySQL supports full Unicode with character set
``utf8mb4`` and collation ``utf8mb4_bin``)::

    CREATE DATABASE cjklib DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

You might need to set access rights, too (substitute ``user_name`` and
``host_name``)::

    GRANT ALL ON cjklib.* TO 'user_name'@'host_name';

Now update the settings in ``cjklib.conf``.

MySQL versions before 5.5.3 don't support full UTF-8. Their ``utf8`` character
set stores at most 3 bytes per character, so characters outside the Basic
Multilingual Plane (BMP) can't be encoded. Building the Unihan database might
thus result in warnings, and characters above U+FFFF can't be built at all. In
that case, disable building the full character range by setting ``wideBuild``
to ``False`` in ``cjklib.conf`` before building. Alternatively, pass
``--wideBuild=False`` to ``buildcjkdb``.
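
The 3-byte limit can be illustrated with a short standalone sketch (Python 3,
independent of cjklib; the helper names are made up for illustration). It
compares a BMP character with U+20000, the first code point of the CJK
Unified Ideographs Extension B block.

```python
def utf8_length(char):
    """Return the number of bytes the character occupies in UTF-8."""
    return len(char.encode("utf-8"))


def is_bmp(char):
    """Return True if the character lies in the Basic Multilingual Plane."""
    return ord(char) <= 0xFFFF


# A common BMP character and a character from a supplementary plane.
for char in ("说", "\U00020000"):
    print("U+%04X  bmp=%s  utf8_bytes=%d"
          % (ord(char), is_bmp(char), utf8_length(char)))
```

Characters for which ``is_bmp`` returns ``False`` need 4 bytes in UTF-8 and
therefore do not fit into a 3-byte ``utf8`` column.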


Contact
=======
For help or discussions on cjklib, join the `cjklib-devel mailing list
<http://groups.google.com/group/cjklib-devel>`_.

Please report bugs to the `project's bug tracker
<http://code.google.com/p/cjklib/issues/list>`_.
