  • Stars: 243
  • Rank: 166,489 (Top 4 %)
  • Language: C
  • License: Other
  • Created over 12 years ago
  • Updated almost 12 years ago

Repository Details

Yet another Chinese word segmentation package, based on character-based tagging heuristics and the CRF algorithm

GkSeg: yet another Chinese word segmentation package

GkSeg is a Chinese word segmentation package shipped by Guokr.com. It is based on character-based tagging heuristics and the CRF algorithm.

Currently it supports only the Linux platform.

Features

  • Precise: > 94%
  • Scope: modern Chinese text, and even classical Chinese text (文言文)
  • Automatic term extraction: it can extract important terms from the text
  • No dictionaries: see the section on character-based tagging heuristics
  • Good performance: 4 times slower than mmseg, but with more features
  • A training tool for the CRF model is shipped in the same package

Character-based tagging heuristics

The character-based tagging heuristic was introduced by N. Xue and others, and published at SIGHAN 2002 [Xue et al., 2002].

The basic idea is to mark each character in a sentence with its kind:

  • b: beginning character of a word
  • m: middle character of a word
  • e: end character of a word
  • s: single character that forms a word by itself

The tagged corpus is then used to train the segmentation program.
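The tagging scheme can be sketched in a few lines of Python. This is only an illustration of the b/m/e/s idea, not code from GkSeg; the function names `words_to_tags` and `tags_to_words` and the sample sentence are hypothetical.

```python
def words_to_tags(words):
    """Map a list of segmented words to (character, tag) pairs using b/m/e/s."""
    pairs = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            pairs.append((word, "s"))       # single-character word
        else:
            pairs.append((word[0], "b"))    # first character
            for ch in word[1:-1]:
                pairs.append((ch, "m"))     # middle characters, if any
            pairs.append((word[-1], "e"))   # last character
    return pairs


def tags_to_words(pairs):
    """Rebuild words from (character, tag) pairs; a word closes on 'e' or 's'."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("e", "s"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


# Hand-made segmentation used only as an example.
sample = ["我", "是", "中国人"]
tagged = words_to_tags(sample)
print(tagged)
print(tags_to_words(tagged))
```

Training data is produced in the first direction (segmented text to tags), while segmentation runs in the second direction (predicted tags back to words).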

Conceptually, we can regard its ability to segment text as arising from the internal patterns of the Chinese language.

Interestingly, when we used the tool to segment classical Chinese text, it performed well. This suggests that the internal patterns of the Chinese language have not changed greatly over time.

CRF algorithm

Conditional random fields ( from http://en.wikipedia.org/wiki/Conditional_random_field )

Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples.

We use the wapiti package from LIMSI-CNRS, a very neat CRF package ( http://wapiti.limsi.fr/ ).

We modified the wapiti package slightly to meet our requirements.
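To make "taking context into account" concrete, here is a toy Viterbi decoder over the b/m/e/s tags. It is a sketch under stated assumptions, not how wapiti or GkSeg is implemented: the per-character scores are made-up inputs standing in for what a trained model would produce, and tag transitions are simply allowed or forbidden, whereas a trained linear-chain CRF also learns a weight for each transition. The names `TAGS`, `ALLOWED`, and `viterbi` are hypothetical.

```python
import math

TAGS = ("b", "m", "e", "s")

# Tags that may legally follow each tag inside a sentence.
ALLOWED = {
    "b": ("m", "e"),
    "m": ("m", "e"),
    "e": ("b", "s"),
    "s": ("b", "s"),
}


def viterbi(emissions):
    """Return the best-scoring legal tag sequence.

    emissions: one dict per character, mapping every tag to a score
    (higher is better).
    """
    if not emissions:
        return []
    # A sentence can only start with 'b' or 's'.
    best = {t: (emissions[0][t] if t in ("b", "s") else -math.inf) for t in TAGS}
    back = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for t in TAGS:
            # Pick the best predecessor that is allowed to precede tag t.
            prev = max((p for p in TAGS if t in ALLOWED[p]), key=lambda p: best[p])
            new_best[t] = best[prev] + scores[t]
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # A sentence can only end with 'e' or 's'.
    last = max(("e", "s"), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Choosing the top tag per character would give b, m, which is not a legal
# sequence (a word is left unfinished); the decoder picks a legal one instead.
scores = [
    {"b": 1.0, "m": 0.0, "e": 0.0, "s": 0.9},
    {"b": 0.0, "m": 1.0, "e": 0.0, "s": 0.5},
]
print(viterbi(scores))
```

The point of the example is that the label chosen for one character depends on its neighbors, which is what separates sequence models such as CRFs from per-character classifiers.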

Installation

Please follow the steps below:

git clone git://github.com/guokr/gkseg.git gkseg

cd gkseg/wapiti

make

The package is now ready, and you can use the tools it provides directly.

Usage of the tools

All the tools are located under the bin directory.

gkseg: segment a text into words

  • gkseg <text>

gksegd: start a web server that segments words through a RESTful API

  • gksegd

gksegt: train the tool

  • gksegt add <basedir> <aspect> <trainfile>
  • gksegt train <trainfile> <modelfile>

Using the API

Before using the API, initialize the program first; then perform the segmentation, and finally destroy the program.

import gkseg

text = '话说天下大势,分久必合,合久必分'.decode('utf-8')

gkseg.init()

print gkseg.seg(text) #segment the sentence into a list of words

print gkseg.term(text) #extract the important words from the sentence

print gkseg.label(text) #label the sentence

gkseg.destory()
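Because initialization and teardown must always be paired, one option is to wrap them in a context manager so that teardown runs even when segmentation raises. The helper below is a hypothetical sketch and not part of gkseg; it assumes only that the module exposes `init()` and `destory()`, spelled as in the example above.

```python
from contextlib import contextmanager


@contextmanager
def segmenter(module):
    """Initialize a gkseg-like module and always release it afterwards.

    `module` is assumed to expose init() and destory(), as in the README
    example; this wrapper itself is hypothetical.
    """
    module.init()
    try:
        yield module
    finally:
        module.destory()  # runs even if the body raises


# Intended usage, assuming gkseg is installed:
#
#     import gkseg
#     with segmenter(gkseg) as seg:
#         words = seg.seg(text)
```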

The training process

Step 1: prepare the training input

  • gksegt add <basedir> <aspect> <trainfile>

Here we have

  • <basedir>: The base path of the training corpus
  • <aspect>: The chosen aspect of the training corpus; see the section on the corpus format below
  • <trainfile>: The target training file

Step 2: train on the input file to produce the model

  • gksegt train <trainfile> <modelfile>

Here we have

  • <trainfile>: The training file as input
  • <modelfile>: The model file as output

The format of training corpus

Logically, a corpus is a set of files organized into several aspects. Physically, a training corpus must be organized in the following way:

  • A top-level folder with an index.txt file, which lists all the aspects and file names in the corpus.
  • Each aspect is a subfolder that contains all the files.

You can check the example at https://github.com/guokr/corpus/tree/master/zhxs

The Python module gkcrp in this package can be used to work with this corpus format.

As shown in the demo at https://github.com/guokr/corpus/tree/master/zhxs , there are two aspects: original and labeled. In the labeled folder, all the articles are labeled with the mark "m" to highlight the important keywords.
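The folder layout described in this section can be inspected with a short script. This is a sketch only: it does not use gkcrp and does not parse index.txt, whose exact format is not described here. It simply assumes that every subfolder of the top-level folder is an aspect; the function name `list_aspects` is hypothetical.

```python
import os


def list_aspects(top):
    """Map each aspect (subfolder of `top`) to its sorted list of file names."""
    aspects = {}
    for name in sorted(os.listdir(top)):
        path = os.path.join(top, name)
        if os.path.isdir(path):  # plain files such as index.txt are skipped
            aspects[name] = sorted(
                f for f in os.listdir(path)
                if os.path.isfile(os.path.join(path, f))
            )
    return aspects
```

For a corpus with the two aspects mentioned above, the result would have the keys `labeled` and `original`, each listing the files found in that subfolder.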

Contributors

  • Mingli Yuan (mountain at github)
  • Rui Wang (isnowfy at github)

License

  • MIT license for the main part of the project
  • wapiti is under its own license
  • uthash is under BSD license
