
A Chinese word segmentation package that works in any JS environment, forked from leizongmin/node-segment


Chinese Word Segmentation Module

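This module is a heavily modified fork of node-segment. It adds Electron and browser support, and further optimization for Electron's multi-threaded runtime is planned.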

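I spent the time on this fork because segment and nodejieba, although they work well under Node, cannot run at all in the browser or in Electron. I refactored the code to ES2015 and inlined the dictionary files with a babel plugin. Loading everything comes to 3.8 MB, but the dictionaries and modules support tree shaking (please use the ESM build), so any dictionaries you do not need can be left out.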

Usage

Install

<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/umd/segmentit.min.js" />
npm i segmentit

Use

import { Segment, useDefault } from 'segmentit';

const segmentit = useDefault(new Segment());
const result = segmentit.doSegment('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。');
console.log(result);

In a RunKit environment:

const { Segment, useDefault } = require('segmentit');
const segmentit = useDefault(new Segment());
const result = segmentit.doSegment('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。');
console.log(result);

Try it for free on RunKit

Example of using it directly in the browser:

First, include “https://cdn.jsdelivr.net/npm/[email protected]/dist/umd/segmentit.js”

const segmentit = Segmentit.useDefault(new Segmentit.Segment());
const result = segmentit.doSegment('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作。');
console.log(result);

(In other words, just prefix every call, initialization and so on with Segmentit.)

Getting part-of-speech tags

Part-of-speech tags in the style of jieba:

// import { Segment, useDefault, cnPOSTag, enPOSTag } from 'segmentit';
const { Segment, useDefault, cnPOSTag, enPOSTag } = require('segmentit');

const segmentit = useDefault(new Segment());

console.log(segmentit.doSegment('一人得道,鸡犬升天').map(i => `${i.w} <${cnPOSTag(i.p)}> <${enPOSTag(i.p)}>`))
// ↑ ["一人得道 <习语,数词 数语素> <l,m>", ", <标点符号> <w>", "鸡犬升天 <成语> <i>"]
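
The output above shows one word carrying several tags at once (`l,m`), which is consistent with `p` being a bit mask in which each bit stands for one part of speech. The sketch below illustrates that idea only. The `FLAGS` table, its numeric values, and the `decodeTags` helper are made up for the example and are not segmentit's real constants, so rely on `cnPOSTag` / `enPOSTag` for real output.

```javascript
// Hypothetical flag table: names and values are illustrative assumptions,
// not segmentit's real POSTAG constants.
const FLAGS = {
  l: 0x1, // idiomatic phrase
  m: 0x2, // numeral
  i: 0x4, // idiom
  w: 0x8, // punctuation
};

// Return the names of all flags that are set in the bit mask `p`.
function decodeTags(p) {
  return Object.keys(FLAGS).filter((name) => (p & FLAGS[name]) !== 0);
}

console.log(decodeTags(FLAGS.l | FLAGS.m).join(',')); // l,m
console.log(decodeTags(FLAGS.i).join(',')); // i
```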

Using only some of the dictionaries, or custom dictionaries

useDefault is implemented like this:

// useDefault
import { Segment, modules, dicts, synonyms, stopwords } from 'segmentit';

const segmentit = new Segment();
segmentit.use(modules);
segmentit.loadDict(dicts);
segmentit.loadSynonymDict(synonyms);
segmentit.loadStopwordDict(stopwords);

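So you can in fact import only the dictionaries and modules you need and load them one by one, as shown below. The dictionaries and modules you do not import should be removed by webpack's tree shaking. You can load your own dictionary files in the same way; just note that the signature of loadDict is (dicts: string | string[]): Segment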

// load custom module and dicts
import {
  Segment,
  ChsNameTokenizer,
  DictOptimizer,
  EmailOptimizer,
  PunctuationTokenizer,
  URLTokenizer,
  ChsNameOptimizer,
  DatetimeOptimizer,
  DictTokenizer,
  ForeignTokenizer,
  SingleTokenizer,
  WildcardTokenizer,
  pangu,
  panguExtend1,
  panguExtend2,
  names,
  wildcard,
  synonym,
  stopword,
} from 'segmentit';

const segmentit = new Segment();

// load them one by one, or by array
segmentit.use(ChsNameTokenizer);
segmentit.loadDict(pangu);
segmentit.loadDict([panguExtend1, panguExtend2]);
segmentit.loadSynonymDict(synonym);
segmentit.loadStopwordDict(stopword);
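
Because `loadDict` accepts either one string or an array of strings and returns the `Segment` instance, calls can be chained. The class below is a stand-in written for illustration, not segmentit's code, and shows one way such a signature can be implemented. `MiniSegment`, its `table` field, and the `word|tag|frequency` line format are assumptions; check the bundled dictionaries for the real format before writing your own.

```javascript
// Stand-in class, NOT segmentit's Segment: it only mirrors the
// (dicts: string | string[]) => this shape of loadDict.
class MiniSegment {
  constructor() {
    this.table = new Map(); // word -> { p, f }
  }

  loadDict(dicts) {
    // Normalise "one string or many" into an array.
    const list = Array.isArray(dicts) ? dicts : [dicts];
    for (const dict of list) {
      for (const line of dict.split(/\r?\n/)) {
        // Assumed line format: word|tag|frequency
        const parts = line.split('|');
        if (parts.length < 3) continue; // skip blank or malformed lines
        const word = parts[0].trim();
        this.table.set(word, { p: Number(parts[1]), f: Number(parts[2]) });
      }
    }
    return this; // returning the instance allows chaining
  }
}

const mini = new MiniSegment()
  .loadDict('分词|0x1|100')
  .loadDict(['词典|0x2|100\n浏览器|0x2|50', '']);

console.log(mini.table.size); // 3
console.log(mini.table.get('词典')); // { p: 2, f: 100 }
```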

The Pangu dictionary is rather dated and lacks newer words such as 「软萌萝莉」. If you are able to, please send a PR with your own word list.

Creating your own segmentation middleware (Tokenizer) and result optimizer (Optimizer)

Tokenizer

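A Tokenizer is one of the middlewares the text passes through during segmentation, similar to Redux middleware. Its split function receives the token array produced so far, partway through segmentation, and returns a token array in the same format. (This is also why you should not segment very long texts: the array would become enormous.)

An example: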

// @flow
import { Tokenizer } from 'segmentit';
import type { SegmentToken, TokenStartPosition } from 'segmentit';
export default class ChsNameTokenizer extends Tokenizer {
  split(words: Array<SegmentToken>): Array<SegmentToken> {
    // this.segment gives access to the segmenter's state
    const POSTAG = this.segment.POSTAG;
    const TABLE = this.segment.getDict('TABLE');
    // ...
  }
}
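
The middleware idea can be sketched without the library: start from a single token holding the whole text and pass the array through each tokenizer's `split` in turn. Everything below is a stand-in for illustration. `MiniPunctuationTokenizer`, `runTokenizers`, the tag value, and the convention that an untagged token has no `p` field are assumptions, not segmentit's implementation.

```javascript
// Stand-in tokenizer, NOT a segmentit class: it splits punctuation out of
// tokens that have not been tagged yet (tokens without a `p` field).
const PUNCTUATION_TAG = 0x8; // hypothetical tag value
const MARKS = /([,。!?,.!?])/;

class MiniPunctuationTokenizer {
  split(words) {
    const out = [];
    for (const word of words) {
      if (word.p !== undefined) {
        out.push(word); // already recognised by an earlier tokenizer
        continue;
      }
      // The capturing group keeps the punctuation marks in the result.
      for (const piece of word.w.split(MARKS)) {
        if (piece === '') continue;
        if (piece.length === 1 && MARKS.test(piece)) out.push({ w: piece, p: PUNCTUATION_TAG });
        else out.push({ w: piece });
      }
    }
    return out;
  }
}

// Run the token array through every tokenizer in order, like middleware.
function runTokenizers(text, tokenizers) {
  return tokenizers.reduce((words, t) => t.split(words), [{ w: text }]);
}

const tokens = runTokenizers('一人得道,鸡犬升天。', [new MiniPunctuationTokenizer()]);
console.log(tokens.map((t) => t.w)); // [ '一人得道', ',', '鸡犬升天', '。' ]
```

Since every tokenizer returns a whole new array, the cost grows with both the text length and the number of tokenizers, which matches the warning about very long inputs.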

Optimizer

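An Optimizer is the place for heuristic rules: once segmentation has finished, some cases are hard to handle with dictionaries but can be handled with heuristics. Its doOptimize function likewise receives a token array and returns a token array in the same format.

Besides the token array, you can define further parameters of your own. In the example below, the optimizer calls itself recursively once and uses the second parameter to track the recursion depth: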

// @flow
import { Optimizer } from './BaseModule';
import type { SegmentToken } from './type';
export default class DictOptimizer extends Optimizer {
  doOptimize(words: Array<SegmentToken>, isNotFirst: boolean): Array<SegmentToken> {
    // this.segment gives access to the segmenter's state
    const POSTAG = this.segment.POSTAG;
    const TABLE = this.segment.getDict('TABLE');
    // ...
    // Numbers that were just combined cannot be recognised as a new combination in the same pass, so scan once more
    return isNotFirst === true ? words : this.doOptimize(words, true);
  }
}
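
To see why a second scan helps, here is a minimal stand-in that merges adjacent digit tokens pairwise. It is not segmentit's DictOptimizer; `MiniNumberOptimizer` and its merging rule are assumptions chosen only to make the `isNotFirst` pattern observable.

```javascript
// Stand-in optimizer, NOT segmentit's DictOptimizer: it merges adjacent
// digit tokens pairwise, then runs itself one more time.
class MiniNumberOptimizer {
  doOptimize(words, isNotFirst) {
    const out = [];
    let i = 0;
    while (i < words.length) {
      const cur = words[i];
      const next = words[i + 1];
      if (next && /^\d+$/.test(cur.w) && /^\d+$/.test(next.w)) {
        out.push({ w: cur.w + next.w }); // merge the pair
        i += 2;
      } else {
        out.push(cur);
        i += 1;
      }
    }
    // A single pass can leave a freshly merged number next to another one,
    // so scan once more; the flag stops the recursion after that.
    return isNotFirst === true ? out : this.doOptimize(out, true);
  }
}

const merged = new MiniNumberOptimizer().doOptimize([
  { w: '2' },
  { w: '4' },
  { w: '0' },
  { w: '口' },
]);
console.log(merged.map((t) => t.w)); // [ '240', '口' ]
```

The first pass yields `24`, `0`, `口`; only the second pass can join `24` and `0`.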

For example, none of the current segmentation tools tags 红色 correctly in 「一条红色内裤」, but in segmentit I added a simple AdjectiveOptimizer to handle it:

// @flow
// https://github.com/linonetwo/segmentit/blob/master/src/module/AdjectiveOptimizer.js
import { Optimizer } from './BaseModule';
import type { SegmentToken } from './type';

import { colors } from './COLORS';

// Re-tag words wrongly recognised as nouns as adjectives, and handle nouns used as attributive modifiers
export default class AdjectiveOptimizer extends Optimizer {
  doOptimize(words: Array<SegmentToken>): Array<SegmentToken> {
    const { POSTAG } = this.segment;
    let index = 0;
    while (index < words.length) {
      const word = words[index];
      const nextword = words[index + 1];
      if (nextword) {
        // For <colour> + <的>, treat the colour as an adjective (colours are all nouns in the dictionary)
        if (nextword.p === POSTAG.D_U && colors.includes(word.w)) {
          word.p = POSTAG.D_A;
        }
        // If two nouns appear in a row and the first one is a colour, that colour is also an adjective
        if (word.p === POSTAG.D_N && nextword.p === POSTAG.D_N && colors.includes(word.w)) {
          word.p = POSTAG.D_A;
        }
      }
      // Move on to the next word
      index += 1;
    }
    return words;
  }
}
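
The effect of these two rules can be checked on a hand-written token array without loading the library. In the sketch below, the `POSTAG` values, the short `colors` list, the `tagColorAdjectives` function, and the sample tokens are all assumptions made for illustration; the real constants and colour list live in segmentit's source.

```javascript
// Hypothetical tag values and colour list, for illustration only.
const POSTAG = { D_A: 0x1, D_N: 0x2, D_U: 0x4, D_Q: 0x8 };
const colors = ['红色', '蓝色'];

// The same two rules as AdjectiveOptimizer, written as a plain function.
function tagColorAdjectives(words) {
  words.forEach((word, index) => {
    const nextword = words[index + 1];
    if (!nextword || !colors.includes(word.w)) return;
    // Rule 1: <colour> followed by the particle 的.
    const beforeDe = nextword.p === POSTAG.D_U;
    // Rule 2: a colour noun directly followed by another noun.
    const nounBeforeNoun = word.p === POSTAG.D_N && nextword.p === POSTAG.D_N;
    if (beforeDe || nounBeforeNoun) word.p = POSTAG.D_A;
  });
  return words;
}

// Hand-written tokens standing in for the segmenter's output.
const sample = [
  { w: '一条', p: POSTAG.D_Q },
  { w: '红色', p: POSTAG.D_N },
  { w: '内裤', p: POSTAG.D_N },
];
console.log(tagColorAdjectives(sample)[1].p === POSTAG.D_A); // true
```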

License

MIT LICENSED
