• Stars
    star
    946
  • Rank 48,319 (Top 1.0 %)
  • Language
    Java
  • Created over 9 years ago
  • Updated almost 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie.

AhoCorasickDoubleArrayTrie

Maven Central GitHub release License

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie structure. Its speed is 5 to 9 times of naive implementations, perhaps it's the fastest implementation so far ;-)

Introduction

You may heard that Aho-Corasick algorithm is fast for parsing text with a huge dictionary, for example:

  • looking for certain words in texts in order to URL link or emphasize them
  • adding semantics to plain text
  • checking against a dictionary to see if syntactic errors were made

But most implementation use a TreeMap<Character, State> to store the goto structure, which costs O(lg(t)) time, t is the largest amount of a word's common prefixes. The final complexity is O(n * lg(t)), absolutely t > 2, so n * lg(t) > n . The others used a HashMap, which wasted too much memory, and still remained slowly.

I improved it by replacing the XXXMap to a Double Array Trie, whose time complexity is just O(1), thus we get a total complexity of exactly O(n), and take a perfect balance of time and memory. Yes, its speed is not related to the length or language or common prefix of the words of a dictionary.

This implementation has been widely used in my HanLP: Han Language Processing package. I hope it can serve as a common data structure library in projects handling text or NLP task.

Dependency

Include this dependency in your POM. Be sure to check for the latest version in Maven Central.

<dependency>
  <groupId>com.hankcs</groupId>
  <artifactId>aho-corasick-double-array-trie</artifactId>
  <version>1.2.3</version>
</dependency>

or include this dependency in your build.gradle.kts

implementation("com.hankcs:aho-corasick-double-array-trie:1.2.2")

Usage

Setting up the AhoCorasickDoubleArrayTrie is a piece of cake:

        // Collect test data set
        TreeMap<String, String> map = new TreeMap<String, String>();
        String[] keyArray = new String[]
                {
                        "hers",
                        "his",
                        "she",
                        "he"
                };
        for (String key : keyArray)
        {
            map.put(key, key);
        }
        // Build an AhoCorasickDoubleArrayTrie
        AhoCorasickDoubleArrayTrie<String> acdat = new AhoCorasickDoubleArrayTrie<String>();
        acdat.build(map);
        // Test it
        final String text = "uhers";
        List<AhoCorasickDoubleArrayTrie.Hit<String>> wordList = acdat.parseText(text);

Of course, there remains many useful methods to be discovered, feel free to try:

  • Use a Map<String, SomeObject> to assign a SomeObject as value to a keyword.
  • Store the AhoCorasickDoubleArrayTrie to disk by calling save method.
  • Restore the AhoCorasickDoubleArrayTrie from disk by calling load method.
  • Use it in concurrent code. AhoCorasickDoubleArrayTrie is thread safe after build method

In other situations you probably do not need a huge wordList, then please try this:

        acdat.parseText(text, new AhoCorasickDoubleArrayTrie.IHit<String>()
        {
            @Override
            public void hit(int begin, int end, String value)
            {
                System.out.printf("[%d:%d]=%s\n", begin, end, value);
            }
        });

or a lambda function

        acdat.parseText(text, (begin, end, value) -> {
            System.out.printf("[%d:%d]=%s\n", begin, end, value);
        });

Comparison

I compared my AhoCorasickDoubleArrayTrie with robert-bor's aho-corasick, ACDAT represents for AhoCorasickDoubleArrayTrie and Naive represents for aho-corasick, the result is :

Parsing English document which contains 3409283 characters, with a dictionary of 127142 words.
               	Naive          	ACDAT
time           	607            	102
char/s         	5616611.20     	33424343.14
rate           	1.00           	5.95
===========================================================================
Parsing Chinese document which contains 1290573 characters, with a dictionary of 146047 words.
               	Naive          	ACDAT
time           	319            	35
char/s         	2609156.74     	23780600.00
rate           	1.00           	9.11
===========================================================================

In English test, AhoCorasickDoubleArrayTrie is 5 times faster. When it comes to Chinese, AhoCorasickDoubleArrayTrie is 9 times faster. This test is conducted under i7 2.0GHz, -Xms512m -Xmx512m -Xmn256m. Feel free to re-run this test in TestAhoCorasickDoubleArrayTrie, the test data is ready for you.

Thanks

This project is inspired by aho-corasick and darts-clone-java. Many thanks!

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

More Repositories

1

HanLP

中文分词 词性标注 命名实体识别 依存句法分析 成分句法分析 语义依存分析 语义角色标注 指代消解 风格转换 语义相似度 新词发现 关键词短语提取 自动摘要 文本分类聚类 拼音简繁转换 自然语言处理
Python
33,681
star
2

pyhanlp

中文分词
Python
3,122
star
3

CS224n

CS224n: Natural Language Processing with Deep Learning Assignments Winter, 2017
Python
673
star
4

Viterbi

An implementation of HMM-Viterbi Algorithm 通用的维特比算法实现
Java
369
star
5

multi-criteria-cws

Simple Solution for Multi-Criteria Chinese Word Segmentation
Python
300
star
6

hanlp-lucene-plugin

HanLP中文分词Lucene插件,支持包括Solr在内的基于Lucene的系统
Java
296
star
7

TextRank

TextRank算法提取关键词的Java实现
Java
201
star
8

LDA4j

A Java implemention of LDA(Latent Dirichlet Allocation)
Java
195
star
9

TreebankPreprocessing

Python scripts preprocessing Penn Treebank and Chinese Treebank
Python
162
star
10

ID-CNN-CWS

Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"
Python
136
star
11

MainPartExtractor

主谓宾提取器的Java实现(对斯坦福的代码失去兴趣,不再维护)
Java
136
star
12

neural_net

反向传播神经网络及应用
Python
82
star
13

udacity-deep-learning

Assignments for Udacity Deep Learning class with TensorFlow in PURE Python, not IPython Notebook
Python
66
star
14

AveragedPerceptronPython

Clone of "A Good Part-of-Speech Tagger in about 200 Lines of Python" by Matthew Honnibal
Python
49
star
15

MaxEnt

这是一个最大熵的简明Java实现,提供提供训练与预测接口。训练算法采用GIS训练算法,附带示例训练集和一个天气预测的Demo。
Java
46
star
16

text-classification-svm

The missing SVM-based text classification module implementing HanLP's interface
Java
46
star
17

IceNAT

IceNAT
Java
32
star
18

BERT-token-level-embedding

Generate BERT token level embedding without pain
Python
28
star
19

sub-character-cws

Sub-Character Representation Learning
Python
26
star
20

HanLPAndroidDemo

HanLP Android Demo
Java
22
star
21

maxent_iis

最大熵-IIS(Improved Iterative Scaling)训练算法的Java实现
Java
18
star
22

gohanlp

Golang RESTful Client for HanLP
Go
13
star
23

iparser

Yet another dependency parser, integrated with tokenizer, tagger and visualization tool.
Python
11
star
24

DeepBiaffineParserMXNet

An experimental implementation of biaffine parser using MXNet
Python
10
star
25

OpenCC-to-HanLP

无损转换OpenCC词典为HanLP格式
Python
9
star
26

tmsvm

Python
1
star
27

bolt_splits

Split Broad Operational Language Translation corpus into train/dev/test set
Python
1
star