• Stars
    star
    130
  • Rank 277,575 (Top 6 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created over 11 years ago
  • Updated almost 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Finite state dictionaries in Java

dictomaton

Introduction

This Java library implements dictionaries that are stored in finite state automata. Dictomaton has the following features:

  • Finite state dictionaries that implement the Java Set interface.
  • Perfect hash dictionaries, that provide a unique hash for each character sequence that is in the dictionary. Perfect hash dictionaries can be used in two directions: (1) obtaining the hash code for a character sequence and (2) obtaining the character sequence for a hash code.
  • Levenshtein automata, that allow you to efficiently find all the sequences in the dictionary that are within the given edit distance of a sequence.
  • String to primitive type mappings, where the keys are stored in a perfect hashing automaton and the values in an (unboxed) array.

Using Dictomaton

Dictomaton is in the Maven Central Repository:

<dependency>
    <groupId>eu.danieldk.dictomaton</groupId>
    <artifactId>dictomaton</artifactId>
    <version>1.1.1</version>
</dependency>

SBT:

libraryDependencies += "eu.danieldk.dictomaton" % "dictomaton" % "1.1.1"

Grails:

compile 'eu.danieldk.dictomaton:dictomaton:1.1.1'

Comparisons

The following table compares the sizes of the object graphs of the Dictionary type of this library to that of TreeSet and HashSet. The comparisons were obtained by storing all the words in the web2 and web2a dictionaries and were measured using memory-measurer

Data typeObjectsReferencescharintbooleanfloat
TreeSet936277187255531937496241843120910
HashSet9362771772657319374993627711
Dictionary411889454642416939703311

Benchmarks

Benchmarks are in a different test group than normal unit tests. You can run benchmarks via Maven, adding the Benchmarks group:

mvn test -Djunit.groups=eu.danieldk.dictomaton.categories.Benchmarks

Changelog

1.2.0

  • Exposing state through StateInfo object, which allows user of PerfectHashDictionary to resume transitions, which makes it e.g. far more efficient to look up a string and its prefixes. (contributed by René Kriegler).
  • DictionaryBuilder now accepts adding more general CharSequence instead of String and uses CharSequence internally (contributed by René Kriegler).

1.1.0

  • Added immutable mapping from String to a generic type.
  • Added a key-ordered builder for immutable mappings. This builder is more efficient since it construct the key automaton on the fly.

1.0.0

  • Added Levenshtein automata for looking up sequences in a Dictionary that are within a certain edit distance of a sequence.
  • Provide a variant of perfect hash automata that puts right language cardinalities in transitions rather than states. This provides faster hashing and hashcode lookups at the cost of some memory.
  • Added String to String mapping (ImmutableStringStringMap).
  • Generic object values.

0.0.3

  • Fix an off-by-one error in integer width of the state table.

0.0.2

  • Rename the project from fsadict-java to dictomaton.
  • Store the state and transition tables as packed int arrays, resulting in drastically smaller automata.

Release plan

Plans for 1.3.0: Perhaps an explicit, fast, and compact data storage format as an alternative to Java serialization. C or C++ version.

Contributors

  • Daniël de Kok (maintainer)
  • René Kriegler

More Repositories

1

tensorflow

Go binding for Tensorflow
Go
131
star
2

go2vec

Read and use word2vec vectors in Go
Go
56
star
3

golinear

liblinear bindings for Go
Go
45
star
4

dpar

Neural network transition-based dependency parser (in Rust)
Rust
43
star
5

nix-home

Nix home environment
Nix
42
star
6

citar-cxx

Citar part of speech tagger
C++
40
star
7

awesome-glove80

Because Glove80 is awesome
36
star
8

jitar

Jitar HMM part of speech tagger
Java
22
star
9

sentencepiece

Rust binding for the sentencepiece library
Rust
20
star
10

citar

Citar HMM part-of-speech tagger
Go
15
star
11

par

Parallel for loops and maps for Go
Go
10
star
12

rust2vec

Read/write word2vec and GloVe embeddings in Rust
Rust
10
star
13

gemm-benchmark

Simple [sd]gemm benchmark, similar to ACES dgemm
Rust
9
star
14

conllu

Reader/writer for the CoNLL-U format
Rust
8
star
15

tinyest

Tiny maximum entropy model parameter estimator
C
8
star
16

quzah

Generate sets of distinct colors
Java
8
star
17

brainfuck-pl

A brainf*ck interpreter in Prolog
Prolog
8
star
18

conllx-utils

CoNLL-X utilities
Rust
7
star
19

conllx-rs

CoNLL-X reader and writers for Rust
Rust
7
star
20

dotfiles

Personal dotfiles
Shell
6
star
21

gosvm

Go binding for libsvm
Go
6
star
22

wordpieces

Split tokens into word pieces
Rust
5
star
23

golinear-examples

Examples for the golinear package.
Go
5
star
24

conllx

Go reader for the CONLL-X format
Go
4
star
25

digest-pure

Pure implementation of the digest package
Haskell
3
star
26

chu-liu-edmonds

Chu-Liu-Edmonds minimal spanning tree (MST) decoding
Rust
3
star
27

ir-examples

IR16/17 examples
Jupyter Notebook
3
star
28

viscosity-workflow

Alfred 2 workflow for Viscosity
Python
3
star
29

snakefusion

🐍 A slim, low-maintenance Python wrapper for finalfusion.
Rust
3
star
30

toponn

Topological field prediction, deprecated for sticker:
3
star
31

conllu-utils

Utilities for working with CoNLL-U
Rust
3
star
32

alpino-tokenizer

Rust wrapper for the Alpino tokenizer
Rust
3
star
33

dl4nlp-examples

DL4NLP examples
Jupyter Notebook
3
star
34

thinc-rust-ops

Non-official(!) Rust Thinc Ops, with SIMD-vectorized implementations of many ops. Because Ops are fun to implement.
Rust
3
star
35

simpletbl

Tranformation-based learner for tagging
C++
2
star
36

approx-rand-test

Approximate randomization test
Haskell
2
star
37

vec2web

Small go2vec demo application
HTML
2
star
38

rtrie

Randomized ternary tries
Rust
2
star
39

nix-packages

My Nix package set
Nix
2
star
40

gosvm-examples

Examples for the gosvm package
Go
2
star
41

ndarray-tensorflow

Tensor wrapper with ndarray API
Rust
2
star
42

homewizard-energy

Homewizard WiFi P1 meter
Lua
1
star
43

marlin-kernels

Experiment with standalone Marlin kernels
Cuda
1
star
44

sics-zlib

zlib binding for SICStus Prolog
C
1
star
45

conllx-view

Small CoNLL-X treebank viewer
Rust
1
star
46

alpino-tools

Various tools to deal with Alpino data
Haskell
1
star
47

seqalign

Sequence alignments
Rust
1
star
48

alpino-tokenizer-python

Rust
1
star
49

transformers-examples

Jupyter Notebook
1
star
50

kinesis-advantage

Kinesis Advantage2/360 notes
1
star
51

headline-generation

Transformation-based learner for headline generation
Java
1
star