• Stars
    star
    140
  • Rank 261,473 (Top 6 %)
  • Language
    C++
  • License
    MIT License
  • Created almost 12 years ago
  • Updated over 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

fast, efficient, unicode aware HAT trie with prefix / suffix support for Ruby

Triez

Build Status Code Climate Gem Version

Pragmatic tries for Ruby, spelled in lolcat.

It is fast, memory efficient, unicode aware, prefix searchable, and enchanced with prefix/suffix/substring keys.

The backend of triez is a cache oblivious data structure: the HAT trie (In fact it is a modified version for improved functionality). HAT trie is generally faster and more memory efficient than double array or burst trie.

Requirement

  • CRuby 1.9 / 2.0
  • g++ or clang

Install

gem ins triez

Synopsis

require 'triez'

# create triez with (default value type) int64, and setting default value 0
t = Triez.new

# The default value is 0
t['foo'] #=> 0

# available options for value_type:
# - :int64  -- signed 64 bit integer
# - :object -- ruby object, but note that:
#              if the object is beyond simple types like NilClass, TrueClass, Integer,
#              it will take additional heap space
t = Triez.new value_type: :int64

# more flexible with object type [*see note below]
t = Triez.new value_type: :object

# get the value type
t.value_type

# set a different default value
t = Triez.new value_type: :object, default: 'hello'

# insert or change value
t['key'] = 100

# insert a key with default value
t << 'key'

# batch change values under all suffices/prefices/substrings of a key
t.change_all(:suffix, 'key') {|old_value| ...calculate new value }
t.change_all(:prefix, 'key') {|old_value| ...calculate new value }
# enumerates all occurences of substrings of the key
t.change_all(:substring, 'key') {|old_value| ...calculate new value }

# size of inserted keys
t.size

# search with exact match
t.has_key? 'key'
t['key']

# prefixed search (iterate over values under a prefix), available options are:
# - limit: max items, `nil` means no limit
# - sort: whether iterate in alphabetic order, default is true
t.search_with_prefix(prefix, limit: 10, sort: true) do |suffix, value|
  ...
end

# if no block given, an array in the form of [[suffix, value]] is returned
t.search_with_prefix('prefix')

# enumerate all keys and values in the order of binary collation
t.each do |key, value|
  ...
end

# iterate stored keys which are prefices of a given string, from shallow to deep
t.walk string do |k, v|
  ...
end

* Note: By default, triez store signed integers within 64bits, you can use them as weights, counts or database IDs. In case you need to store arbitrary object in a node, use value_type: :object:

t = Triez.new value_type: :object
t['Tom'] = {name: 'Tom', sex: 'Female'}
t['Tree'] = [:leaf, :trunk, :root]

Examples

Prefix based autocompletion:

require 'triez'
words = %w[readme, rot, red, rah, rasterization]
t = Triez.new
words.each do |word|
  t[word] = 1
end
t.search_with_prefix 're' do |suffix|
  puts "candidate: re#{suffix}"
end

The output:

candidate: readme
candidate: red

Efficient full text search with a suffix tree:

require 'triez'
sequences = {
  'ACTGAAAAAAACTG' => 1,
  'ATACGGTCCA' => 2,
  'GCTTGTACGT' => 3
}
t = Triez.new

# build suffix tree
sequences.each do |seq, id|
  t.change_all(:suffix, seq){id}
end

t.search_with_prefix 'CGGT' do |_, id|
  puts id #=> 2
end

The searching time is linear to the length of the substring. You may also be interested in the example of a simple full text search server with triez.


Solve the longest common substring problem:

# coding: utf-8
require 'triez'
sentences = %w[
  万塘路一锅鸡
  去文二路一锅鸡吃饭
  来一锅鸡顶盒
  一锅鸡胗
]

# value is bitset representing id of the sentence
# in ruby we can use integers of arbitrary length as bitsets
t = Triez.new value_type: :object, default: 0

sentences.each_with_index do |sentence, i|
  elem = 1 << i
  t.change_all :substring, sentence do |v|
    # union
    v | elem
  end
end

# longest common substring
lcs = ''

# find the key tagged with universe
universe = (1 << sentences.size) - 1
t.each do |k, v|
  lcs = k if k.size > lcs.size and v == universe
end

puts lcs #=> 一锅鸡

Benchmark

Here's a benchmark on

ruby 1.9.3p374 (2013-01-15 revision 38858) [x86_64-darwin12.2.1]
2.3 GHz Intel Core i7

The test data are 3 milion titles of wikipedia articles (from http://dumps.wikimedia.org/enwiki/20121101/)

thing/backend           | memory  | insertion time | 3 M query
------------------------|---------|----------------|----------
hash/linked hash        | 340.2 M |    4.369 s     | 0.2800 s
fast_trie/double array* | 155.6 M |    130.7 s     | 0.4359 s
triez/HAT trie          | 121.7 M |    3.872 s     | 0.3472 s

Note: fast_trie/double array -> https://github.com/tyler/trie

Caveats

  • The sort option in prefixed search orders keys with binary collation, but string comparison in Ruby is with unicode codepoint collation.
  • For some rare case of many threads modifying the same trie, you may need a mutex.
  • If you still feel memory not enough, you may consider MARISA-trie (note that MARISA is immutable), or a database.

Development

git clone git://github.com/luikore/triez.git
cd triez
rake glob_src
rake

To update vendor lib and re-compile:

rake glob_src
rake

Note

Although HAT trie uses MurMurHash3 instead of SipHash in Ruby, It is still safe under hashDoS because bucket size is limited.

More Repositories

1

nyara

Fast and fuzzy ruby web framework + server
C
81
star
2

rsec

Parser / Regexp Combinator For Ruby
JavaScript
77
star
3

markascend

Markdown-like but less syntax elements, with TeX-like macros and Slim-style indented blocks
Ruby
18
star
4

regexp_optimized_union

Regexp.optimized_union(word_list, regexp_options) generates optimized regexp for matching union of word list
Ruby
11
star
5

zscan

Improved string scanner
C++
9
star
6

stimulus-bind

Enable simple data binding for stimulusjs
CoffeeScript
9
star
7

spellbreak

A "your magic is mine"-magic-like language and runtime
C++
8
star
8

rfc-base32

RFC4648 Base32 Encode / Decode; For Ruby 1.9.x
Ruby
7
star
9

lzss

LZSS compress algorithm for Ruby
C
6
star
10

gems.mirror

Simple gems mirror for bundle speed, all mirrored files are lazy-downloaded so it's simple to start.
Ruby
5
star
11

scala-gvim-accessories

scala gvim accessories
Vim Script
4
star
12

bk201

In-memory asm to binary compiler, so you can use asm in ruby !
Ruby
4
star
13

vimfile

my vimfile config
Vim Script
3
star
14

vscode-hypertab

The Missing Tab Completion for VS Code
TypeScript
3
star
15

I18n-t.tmbundle

Query / Generate Rails / Padrino I18n Locale for TextMate
Ruby
3
star
16

style-scoped.js

Cross browser `<style scoped>` support, requires jquery
CoffeeScript
2
star
17

nyario

Fiber-based high performance IO lib
Ruby
2
star
18

mingwX-for-rubinius

Posix headers stubs for compiling rubinius under mingw
C
2
star
19

Markascend.tmbundle

Textmate / Sublime support for Markascend http://github.com/luikore/markascend
2
star
20

cici2

C
1
star
21

ng-todo-list

play ng, make a todo list
CoffeeScript
1
star
22

rainbow-git

gitist (deprecated, still here just to save some legacy code)
JavaScript
1
star
23

magan

yet yet yet another stanamic PEG parser generator
Ruby
1
star
24

property-list

Property list (plist) library with all formats support
Ruby
1
star
25

bw

pure text editor
Ruby
1
star
26

keycap

1.5u keycap for Kailh low profile switch
1
star