Repository Details

A Go library for performing Unicode Text Segmentation as described in Unicode Standard Annex #29

segment

Features

  • Currently only segmentation at Word Boundaries is supported.

License

Apache License Version 2.0

Usage

The functionality is exposed in two ways:

  1. You can use a bufio.Scanner with the SplitWords implementation of SplitFunc. The SplitWords function identifies the word boundaries in the input text, and the Scanner returns one token per segment between those boundaries.

    scanner := bufio.NewScanner(...)
    scanner.Split(segment.SplitWords)
    for scanner.Scan() {
    	tokenBytes := scanner.Bytes()
    }
    if err := scanner.Err(); err != nil {
    	t.Fatal(err)
    }
    
  2. Sometimes you also want to know the type of each token. For this we have introduced a new type named Segmenter. It works just like Scanner, but it additionally returns a token type.

    segmenter := segment.NewWordSegmenter(...)
    for segmenter.Segment() {
    	tokenBytes := segmenter.Bytes()
    	tokenType := segmenter.Type()
    }
    if err := segmenter.Err(); err != nil {
    	t.Fatal(err)
    }
    

Choosing Implementation

By default, segment does NOT use the fastest runtime implementation, because that implementation adds approximately 5 seconds to compilation time and may require more than 1 GB of RAM on the machine performing the compilation.

However, you can choose to build with the fastest runtime implementation by passing the build tag as follows:

	-tags 'prod'

Generating Code

Several components in this package are generated.

  1. Several Ragel rules files are generated from Unicode properties files.
  2. A Ragel machine is generated from the Ragel rules.
  3. Test tables are generated from the Unicode test files.

All of these can be generated by running:

	go generate

Fuzzing

There is support for fuzzing the segment library with go-fuzz.

  1. Install go-fuzz if you haven't already:

    go get github.com/dvyukov/go-fuzz/go-fuzz
    go get github.com/dvyukov/go-fuzz/go-fuzz-build
    
  2. Build the package with go-fuzz:

    go-fuzz-build github.com/blevesearch/segment
    
  3. Convert the Unicode provided test cases into the initial corpus for go-fuzz:

    go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate
    
  4. Run go-fuzz:

    go-fuzz -bin=segment-fuzz.zip -workdir=workdir
    
