  • Stars: 250
  • Rank 162,397 (Top 4 %)
  • Language: Rust
  • License: MIT License
  • Created about 4 years ago
  • Updated 2 months ago

Repository Details

Library used by Meilisearch to tokenize queries and documents

Charabia

Role

The tokenizer's role is to take a sentence or phrase and split it into smaller units of language, called tokens. It finds and retrieves all the words in a string based on the language's particularities.
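To make the idea of splitting a string into tokens concrete, here is a deliberately naive sketch that uses only the Rust standard library. It is not Charabia's algorithm: the function name `naive_segments` is hypothetical, and the rule it applies (alternate between runs of alphanumeric characters and runs of everything else) is an assumption chosen for illustration. It ignores the language-specific particularities mentioned above.

```rust
/// Splits `text` into alternating runs of alphanumeric characters ("words")
/// and runs of everything else ("separators"). Both kinds of run are kept.
/// Hypothetical helper for illustration; this is not Charabia's algorithm.
fn naive_segments(text: &str) -> Vec<&str> {
    let mut segments = Vec::new();
    let mut start = 0;
    let mut prev_is_word: Option<bool> = None;

    for (idx, ch) in text.char_indices() {
        let is_word = ch.is_alphanumeric();
        // A change of character class closes the current run.
        if let Some(prev) = prev_is_word {
            if prev != is_word {
                segments.push(&text[start..idx]);
                start = idx;
            }
        }
        prev_is_word = Some(is_word);
    }
    // Push the final run, if the input was not empty.
    if start < text.len() {
        segments.push(&text[start..]);
    }
    segments
}

fn main() {
    let segments = naive_segments("The quick fox, right?");
    println!("{:?}", segments);
}
```

A rule this simple breaks down quickly, for example on scripts that do not separate words with spaces, which is why a dedicated tokenizer relies on specialized pipelines.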

Details

Charabia provides a simple API to segment, normalize, or tokenize (segment + normalize) a text of a specific language by detecting its Script/Language and choosing the specialized pipeline for it.
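The "detect the script, then choose a pipeline" step can be pictured with the following standard-library sketch. Everything in it is an assumption made for illustration: the `Script` enum, the `script_of` and `detect_script` functions, and the hard-coded, approximate Unicode ranges are hypothetical and say nothing about how Charabia actually performs detection.

```rust
/// Hypothetical script categories, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Script {
    Latin,
    Greek,
    Cyrillic,
    Hebrew,
    Arabic,
    Thai,
    Kana,
    Han,
    Hangul,
    Other,
}

/// Classifies one character using approximate Unicode block ranges.
fn script_of(ch: char) -> Script {
    match ch as u32 {
        // ASCII letters plus Latin-1 / Latin Extended letters (skipping × and ÷).
        0x0041..=0x005A | 0x0061..=0x007A => Script::Latin,
        0x00C0..=0x00D6 | 0x00D8..=0x00F6 | 0x00F8..=0x024F => Script::Latin,
        0x0370..=0x03FF => Script::Greek,
        0x0400..=0x04FF => Script::Cyrillic,
        0x0590..=0x05FF => Script::Hebrew,
        0x0600..=0x06FF => Script::Arabic,
        0x0E00..=0x0E7F => Script::Thai,
        0x3040..=0x30FF => Script::Kana, // Hiragana and Katakana
        0x4E00..=0x9FFF => Script::Han,  // CJK Unified Ideographs
        0xAC00..=0xD7A3 => Script::Hangul,
        _ => Script::Other,
    }
}

/// Returns the script of the first character that belongs to a known script,
/// or `Script::Other` if there is none.
fn detect_script(text: &str) -> Script {
    text.chars()
        .map(script_of)
        .find(|script| *script != Script::Other)
        .unwrap_or(Script::Other)
}

fn main() {
    for sample in ["Hello", "Γειά", "שלום", "123 مرحبا", "สวัสดี"] {
        println!("{:?} -> {:?}", sample, detect_script(sample));
    }
}
```

A caller could then `match` on the detected script to pick a segmenter and a normalizer. Looking only at the first recognizable character is a simplification; mixed-script text would need a more careful strategy.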

Supported languages

Charabia is multilingual, featuring optimized support for:

| Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
|---|---|---|---|---|
| Latin | ✅ CamelCase segmentation | ✅ compatibility decomposition + lowercase + nonspacing-marks removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
| Greek | ❌ | ✅ compatibility decomposition + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
| Cyrillic - Georgian | ❌ | ✅ compatibility decomposition + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
| Chinese CMN 🇨🇳 | ✅ jieba | ✅ compatibility decomposition + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
| Hebrew 🇮🇱 | ❌ | ✅ compatibility decomposition + nonspacing-marks removal | 🟩 ~33MiB/sec | 🟨 ~11MiB/sec |
| Arabic | ✅ ال segmentation | ✅ compatibility decomposition + nonspacing-marks removal + [Tatweel, Alef, Yeh, and Taa Marbuta normalization] | 🟩 ~36MiB/sec | 🟨 ~11MiB/sec |
| Japanese 🇯🇵 | ✅ lindera IPA-dict | ❌ compatibility decomposition | 🟧 ~3MiB/sec | 🟧 ~3MiB/sec |
| Korean 🇰🇷 | ✅ lindera KO-dict | ❌ compatibility decomposition | 🟥 ~2MiB/sec | 🟥 ~2MiB/sec |
| Thai 🇹🇭 | ✅ dictionary based | ✅ compatibility decomposition + nonspacing-marks removal | 🟩 ~22MiB/sec | 🟨 ~11MiB/sec |
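The table lists "CamelCase segmentation" for Latin script without spelling out the rule. One plausible reading is sketched below using only the standard library: start a new segment whenever an uppercase letter follows a lowercase one. The function name `split_camel_case` and this exact rule are assumptions for illustration, not Charabia's implementation.

```rust
/// Splits a word at every lowercase-to-uppercase transition.
/// Hypothetical helper; the real segmentation rule may differ.
fn split_camel_case(word: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut prev_is_lower = false;

    for (idx, ch) in word.char_indices() {
        // An uppercase letter right after a lowercase one opens a new part.
        if ch.is_uppercase() && prev_is_lower {
            parts.push(&word[start..idx]);
            start = idx;
        }
        prev_is_lower = ch.is_lowercase();
    }
    // Push the trailing part, if the input was not empty.
    if start < word.len() {
        parts.push(&word[start..]);
    }
    parts
}

fn main() {
    println!("{:?}", split_camel_case("camelCase"));
    println!("{:?}", split_camel_case("openSSL"));
}
```

Under this rule a run of capitals such as `SSL` stays in one piece, because a split only happens when the previous character is lowercase.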

We aim to provide global language support, and your feedback helps us move closer to that goal. If you notice inconsistencies in your search results or the way your documents are processed, please open an issue on our GitHub repository.

If you have a particular need that charabia does not support, please share it in the product repository by creating a dedicated discussion.

About Performance level

Performance levels are based on the throughput (MiB/sec) of the tokenizer, measured on a Scaleway Elastic Metal server (EM-A410X-SSD, CPU: Intel Xeon E5 1650, RAM: 64 GB) using jemalloc:

  • 0️⃣⬛️: 0 -> 1 MiB/sec
  • 1️⃣🟥: 1 -> 3 MiB/sec
  • 2️⃣🟧: 3 -> 8 MiB/sec
  • 3️⃣🟨: 8 -> 20 MiB/sec
  • 4️⃣🟩: 20 -> 50 MiB/sec
  • 5️⃣🟪: 50 MiB/sec or more
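The scale above can be read as a simple lookup from throughput to level. The sketch below encodes it with a hypothetical `performance_level` function. The list does not say which level a boundary value belongs to, so treating each lower bound as inclusive is an assumption.

```rust
/// Maps a throughput in MiB/sec to a performance level from 0 to 5.
/// Assumption: each range includes its lower bound and excludes its upper bound.
/// Callers are expected to pass a finite, non-negative value.
fn performance_level(mib_per_sec: f64) -> u8 {
    if mib_per_sec < 1.0 {
        0
    } else if mib_per_sec < 3.0 {
        1
    } else if mib_per_sec < 8.0 {
        2
    } else if mib_per_sec < 20.0 {
        3
    } else if mib_per_sec < 50.0 {
        4
    } else {
        5
    }
}

fn main() {
    // Sample throughputs in MiB/sec.
    for throughput in [0.5, 2.0, 5.0, 9.0, 23.0, 60.0] {
        println!("{} MiB/sec -> level {}", throughput, performance_level(throughput));
    }
}
```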

Examples

Tokenization

use charabia::Tokenize;

let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// tokenize the text.
let mut tokens = orig.tokenize();

let token = tokens.next().unwrap();
// the lemma of the token is normalized: `Thé` became `the`.
assert_eq!(token.lemma(), "the");
// the token is classified as a word
assert!(token.is_word());

let token = tokens.next().unwrap();
assert_eq!(token.lemma(), " ");
// the token is classified as a separator
assert!(token.is_separator());

Segmentation

use charabia::Segment;

let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";

// segment the text.
let mut segments = orig.segment_str();

assert_eq!(segments.next(), Some("The"));
assert_eq!(segments.next(), Some(" "));
assert_eq!(segments.next(), Some("quick"));
