JamSpell

JamSpell is a spell checking library with the following features:

accurate - it considers the words surrounding each word (context) for better correction
fast - about 5K words per second
multi-language - it's written in C++ and available for many languages through SWIG bindings

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

Improved accuracy (CatBoost gradient-boosted decision trees as the candidate ranking model)
Splits merged words
Pre-trained models for many languages (small, medium, large) for:
en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
Ability to add words / sentences at runtime
Fine-tuning / additional training
Memory optimization for training large models
Static dictionary support
Built-in Java, C#, Ruby support
Windows support

Content

Benchmarks
Usage
- Python
- C++
- Other languages
- HTTP API
Train

Benchmarks

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.25%	1.27%	79.53%	84.10%	0.64%	4854
Norvig	7.62%	5.00%	46.58%	66.51%	0.69%	395
Hunspell	13.10%	10.33%	47.52%	68.56%	7.14%	163
Dummy	13.14%	13.14%	0.00%	0.00%	0.00%	-

The model was trained on 300K Wikipedia sentences plus 300K news sentences (English); 95% were used for training and 5% for evaluation. An error model was used to generate misspelled text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).
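The idea of generating misspelled text from clean text can be sketched as follows. This is only an illustration, not JamSpell's actual error model: the edit operations, the `ALPHABET` constant, and the names `corrupt_word` and `corrupt_text` are all assumptions made for this example.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this sketch


def corrupt_word(word, rng):
    """Apply one random edit (delete, insert, replace, or transpose) to a word."""
    if len(word) < 2:
        return word  # leave one-letter words untouched
    op = rng.choice(["delete", "insert", "replace", "transpose"])
    i = rng.randrange(len(word))
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "replace":
        # pick a letter different from the current one so the word really changes
        choices = [c for c in ALPHABET if c != word[i]]
        return word[:i] + rng.choice(choices) + word[i + 1:]
    # transpose: swap two adjacent letters (a no-op if they are equal)
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def corrupt_text(text, error_rate, seed=0):
    """Corrupt each whitespace-separated word with probability error_rate."""
    rng = random.Random(seed)
    words = text.split()
    return " ".join(
        corrupt_word(w, rng) if rng.random() < error_rate else w for w in words
    )


if __name__ == "__main__":
    print(corrupt_text("I am the best spell checker", error_rate=0.3, seed=1))
```

Using a seeded `random.Random` keeps the generated test set reproducible, so different correctors can be compared on exactly the same misspellings.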

We used the following metrics:

Errors - percent of words with errors after the spell checker has processed the text
Top 7 Errors - percent of words whose correct form is not among the top 7 candidates
Fix Rate - percent of misspelled words fixed by the spell checker
Top 7 Fix Rate - percent of misspelled words fixed by one of the top 7 candidates
Broken - percent of correctly spelled words broken by the spell checker
Speed - number of words processed per second
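As a rough illustration of how Errors, Fix Rate, and Broken relate to each other, here is a minimal sketch that computes them from three aligned word lists. It is not the code in evaluate/evaluate.py; the function name `evaluate_words` and the input layout are assumptions, and the Top 7 metrics are left out because they need candidate lists.

```python
def evaluate_words(original, errored, corrected):
    """Compute Errors, Fix Rate, and Broken (in percent) from aligned word lists.

    original  - the clean words
    errored   - the same words after artificial errors were injected
    corrected - the spell checker's output for the errored words
    """
    assert len(original) == len(errored) == len(corrected)
    wrong_after = had_error = fixed = clean = broken = 0
    for orig, err, fix in zip(original, errored, corrected):
        if fix != orig:
            wrong_after += 1          # still (or newly) wrong after correction
        if err != orig:
            had_error += 1            # the word was misspelled in the input
            if fix == orig:
                fixed += 1
        else:
            clean += 1                # the word was correct in the input
            if fix != orig:
                broken += 1

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "errors": pct(wrong_after, len(original)),
        "fix_rate": pct(fixed, had_error),
        "broken": pct(broken, clean),
    }


if __name__ == "__main__":
    original = ["i", "am", "the", "best", "spell", "checker"]
    errored = ["i", "am", "the", "begt", "spell", "cherken"]
    corrected = ["i", "am", "the", "best", "spell", "chicken"]
    print(evaluate_words(original, errored, corrected))
```

In this toy example one of six words is still wrong after correction (about 16.7% Errors), one of the two misspellings was fixed (50% Fix Rate), and none of the four clean words was broken (0% Broken).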

To make sure the model is not overfitted to Wikipedia + news, we also checked it on the text of "The Adventures of Sherlock Holmes":

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words per second)
JamSpell	3.56%	1.27%	72.03%	79.73%	0.50%	5524
Norvig	7.60%	5.30%	35.43%	56.06%	0.45%	647
Hunspell	9.36%	6.44%	39.61%	65.77%	2.95%	284
Dummy	11.16%	11.16%	0.00%	0.00%	0.00%	-

More details on reproducing these results are available in the "Train" section.

Usage

Python

Install swig3 (usually available from your distro's package manager)
Install jamspell:

pip install jamspell

Download or train a language model
Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)

C++

Add jamspell and contrib dirs to your project
Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    // Correct a whole fragment, taking context into account
    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

Install cmake
Clone and build jamspell (it includes an HTTP server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Download or train a language model
Run the HTTP server:

./web_server/web_server en.bin localhost 8080

GET Request example:

$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

POST Request example:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
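The same GET and POST requests can be prepared from Python with the standard library. This is a sketch under assumptions: the helper names `build_get_url` and `build_post_request` are made up for this example, and it assumes the server decodes percent-encoded query strings, which is not stated above.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"  # host and port from the server command above


def build_get_url(endpoint, text):
    """Build a GET URL with the text percent-encoded in the query string."""
    query = urllib.parse.urlencode({"text": text}, quote_via=urllib.parse.quote)
    return f"{BASE_URL}/{endpoint}?{query}"


def build_post_request(endpoint, text):
    """Build a POST request that carries the raw text as the body, like curl -d."""
    return urllib.request.Request(
        f"{BASE_URL}/{endpoint}", data=text.encode("utf-8")
    )


if __name__ == "__main__":
    text = "I am the begt spell cherken"
    print(build_get_url("fix", text))
    request = build_post_request("fix", text)
    print(request.get_method(), request.full_url)
    # With a running server, the request could be sent like this:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

Encoding the spaces explicitly avoids relying on how a particular HTTP client treats raw spaces in a URL.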

Candidates example:

curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the position of the first letter of the misspelled word, and len is the length of the misspelled word.
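To show how pos_from and len map back onto the input text, here is a small sketch that parses a shortened copy of the response above. The helper names `misspelled_spans` and `apply_top_candidates` are hypothetical, and the sketch assumes the offsets count characters; for non-ASCII text, whether they count characters or bytes is not stated here.

```python
import json

TEXT = "I am the begt spell cherken"

# Shortened copy of the /candidates response shown above
RESPONSE_BODY = """
{
    "results": [
        {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
        {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20}
    ]
}
"""


def misspelled_spans(text, response_body):
    """Return (misspelled word, candidates) pairs cut out of the text."""
    results = json.loads(response_body)["results"]
    return [
        (text[r["pos_from"]:r["pos_from"] + r["len"]], r["candidates"])
        for r in results
    ]


def apply_top_candidates(text, response_body):
    """Replace every misspelled word with its first candidate."""
    results = json.loads(response_body)["results"]
    # Work from right to left so earlier offsets stay valid after each replacement
    for r in sorted(results, key=lambda r: r["pos_from"], reverse=True):
        if r["candidates"]:
            start, end = r["pos_from"], r["pos_from"] + r["len"]
            text = text[:start] + r["candidates"][0] + text[end:]
    return text


if __name__ == "__main__":
    for word, candidates in misspelled_spans(TEXT, RESPONSE_BODY):
        print(word, "->", candidates)
    print(apply_top_candidates(TEXT, RESPONSE_BODY))
```

Replacing from the last span to the first matters because a candidate may differ in length from the word it replaces, which would shift the offsets of any span to its right.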

Train

To train a custom model:

Install cmake
Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
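If you prefer to prepare the data yourself, a 95% / 5% split like the one used for the benchmarks can be sketched in a few lines. This is not what evaluate/generate_dataset.py does internally; the function name `split_sentences` and its parameters are assumptions for illustration.

```python
import random


def split_sentences(sentences, eval_fraction=0.05, seed=42):
    """Shuffle sentences and split them into (train, evaluation) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = max(1, round(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]


if __name__ == "__main__":
    sentences = [f"sentence number {i}" for i in range(100)]
    train, evaluation = split_sentences(sentences)
    print(len(train), len(evaluation))
```

Shuffling before splitting keeps the evaluation part from consisting only of the tail of one source document.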

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.

en.tar.gz (35 MB)
fr.tar.gz (31 MB)
ru.tar.gz (38 MB)
