  • Language
    C
  • License
    Mozilla Public Li...
  • Created about 13 years ago
  • Updated over 3 years ago

Deprecated. See https://github.com/varnamproject/govarnam

Introduction

libvarnam is a cross-platform, self-learning, open source library that supports transliteration and reverse transliteration for Indian languages. At its core is a C shared library providing algorithms and patterns for transliteration. libvarnam has a simple built-in learning module that can learn words to improve the transliteration experience.

Installing libvarnam

wget http://download.savannah.gnu.org/releases/varnamproject/libvarnam/source/libvarnam-$VERSION.tar.gz
tar -xvf libvarnam-$VERSION.tar.gz
cd libvarnam-$VERSION
cmake . && make
sudo make install

This will install the libvarnam shared libraries and the varnamc command line utility. varnamc can be used to quickly try out varnam.

Installation on Windows

On Windows, you can compile libvarnam using Visual Studio. Use the following cmake command to generate the project files.

cmake -DBUILD_TESTS=false -DBUILD_VST=false -DRUN_TESTS=false .

Usage

Transliterate

Usage: varnamc -s lang_code -t word

varnamc -s ml -t varnam
 വർണം
 വർണമേറിയത്

Reverse Transliterate

Usage: varnamc -s lang_code -r word

varnamc -s ml -r വർണം
 varnam

Word corpus

libvarnam is a learning system. It works better with a word corpus. You can obtain the word corpus and make varnam learn all the words. This will enable libvarnam to provide intelligent suggestions.

Here is an example of loading the Malayalam word corpus:

mkdir words
cd words
wget http://download.savannah.gnu.org/releases/varnamproject/words/ml/ml.tar.gz
tar -xvf ml.tar.gz
varnamc -s ml --learn-from .

This will take some time, depending on how many words you are loading.

Here are some more word corpora

There is a --import-learnings-from option to import files that already contain the learned parameters. Importing these files does not take as long as learning from a word corpus.

What next?

If you just want to use varnam for input, you have the following options

If you are a programmer, you will be interested in libvarnam. You can use it to provide Indian language support in your applications. libvarnam can be used from different programming languages.

How Varnam works

  1. Scheme files and symbol tables
  2. Transliteration
  3. Learning

Scheme files and symbol tables

A scheme file maps English letters to their phonetically equivalent Indic letters. All vowels, consonants and consonant clusters are mapped to their Indic equivalents. Varnam uses the scheme file mapping to perform transliteration.

Scheme files are plain text but use a custom DSL to make the mapping easier. This DSL is implemented in Ruby, so a scheme file can contain any valid Ruby code. The DSL also provides many helper functions to make the mapping easier.

The schemes/ directory contains all the scheme files for the supported languages. Each language is represented by its ISO language code.

Symbol tables

The compiled version of a scheme file is called a Varnam Symbol Table (vst). The compilation is done using the varnamc command line utility:

varnamc --compile schemes/ml

Symbol tables are binary representations of the plain text scheme files. They also contain other metadata items to make lookup easier.

libvarnam understands only the symbol table format. Because of this, every scheme file must be compiled into the vst format before it can be used with varnam. To compile all scheme files present in the schemes directory, run:

make vst

Symbol table lookup

Varnam can be initialized with just the ISO language code. When this happens, varnam scans the following directories and tries to find a matching symbol table file. If one is found, it is loaded and used for all operations.

  • "/usr/local/share/varnam/vst"
  • "/usr/share/varnam/vst"
  • "schemes"
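
The directory scan described above can be sketched as follows. This is a minimal illustration, not libvarnam's actual lookup code: the function names are hypothetical, and the file name pattern "<lang_code>.vst" is an assumption. The directories are tried in the order they are listed above, and the first readable match wins.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Search directories from the text, in the order listed. */
static const char *const VST_DIRS[] = {
    "/usr/local/share/varnam/vst",
    "/usr/share/varnam/vst",
    "schemes",
};

/* Assumption: a symbol table file is named "<lang_code>.vst".
   Writes the first existing candidate path into out and returns 1,
   or empties out and returns 0 when no directory contains a match. */
int find_symbol_table(const char *lang_code,
                      const char *const *dirs, size_t dir_count,
                      char *out, size_t out_size)
{
    assert(lang_code != NULL && strlen(lang_code) > 0);
    assert(dirs != NULL && out != NULL && out_size > 0);

    for (size_t i = 0; i < dir_count; i++) {
        int n = snprintf(out, out_size, "%s/%s.vst", dirs[i], lang_code);
        if (n < 0 || (size_t)n >= out_size)
            continue;                /* path did not fit; skip it */
        FILE *f = fopen(out, "rb");
        if (f != NULL) {             /* readable file found: use it */
            fclose(f);
            return 1;
        }
    }
    out[0] = '\0';
    return 0;
}

/* Convenience wrapper (hypothetical) over the default directories. */
int find_default_symbol_table(const char *lang_code,
                              char *out, size_t out_size)
{
    return find_symbol_table(lang_code, VST_DIRS,
                             sizeof VST_DIRS / sizeof VST_DIRS[0],
                             out, out_size);
}
```

A caller would pass a language code such as ml and a buffer for the resulting path, then load the file at that path if the function returns 1.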

Transliteration

varnam_transliterate(varnam *handle, const char *input, varray **output);

This function is the entry point for transliteration. Transliteration converts the input to the phonetically equivalent Indic text. It also provides a set of possible matches for the given input.

Transliteration does the following steps under the hood:

  1. Tokenizes the input. Varnam uses a greedy tokenizer that processes the input from left to right. The tokenizer tries all possible combinations to generate the longest possible tokens for the given input. These tokens are generated using the symbol table provided to varnam.

  2. Assembles the generated tokens and computes all possible combinations of them. Assume the input is malayalam: varnam generates tokens like മ, ല, യാ, ളം ([ma], [la], [ya], [lam]) and many others. Once these tokens are generated, they are combined and tested against the learning model to discard garbage values and surface the most used words. Words are sorted by frequency and returned to the caller.
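
The greedy, left-to-right, longest-match tokenization can be sketched as follows. This is a simplified illustration and not libvarnam's tokenizer: the symbol table here is a hard-coded toy containing only the tokens mentioned above plus two shorter entries, the names are hypothetical, and the sketch keeps a single value per pattern, whereas varnam computes many possibilities.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy symbol table: NOT the real vst contents. */
struct symbol { const char *pattern; const char *value; };

static const struct symbol TOY_SYMBOLS[] = {
    {"m", "മ്"}, {"ma", "മ"},
    {"l", "ല്"}, {"la", "ല"}, {"lam", "ളം"},
    {"ya", "യാ"},
};
#define TOY_COUNT (sizeof TOY_SYMBOLS / sizeof TOY_SYMBOLS[0])
#define MAX_PATTERN_LEN 3   /* longest pattern in the toy table */

/* Returns the entry whose pattern equals input[0..len), or NULL. */
static const struct symbol *lookup(const char *input, size_t len)
{
    for (size_t i = 0; i < TOY_COUNT; i++) {
        if (strlen(TOY_SYMBOLS[i].pattern) == len &&
            strncmp(TOY_SYMBOLS[i].pattern, input, len) == 0)
            return &TOY_SYMBOLS[i];
    }
    return NULL;
}

/* At each position, takes the longest pattern that matches and appends
   its value to out. Returns the number of tokens, or -1 when some
   position has no match or out is too small. */
int tokenize_greedy(const char *input, char *out, size_t out_size)
{
    assert(input != NULL && out != NULL && out_size > 0);
    size_t n = strlen(input), pos = 0, used = 0;
    int tokens = 0;

    out[0] = '\0';
    while (pos < n) {
        const struct symbol *best = NULL;
        size_t len = n - pos < MAX_PATTERN_LEN ? n - pos : MAX_PATTERN_LEN;
        for (; len > 0 && best == NULL; len--)   /* longest first */
            best = lookup(input + pos, len);
        if (best == NULL)
            return -1;
        size_t vlen = strlen(best->value);
        if (used + vlen + 1 > out_size)
            return -1;
        memcpy(out + used, best->value, vlen + 1);
        used += vlen;
        pos += strlen(best->pattern);
        tokens++;
    }
    return tokens;
}
```

With this toy table, the input malayalam is split into ma, la, ya and lam. At the last position lam is preferred over the shorter la because the longest match is tried first.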

Renderer

Most of the processing in varnam is language agnostic and should work fine for all Indian languages. However, language-specific fixes are sometimes required. Varnam handles this using renderers. Any language can register renderers, and varnam invokes them just before rendering the final output. A renderer can contain language-specific rules that cannot be generalized otherwise.
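
The renderer idea can be sketched as a small callback registry. Everything in this sketch is hypothetical: the callback type, the function names and the placeholder rule are invented for illustration and do not reflect libvarnam's internal renderer interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: rewrites text in place before output. */
typedef void (*renderer_fn)(char *text, size_t size);

struct renderer_entry { const char *lang_code; renderer_fn render; };

#define MAX_RENDERERS 16
static struct renderer_entry registry[MAX_RENDERERS];
static size_t registry_count = 0;

/* Registers a renderer for a language.
   Returns 0 on success, -1 when the registry is full. */
int register_renderer(const char *lang_code, renderer_fn render)
{
    assert(lang_code != NULL && render != NULL);
    if (registry_count == MAX_RENDERERS)
        return -1;
    registry[registry_count].lang_code = lang_code;
    registry[registry_count].render = render;
    registry_count++;
    return 0;
}

/* Runs every renderer registered for lang_code, in registration
   order. Returns the number of renderers invoked. */
int apply_renderers(const char *lang_code, char *text, size_t size)
{
    assert(lang_code != NULL && text != NULL);
    int invoked = 0;
    for (size_t i = 0; i < registry_count; i++) {
        if (strcmp(registry[i].lang_code, lang_code) == 0) {
            registry[i].render(text, size);
            invoked++;
        }
    }
    return invoked;
}

/* Placeholder rule for illustration only: replaces '|' with '.'. */
void example_renderer(char *text, size_t size)
{
    for (size_t i = 0; i < size && text[i] != '\0'; i++)
        if (text[i] == '|')
            text[i] = '.';
}
```

The generic pipeline stays unaware of any language rule; it only calls apply_renderers with the language code as the last step before returning output.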

Learning

varnam_learn(varnam *handle, const char *word);

Varnam can learn new words. The more words it learns, the better it performs. The learning process learns each word and its patterns.

Learning process persists the following data:

  1. Patterns: All English combinations which can be used to input the given Indic text
  2. Words: The Indic text itself
  3. Prefixes: Prefixes of patterns and words

When an Indic word is learned, varnam tokenizes the word using the symbol table and tries to learn all possible patterns that can be used to input the word. Internally, varnam keeps a prefix tree and frequencies of all patterns. This storage structure allows varnam to retrieve matching words efficiently when a pattern is presented. Basic stemming is also performed while learning words.
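
A prefix tree with per-pattern frequencies can be sketched as follows. This is an in-memory illustration under simplifying assumptions, not libvarnam's actual storage: keys are limited to the letters a to z, and all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One node per character; restricted to 'a'..'z' for brevity. */
struct trie_node {
    struct trie_node *child[26];
    unsigned frequency;   /* times a pattern ending here was learned */
};

struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Follows key from node, creating missing nodes when create is set.
   Returns the final node, or NULL if the path is absent or invalid. */
static struct trie_node *trie_walk(struct trie_node *node,
                                   const char *key, int create)
{
    for (; node != NULL && *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return NULL;              /* outside the toy alphabet */
        int i = *key - 'a';
        if (node->child[i] == NULL && create)
            node->child[i] = trie_new();
        node = node->child[i];
    }
    return node;
}

/* Learns a pattern by bumping its frequency. Returns 0 on success. */
int trie_learn(struct trie_node *root, const char *pattern)
{
    assert(root != NULL && pattern != NULL);
    struct trie_node *n = trie_walk(root, pattern, 1);
    if (n == NULL)
        return -1;
    n->frequency++;
    return 0;
}

unsigned trie_frequency(struct trie_node *root, const char *pattern)
{
    struct trie_node *n = trie_walk(root, pattern, 0);
    return n != NULL ? n->frequency : 0;
}

/* Counts learned patterns in the subtree rooted at node. */
static unsigned trie_count(const struct trie_node *node)
{
    if (node == NULL)
        return 0;
    unsigned total = node->frequency > 0 ? 1 : 0;
    for (int i = 0; i < 26; i++)
        total += trie_count(node->child[i]);
    return total;
}

/* Number of distinct learned patterns that start with prefix. */
unsigned trie_count_prefix(struct trie_node *root, const char *prefix)
{
    return trie_count(trie_walk(root, prefix, 0));
}

void trie_free(struct trie_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < 26; i++)
        trie_free(node->child[i]);
    free(node);
}
```

Because every pattern sharing a prefix lives under the same node, finding all candidates for a partially typed pattern only requires walking the prefix once and then visiting that subtree.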

When the same word/pattern combination is learned again, varnam computes the frequency with which it has seen this pattern. This frequency is used to sort the candidates and pick the best one while performing transliteration.
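
Sorting candidates by frequency can be sketched with the standard qsort function. The struct and function names are hypothetical, and breaking ties by comparing the words is an assumption made only to keep the order deterministic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A candidate word together with how often it has been learned. */
struct candidate { const char *word; unsigned frequency; };

/* qsort comparator: higher frequency first; ties ordered by word. */
static int by_frequency_desc(const void *a, const void *b)
{
    const struct candidate *x = a, *y = b;
    if (x->frequency != y->frequency)
        return x->frequency < y->frequency ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sorts items in place so the most frequent candidate comes first. */
void rank_candidates(struct candidate *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1)
        qsort(items, count, sizeof items[0], by_frequency_desc);
}
```

After ranking, the first element is the best candidate and the remaining elements serve as alternative suggestions.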

Learning can be initiated by calling Varnam APIs directly or using varnamc.

Input tools like ibus-engine will automatically learn the words that you are typing.

Learned data is kept in one of the following locations:

  • APPDATA\varnam\suggestions (Windows)
  • XDG_DATA_HOME/varnam/suggestions
  • HOME/.local/share/varnam/suggestions
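
Choosing among these locations can be sketched as follows. The precedence used here (APPDATA first, then XDG_DATA_HOME, then HOME) is an assumption based on the order of the list, and the function name is hypothetical. The values are passed in as parameters so the logic does not depend on the process environment.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* True when an environment value is present and non-empty. */
static int is_set(const char *v)
{
    return v != NULL && strlen(v) > 0;
}

/* Assumed precedence: APPDATA, then XDG_DATA_HOME, then HOME.
   Fills out and returns 1, or empties out and returns 0 when no
   value is usable or the path does not fit. */
int suggestions_dir(const char *appdata, const char *xdg_data_home,
                    const char *home, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    int n = -1;

    if (is_set(appdata))
        n = snprintf(out, out_size, "%s\\varnam\\suggestions", appdata);
    else if (is_set(xdg_data_home))
        n = snprintf(out, out_size, "%s/varnam/suggestions",
                     xdg_data_home);
    else if (is_set(home))
        n = snprintf(out, out_size, "%s/.local/share/varnam/suggestions",
                     home);

    if (n < 0 || (size_t)n >= out_size) {
        out[0] = '\0';
        return 0;
    }
    return 1;
}
```

In a real program the three arguments would typically come from getenv("APPDATA"), getenv("XDG_DATA_HOME") and getenv("HOME"), with getenv declared in stdlib.h.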

Mozilla Public License

Copyright (c) 2016 Navaneeth.K.N

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.
