turkish-deasciifier: Turkish deasciifier
This is a deasciifier Python library and command line utility for Turkish that solves the problem of diacritics restoration (also known as diacritics reconstruction). It takes a Turkish string containing only ASCII characters (that is, without proper diacritics) and replaces the relevant characters with their corresponding Turkish letters.
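To make the problem concrete, the sketch below goes in the opposite direction: it strips Turkish diacritics with a plain character table. It is not part of this library; `asciify` and `TURKISH_TO_ASCII` are illustrative names, and the Turkish spelling of the sample sentence is assumed to be the intended one. Because distinct words collapse to the same ASCII form, restoring them cannot be a simple reverse lookup.

```python
# Illustrative only: map each Turkish-specific letter to its ASCII look-alike.
# These names are hypothetical and do not come from the turkish-deasciifier package.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(text: str) -> str:
    """Return text with Turkish-specific letters replaced by ASCII letters."""
    return text.translate(TURKISH_TO_ASCII)


# The sample sentence used throughout this README, in its assumed Turkish spelling.
print(asciify("Öpüşmeği çağrıştıran çatırtılar."))  # Opusmegi cagristiran catirtilar.

# Two different words ("acı" and "açı") share one ASCII form,
# so a deasciifier has to use context to choose between them.
print(asciify("acı"), asciify("açı"))  # aci aci
```

The one-way table is trivial; the hard part, which this library addresses, is choosing the right letter on the way back.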
The web-based, online version of this system is available at:
Keep in mind that diacritics restoration (deasciification) for Turkish doesn't work 100% of the time; it is an active research topic! Still, this library is good enough for many practical purposes, and it has served many people and projects over the last 10 years.
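If you want a rough idea of how well restoration works on your own text, one option is to compare a known-correct Turkish string with the deasciified output, counting only the positions where a diacritic choice exists. The following is a minimal sketch under that assumption; `restoration_accuracy` and `AMBIGUOUS` are hypothetical names, not part of this library, and both strings are assumed to be the same length and in the same Unicode normalization form.

```python
# Letters that either carry a Turkish diacritic or are the ASCII form of one.
# Hypothetical helper, not part of the turkish-deasciifier package.
AMBIGUOUS = set("cgiosuCGIOSU" "çğıöşüÇĞİÖŞÜ")


def restoration_accuracy(reference: str, restored: str) -> float:
    """Fraction of ambiguous positions in `reference` that `restored` got right."""
    if len(reference) != len(restored):
        raise ValueError("reference and restored text must have the same length")
    # Keep only positions where the reference letter has an ASCII/Turkish variant.
    pairs = [(r, o) for r, o in zip(reference, restored) if r in AMBIGUOUS]
    if not pairs:
        return 1.0  # nothing could have gone wrong
    return sum(r == o for r, o in pairs) / len(pairs)


# "acı" wrongly restored as "açı": 'c' is wrong, 'ı' is right.
print(restoration_accuracy("acı", "açı"))  # 0.5
```

Counting only ambiguous positions avoids inflating the score with letters that can never be wrong, such as "a" or "t".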
This system is based on the turkish-mode for GNU Emacs by Prof. Deniz Yüret.
Table of Contents
- Installation
- Example Python Library Usage
- Example CLI (Command Line Interface) Usage
- Other Programming Languages and Systems
- Advanced Research
Installation
Python 3
For now, the recommended way to install is to use pip and install directly from the project's GitHub repository:
pip install git+https://github.com/emres/turkish-deasciifier.git
Python 2
Keep in mind that switching to Python 3 is strongly recommended! If you insist on using Python 2.x, you can install using the following command:
pip install Turkish-Deasciifier
Example Python Library Usage
Python 3
from turkish.deasciifier import Deasciifier
my_ascii_turkish_txt = "Opusmegi cagristiran catirtilar."
deasciifier = Deasciifier(my_ascii_turkish_txt)
my_deasciified_turkish_txt = deasciifier.convert_to_turkish()
print(my_deasciified_turkish_txt)
Python 2
Keep in mind that switching to Python 3 is strongly recommended! If you insist on using Python 2.x, you can use the library in the following manner:
from turkish.deasciifier import Deasciifier
my_ascii_turkish_txt = "Opusmegi cagristiran catirtilar."
deasciifier = Deasciifier(my_ascii_turkish_txt.decode("utf-8"))
my_deasciified_turkish_txt = deasciifier.convert_to_turkish()
print my_deasciified_turkish_txt.encode("utf-8")
Example CLI (Command Line Interface) Usage
Python 3
Example tested in a Bash shell:
$ echo "Opusmegi cagristiran catirtilar." | turkish-deasciify
$ cat somefile.txt | turkish-deasciify
Python 2
Keep in mind that switching to Python 3 is strongly recommended!
Example tested in a Bash shell:
$ echo "Opusmegi cagristiran catirtilar." | turkish-deasciify-python2
$ cat somefile.txt | turkish-deasciify-python2
Other Programming Languages and Systems
- Java: https://github.com/ahmetb/turkish-deasciifier-java
- Perl: https://metacpan.org/pod/release/BURAK/Lingua-TR-ASCII-0.13/lib/Lingua/TR/ASCII.pm
- Haskell: http://hackage.haskell.org/package/turkish-deasciifier
- Node.js: https://github.com/f/deasciifier/
- VIM: https://github.com/joom/turkish-deasciifier.vim
- Emacs Lisp: https://github.com/emres/turkish-mode (also available as a package in MELPA)
Advanced Research
For recent research articles on diacritics restoration, see the following:
- Diacritic Restoration Using Recurrent Neural Network
- Diacritics Restoration Using Neural Networks
- Diacritic restoration of Turkish tweets with word2vec
- Vowel and Diacritic Restoration for Social Media Texts
  - Paper: https://www.aclweb.org/anthology/W14-1307/
  - Full text (PDF): https://www.aclweb.org/anthology/W14-1307.pdf
  - Web demo: http://tools.nlp.itu.edu.tr/Deasciifier