
PyThaiNLP: Thai Natural Language Processing in Python

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK, with a focus on the Thai language.

For details in Thai, see README_TH.MD.

News

You can now contact the PyThaiNLP team or ask questions in the chat on Matrix.

Version | Description | Status
4.0 | Stable | Change Log
dev | Release Candidate for 4.1 | Change Log

Getting Started

Capabilities

PyThaiNLP provides standard NLP functions for Thai, such as part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available through a command-line interface.

List of Features
  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords) -- comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai linguistic unit segmentation/tokenization, including sentence (sent_tokenize), word (word_tokenize), and subword segmentations based on Thai Character Cluster (subword_tokenize)
  • Thai part-of-speech tagging (pos_tag)
  • Thai spelling suggestion and correction (spell and correct)
  • Thai transliteration (transliterate)
  • Thai soundex (soundex) with three engines (lk82, udom83, metasound)
  • Thai collation (sort by dictionary order) (collate)
  • Read out number to Thai words (bahttext, num_to_thaiword)
  • Thai datetime formatting (thai_strftime)
  • Fix for text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
  • Command-line interface for basic functions, like tokenization and POS tagging (run thainlp in your shell)
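
The character-class constants in the list above are plain strings, so they can be used the same way as string.digits. The sketch below is only an illustration: thai_digits_to_arabic is a hypothetical helper written for this example, and the fallback branch is an assumption added so the snippet also runs when PyThaiNLP is not installed.

```python
try:
    # The constant named in the feature list above.
    from pythainlp import thai_digits
except ImportError:
    # Fallback (assumption for this sketch): the ten Thai digit
    # characters, U+0E50 through U+0E59, in ascending order.
    thai_digits = "".join(chr(cp) for cp in range(0x0E50, 0x0E5A))


def thai_digits_to_arabic(text: str) -> str:
    """Replace each Thai digit in `text` with its Arabic counterpart.

    Hypothetical helper for illustration only.
    """
    table = str.maketrans(thai_digits, "0123456789")
    return text.translate(table)


print(thai_digits_to_arabic("๑๒๓ บาท"))  # prints "123 บาท"
```

With the package installed, the segmentation functions are imported in a similar way, for example from pythainlp.tokenize import word_tokenize.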

Installation

pip install --upgrade pythainlp

This will install the latest stable release of PyThaiNLP.

Install different releases:

  • Stable release: pip install --upgrade pythainlp
  • Pre-release (near ready): pip install --upgrade --pre pythainlp
  • Development (likely to break things): pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installation Options

Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of [name] immediately after pythainlp:

pip install pythainlp[extra1,extra2,...]
List of possible `extras`
  • full (install everything)
  • attacut (to support attacut, a fast and accurate tokenizer)
  • benchmarks (for word tokenization benchmarking)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support ULMFiT models for classification)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • wordnet (for Thai WordNet API)

For dependency details, see the extras variable in setup.py.
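
Because extras are optional, code that relies on one may want to check that the corresponding module is importable before using it. The following is a minimal sketch using only the standard library. missing_modules is a hypothetical helper, and the assumption that the attacut extra provides a top-level module named attacut is not confirmed by this README.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be imported.

    Hypothetical helper; pass top-level names only (no dots).
    """
    return [name for name in module_names if find_spec(name) is None]


# Assumption: the `attacut` extra installs a module named "attacut".
missing = missing_modules(["attacut"])
if missing:
    print("Missing optional modules:", ", ".join(missing))
```

If anything is reported missing, installing the matching extra with pip install pythainlp[name] should provide it.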

Data directory

  • Some additional data, like word lists and language models, may be downloaded automatically at runtime.
  • PyThaiNLP caches this data under the directory ~/pythainlp-data by default.
  • The data directory can be changed by setting the environment variable PYTHAINLP_DATA_DIR.
  • See the data catalog (db.json) at https://github.com/PyThaiNLP/pythainlp-corpus
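
The lookup rule described above (environment variable first, then the default directory) can be sketched as follows. This is an illustration of the rule only, not PyThaiNLP's own implementation; resolve_data_dir is a hypothetical name.

```python
import os
from pathlib import Path


def resolve_data_dir(environ=os.environ) -> Path:
    """Return the data directory implied by the rule above.

    Hypothetical helper: uses PYTHAINLP_DATA_DIR when it is set and
    non-empty, otherwise falls back to ~/pythainlp-data.
    """
    custom = environ.get("PYTHAINLP_DATA_DIR")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / "pythainlp-data"


print(resolve_data_dir())
```

Passing a mapping instead of reading os.environ directly keeps the helper easy to try out, for example resolve_data_dir({"PYTHAINLP_DATA_DIR": "/tmp/thai-data"}).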

Command-Line Interface

Some PyThaiNLP functions can be used from the command line with the thainlp command.

For example, displaying a catalog of datasets:

thainlp data catalog

Showing usage help:

thainlp help
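
The same commands can also be driven from a script. The sketch below assumes only that the thainlp executable is on PATH when PyThaiNLP is installed; run_thainlp is a hypothetical wrapper, and it returns None instead of failing when the command is not found.

```python
import shutil
import subprocess


def run_thainlp(*args):
    """Run `thainlp` with the given arguments and return its stdout.

    Hypothetical wrapper: returns None if the command is not on PATH.
    """
    exe = shutil.which("thainlp")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, *args], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = run_thainlp("help")
    print(output if output is not None else "thainlp command not found")
```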

Licenses

Component | License
PyThaiNLP Source Code and Notebooks | Apache Software License 2.0
Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0)
Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by)
Other corpora and models that may be included with PyThaiNLP | See Corpus License

Contribute to PyThaiNLP

  • Please do fork and create a pull request :)
  • For style guide and other information, including references to algorithms we use, please refer to our contributing page.

Who uses PyThaiNLP?

You can read INTHEWILD.md.

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or BibTeX entry:

@misc{pythainlp,
    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3519354}
}

Sponsors

Sponsor | Description
VISTEC-depa Thailand Artificial Intelligence Research Institute | Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.
MacStadium | MacStadium provides us with a free Mac Mini M1 for running CI builds.

Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭

We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp
Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.

More Repositories

  • lexicon-thai (Python, 133 stars): Thai lexicon
  • WangChanGLM (Jupyter Notebook, 91 stars): WangChanGLM 🐘 - The Multilingual Instruction-Following Model
  • wisesight-sentiment (Jupyter Notebook, 74 stars): Thai Social Media Sentiment Dataset
  • attacut (Python, 74 stars): A Fast and Accurate Neural Thai Word Segmenter
  • pythaiasr (Python, 47 stars): Python Thai Automatic Speech Recognition
  • tts-thai (Scheme, 34 stars): Thai TTS
  • classification-benchmarks (34 stars): Thai text classification benchmarks
  • nlpo3 (Rust, 30 stars): Thai Natural Language Processing library in Rust, with Python and Node bindings.
  • spelling-check (Jupyter Notebook, 27 stars): Thai Spelling Check
  • PyThaiTTS (Jupyter Notebook, 27 stars): Open Source Thai Text-to-speech library in Python
  • nlpforthai.com (23 stars): NLP For Thai
  • thaigov-corpus (Python, 18 stars): Project to collect news from Thai government websites
  • prachathai-67k (Jupyter Notebook, 16 stars): News Article Corpus from Prachathai.com
  • spaCy-th (Python, 14 stars): Thai in spaCy
  • thai-sentiment-analysis (Python, 13 stars): Thai sentiment analysis
  • thai-law (Jupyter Notebook, 13 stars): Thai Law Dataset (Act of Parliament)
  • spaCy-PyThaiNLP (Jupyter Notebook, 12 stars): PyThaiNLP For spaCy
  • thai-synonym (Python, 12 stars): The synonym for thai (open source & open data)
  • padthai (Jupyter Notebook, 12 stars): Make Pad Thai From few-shot learning 😉
  • docker-thai-tokenizers (Python, 11 stars)
  • thaigov-v2-corpus (Jupyter Notebook, 11 stars): Thai News Dataset from Thai government website.
  • large-thaiword2vec (Jupyter Notebook, 10 stars): The large thai word2vec
  • Han-solo (Jupyter Notebook, 9 stars): 🪿 Han-solo: Thai syllable segmenter
  • tokenization-benchmark (Python, 9 stars)
  • MultiEL (Python, 8 stars): Multilingual Entity Linking model by BELA model
  • thai-g2p-wiktionary-corpus (Jupyter Notebook, 7 stars): Thai Grapheme to Phoneme (G2P) Wiktionary Corpus
  • thai-sentiment-analysis-dataset (7 stars): Thai sentiment analysis dataset
  • mudyom (Python, 7 stars)
  • thaixtransformers (Jupyter Notebook, 7 stars): Use Pretraining RoBERTa based Thai language models from VISTEC-depa AI Research Institute of Thailand.
  • LEKCut (Jupyter Notebook, 7 stars): LEKCut (เล็ก คัด) is a Thai tokenization library that ports the deep learning model to the onnx model.
  • Thai-constitution-corpus (6 stars): Thai Constitution Corpus
  • pythainlp-corpus (HTML, 6 stars): pythainlp-data
  • pythainlp_notebook (Jupyter Notebook, 5 stars): Notebook for PyThaiNLP
  • th-cv-sentences (Jupyter Notebook, 5 stars): Thai Common Voice sentences
  • thainlp (Python, 5 stars): Simple API for PyThaiNLP
  • thai-covid-19-situation (Python, 5 stars): Thai covid-19 situation text file from Ministry of Public Health, Thailand
  • pythainlp-api (Python, 5 stars): Web API for PyThaiNLP
  • thai-named-entity-recognition-data (5 stars): Data for Thai NER
  • tutorials (Jupyter Notebook, 5 stars): The repository contains tutorial notebooks for the official documentation website.
  • thaitts-onnx (Jupyter Notebook, 5 stars): Thai Text-to-speech by ONNX runtime
  • corpus-komped-poem-windy-part (4 stars)
  • Thai-Text-Generator (Python, 4 stars): Thai Text Generator
  • Thai-Data-Privacy (Python, 4 stars): ThaiDP = Thai Data Privacy Tool For Python
  • Thai-Lao-Parallel-Corpus (4 stars): Thai Lao Parallel corpus
  • thaibraille (Python, 3 stars): Thai braille
  • thaimaimeex (Jupyter Notebook, 3 stars): Predict budget from project names of ThaiME
  • thai_spacy_model (Jupyter Notebook, 3 stars): Thai language model for spaCy
  • thai-syllables-cut (Python, 3 stars): Thai syllables segmentation
  • han-coref (Jupyter Notebook, 3 stars): 🪿 Han-Coref: Thai Coreference resolution by PyThaiNLP
  • MaYom (Python, 2 stars): MaYom (มะยม) - The NEXT PyThaiNLP X
  • pylexto (Java, 2 stars): LexTo with Python 2 & 3 Wrapper
  • thainlp-docker (Dockerfile, 2 stars): All Thai NLP docker
  • Open-Assistant-Thailand (2 stars): Conversational AI for everyone
  • thai-named-entity-recognition (Python, 2 stars): Thai Named Entity Recognition
  • PyThaiGPT (Python, 2 stars)
  • Thai-NER (2 stars): Thai Named Entity Recognition Corpus & Model
  • demo (Python, 2 stars): PyThaiNLP Demo
  • pythainlp.github.io-old (HTML, 2 stars): Thai Natural Language Processing in Python.
  • G2P (Python, 1 star)
  • Thai-Common-Voice (1 star): Thai Common Voice project
  • Thai-sentence (1 star)
  • explore_text (Jupyter Notebook, 1 star): Data exploration and utility functions for Thai texts
  • thaiqa_squad (Jupyter Notebook, 1 star): SQuAD version of thaiqa (https://aiforthai.in.th/corpus.php)
  • pythainlp-webdemo (HTML, 1 star)
  • ThaiWiki-clean (Jupyter Notebook, 1 star): Thai Wikipedia Database dumps to plain text for NLP work