A collection of Vietnamese Natural Language Processing resources.

Vietnamese Natural Language Processing Resources

Create a pull request or open an issue to add your work to this list.

Large Language Models

  • GemSUra: Pretrained Large Language Models based on Gemma built by URA (HCMUT).
  • Ghost-7b: This model is fine-tuned from HuggingFaceH4/zephyr-7b-beta on a small synthetic dataset (about 200MB) that is 50% English and 50% Vietnamese.
  • PhoGPT: An open-source, state-of-the-art 7.5B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant PhoGPT-7B5-Instruct.
  • Sailor: Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
  • SeaLLM: The state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭.
  • ToRoLaMa: The Vietnamese Instruction-Following and Chat Model.
  • Vistral-7B-Chat-function-calling: This model was fine-tuned on Vistral-7B-chat for function calling.
  • Vistral-7B-Chat: Towards a State-of-the-Art Large Language Model for Vietnamese
  • ViGPTQA: LLMs for Vietnamese Question Answering
  • VBD-LLaMA2-Chat: A Conversationally-tuned LLaMA2 for Vietnamese.
  • Vietnamese LLaMA 2: A 7B version of LLaMA 2 with 140GB of Vietnamese text by BKAI Foundation Models Lab.
  • VinaLLaMA: Another collection of LLaMA-tuned models for Vietnamese.
  • Vietcuna: A series of Vicuna tuned models for Vietnamese.
  • Llama2_vietnamese: A fine-tuned Large Language Model (LLM) for the Vietnamese language based on the Llama 2 model.
  • Vietnamese_LLMs: This project aims to create high-quality Vietnamese instruction datasets and tune several open-source large language models (LLMs). So far, it has released various models, including LLaMA and BLOOMZ, as well as five instruction datasets, most of which were generated by GPT-4.

Corpus

For more recent datasets, search HuggingFace for datasets that include Vietnamese: https://huggingface.co/datasets?language=language:vi&sort=trending
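
The filtered search URL above can also be built programmatically, for example to search for another language code. A minimal sketch using only Python's standard library; the helper name `hf_dataset_search_url` is hypothetical and is not part of any HuggingFace API.

```python
from urllib.parse import urlencode

# Base address of the dataset search page referenced above.
BASE_URL = "https://huggingface.co/datasets"


def hf_dataset_search_url(language_code: str, sort: str = "trending") -> str:
    """Build a dataset search URL filtered by language (hypothetical helper)."""
    params = {"language": f"language:{language_code}", "sort": sort}
    # Keep ':' unescaped so the result matches the URL format shown above.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    print(hf_dataset_search_url("vi"))
```

Calling `hf_dataset_search_url("vi")` reproduces the link given above.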

  • VN News Corpus: 50GB of uncompressed texts crawled from a wide range of news websites and topics.
  • 10000 Vietnamese Books: 10000 Vietnamese Books from 195x.
  • CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
  • Bactrian-X: The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages.
  • OSCAR: 68GB of text data with 12,036,845,359 words.
  • Common Crawl: Open repository of web crawl data.
  • WikiDumps: You can download directly or use scripts from viwik18, viwik19.
  • Vietnamese Treebank: VLSP Project.
  • Vietnamese Stopwords: Vietnamese stopwords.
  • Vietnamese Dictionary: Vietnamese dictionary.
  • vietnamese-wordnet: Vietnamese wordnet.
  • VietnameseWAC: The dataset comprises a substantial collection of Vietnamese text, consisting of 129,781,089 tokens and 106,464,835 words, which have been automatically segmented and labeled as per Kilgarriff, A., and Le-Hong, P., 2012.
  • Vietlex Corpus: Vietlex's Vietnamese Corpus, a pioneering effort in Vietnam since 1998, contains about 80 million syllables from various sources.
  • Lexical Database of Vietnamese: A lexical database of Vietnamese contains various lexical information derived from two Vietnamese corpora.

Text Processing Toolkit

  • coccoc-tokenizer: High performance tokenizer for Vietnamese language. It is written in C++ with Python and Java bindings.
  • RDRSegmenter: Fast and accurate Vietnamese word segmenter (LREC 2018).
  • RDRPOSTagger: Fast and accurate POS and morphological tagging toolkit (EACL 2014).
  • VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018).
  • vlp-tok: Vietnamese text processing library developed in the Scala programming language.
  • ETNLP: A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
  • VietnameseTextNormalizer: Vietnamese Text Normalizer.
  • nnvlp: Neural network-based Vietnamese language processing toolkit.
  • jPTDP: Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
  • vi_spacy: Vietnamese language model compatible with spaCy.
  • underthesea: Underthesea - Vietnamese NLP toolkit.
  • vnlp: GATE plugin for Vietnamese language processing.
  • pyvi: Python Vietnamese toolkit.
  • JVnTextPro: Java-based Vietnamese text processing tool.
  • DongDu: C++ implementation of Vietnamese word segmentation tool.
  • VLSP Toolkit: Vietnamese tokenizer from VLSP.
  • vTools: Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
  • JNSP: Java Implementation of Ngram Statistic Package.

Pre-trained Language Model

  • RoBERTa Vietnamese: Pre-trained embedding using RoBERTa architecture on Vietnamese corpus.
  • PhoBERT: Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
  • ALBERT for Vietnamese: "A Lite" version of BERT for Vietnamese.
  • Vietnamese ELECTRA: ELECTRA model pre-trained on a Vietnamese corpus.
  • word2vecVN: Pre-trained Word2Vec models for Vietnamese.

Sentiment Analysis

Benchmark

  • VLSP 2016 Shared Task: Sentiment Analysis

    • Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).

    • Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

      | Model | F1 | Paper | Code |
      | --- | --- | --- | --- |
      | Perceptron/SVM/Maxent | 80.05 | DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews | |
      | SVM/MLNN/LSTM | 71.44 | A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016 | |
      | Ensemble: Random forest, SVM, Naive Bayes | 71.22 | A Lightweight Ensemble Method for Sentiment Classification Task | |
      | Ensemble: SVM, LR, LSTM, CNN | 69.71 | An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis | |
      | SVM | 67.54 | Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments | |
      | SVM/MLNN | 67.23 | A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign | |
      | Multi-channel LSTM-CNN | 59.61 | Multi-channel LSTM-CNN model for Vietnamese sentiment analysis | official |
  • VLSP 2018 Shared Task: Aspect Based Sentiment Analysis

    • Restaurant Dataset: 2961 reviews (train), 1290 reviews (development), 500 reviews (test).

      | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
      | --- | --- | --- | --- | --- |
      | CNN | 0.80 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
      | SVM | 0.77 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
      | SVM | 0.54 | 0.48 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |
    • Hotel Dataset: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

      | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
      | --- | --- | --- | --- | --- |
      | SVM | 0.70 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
      | CNN | 0.69 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
      | SVM | 0.56 | 0.53 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |
  • Vietnamese Student's Feedback Corpus (UIT-VSFC)

    • UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

      | Model | Sentiment (F1) | Topic (F1) | Paper | Code |
      | --- | --- | --- | --- | --- |
      | Bi-LSTM/Word2Vec | 0.896 | 0.92 | Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus | |
      | Maximum Entropy Classifier | 0.88 | 0.84 | UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis | |
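
The sentiment benchmarks above report F1 scores. As a reference for how such a score can be computed on a three-class task (positive, neutral, negative), here is a minimal sketch of macro-averaged F1 in plain Python. Whether each benchmark uses macro, micro, or weighted averaging is not stated in this list, so the averaging choice is an assumption, as are the function name `macro_f1` and the label strings.

```python
def macro_f1(gold: list[str], pred: list[str], labels: list[str]) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives per class.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    labels = ["positive", "neutral", "negative"]
    gold = ["positive", "positive", "neutral", "negative"]
    pred = ["positive", "neutral", "neutral", "negative"]
    print(round(macro_f1(gold, pred, labels), 4))
```

Macro averaging weights every class equally, which suits balanced test sets such as the VLSP 2016 one (350 sentences per class).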

Named Entity Recognition

Benchmark

  • VLSP 2016 Shared Task: Named Entity Recognition

    | Model | F1 | Paper | Code |
    | --- | --- | --- | --- |
    | PhoBERT_large | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | official |
    | vELECTRA + BiLSTM + Attention | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | PhoBERT_base | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | official |
    | XLM-R | 92.0 | PhoBERT: Pre-trained language models for Vietnamese | |
    | VnCoreNLP-NER + ETNLP | 91.3 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | BiLSTM-CNN-CRF + ETNLP | 91.1 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | VNER: Attentive Neural Network | 89.6 | Attentive Neural Network for Named Entity Recognition in Vietnamese | |
    | BiLSTM-CNN-CRF | 88.3 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | official |
    | LSTM + CRF | 66.07 | An investigation of Vietnamese Nested Entity Recognition Models | |
  • VLSP 2018 Shared Task: Named Entity Recognition

    | Model | F1 | Paper | Code |
    | --- | --- | --- | --- |
    | vELECTRA + BiGRU | 90.31 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | VIETNER: CRF (ngrams + word shapes + cluster + w2v) | 76.63 | A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign | |
    | ZA-NER | 74.70 | ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign | |
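
NER results such as those above are usually scored at the entity level: a predicted entity counts only if its type and both boundaries match a gold entity exactly. The sketch below shows that computation for flat BIO-tagged sequences. The exact evaluation scripts used by the VLSP campaigns are not described in this list, so treat the matching rule, the function names `extract_spans` and `entity_f1`, and the tag names as assumptions.

```python
def extract_spans(tags: list[str]) -> set[tuple[str, int, int]]:
    """Collect (type, start, end) spans from a BIO tag sequence; end is exclusive."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((etype, start, i))
            if tag.startswith(("B-", "I-")):
                # Lenient choice: a stray I- tag opens a new span.
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return spans


def entity_f1(gold: list[str], pred: list[str]) -> float:
    """Exact-match entity-level F1 between two tag sequences."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    pred = ["B-PER", "I-PER", "O", "O"]
    print(round(entity_f1(gold, pred), 4))
```

This handles one flat tag sequence at a time. Nested entities, which some of the papers above address, need a span representation that allows overlaps.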

Speech Processing

Corpus:

Project

  • vietTTS: Tacotron + HiFiGAN vocoder for Vietnamese datasets.
