• Stars
    star
    137
  • Rank 266,121 (Top 6 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created about 4 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

PhoNLP: A BERT-based multi-task learning model for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)

logo

PhoNLP: A BERT-based multi-task learning model for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

Although we evaluate PhoNLP on Vietnamese, our usage examples below can directly work for other languages that have gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing, and a pre-trained BERT-based language model available from transformers (e.g. BERT, mBERT, RoBERTa, XLM-RoBERTa).

logo

Details of the PhoNLP model architecture and experimental results can be found in our following paper:

@inproceedings{phonlp,
title     = {{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}},
author    = {Linh The Nguyen and Dat Quoc Nguyen},
booktitle = {Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations},
pages     = {1--7},
year      = {2021}
}

Please CITE our paper when PhoNLP is used to help produce published results or incorporated into other software.

Installation

  • Python version >= 3.6; PyTorch version >= 1.4.0
  • PhoNLP can be installed using pip as follows: pip3 install phonlp
  • Or PhoNLP can also be installed from source with the following commands:
     git clone https://github.com/VinAIResearch/PhoNLP
     cd PhoNLP
     pip3 install -e .
    

Usage example: Command lines

To play with the examples using command lines, please install phonlp from the source:

git clone https://github.com/VinAIResearch/PhoNLP
cd PhoNLP
pip3 install -e . 

Training

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir <model_folder_path> \
	--pretrained_lm <transformers_pretrained_model> \
	--lr <float_value> --batch_size <int_value> --num_epoch <int_value> \
	--lambda_pos <float_value> --lambda_ner <float_value> --lambda_dep <float_value> \
	--train_file_pos <path_to_training_file_pos> --eval_file_pos <path_to_validation_file_pos> \
	--train_file_ner <path_to_training_file_ner> --eval_file_ner <path_to_validation_file_ner> \
	--train_file_dep <path_to_training_file_dep> --eval_file_dep <path_to_validation_file_dep>

--lambda_pos, --lambda_ner and --lambda_dep represent mixture weights associated with POS tagging, NER and dependency parsing losses, respectively, and lambda_pos + lambda_ner + lambda_dep = 1.

Example:

cd phonlp/models
python3 run_phonlp.py --mode train --save_dir ./phonlp_tmp \
	--pretrained_lm "vinai/phobert-base" \
	--lr 1e-5 --batch_size 32 --num_epoch 40 \
	--lambda_pos 0.4 --lambda_ner 0.2 --lambda_dep 0.4 \
	--train_file_pos ../sample_data/pos_train.txt --eval_file_pos ../sample_data/pos_valid.txt \
	--train_file_ner ../sample_data/ner_train.txt --eval_file_ner ../sample_data/ner_valid.txt \
	--train_file_dep ../sample_data/dep_train.conll --eval_file_dep ../sample_data/dep_valid.conll

Evaluation

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--eval_file_pos <path_to_test_file_pos> \
	--eval_file_ner <path_to_test_file_ner> \
	--eval_file_dep <path_to_test_file_dep> 

Example:

cd phonlp/models
python3 run_phonlp.py --mode eval --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--eval_file_pos ../sample_data/pos_test.txt \
	--eval_file_ner ../sample_data/ner_test.txt \
	--eval_file_dep ../sample_data/dep_test.conll 

Annotate a corpus

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir <model_folder_path> \
	--batch_size <int_value> \
	--input_file <path_to_input_file> \
	--output_file <path_to_output_file> 

Example:

cd phonlp/models
python3 run_phonlp.py --mode annotate --save_dir ./phonlp_tmp \
	--batch_size 8 \
	--input_file ../sample_data/input.txt \
	--output_file ../sample_data/output.txt 

Usage example: Python API

import phonlp

# Load the trained PhoNLP model
model = phonlp.load(save_dir='/absolute/path/to/phonlp_tmp')

# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='/absolute/path/to/input.txt', output_file='/absolute/path/to/output.txt')

# Annotate a word-segmented sentence
model.print_out(model.annotate(text="TΓ΄i Δ‘ang lΓ m_việc tαΊ‘i VinAI ."))

By default, the output for each input sentence is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1	TΓ΄i	P	O	3	sub	
2	Δ‘ang	R	O	3	adv
3	lΓ m_việc	V	O	0	root
4	tαΊ‘i	E	O	3	loc
5	VinAI	Np 	B-ORG	4	prob
6	.	CH	O	3	punct

The output can be formatted following the 10-column CoNLL format where the last column is used to represent NER predictions. This can be done by adding output_type='conll' into the model.annotate() function.

Also, in the model.annotate() function, the value of the parameter batch_size can be adjusted to fit your computer's memory instead of using the default one at 1 (batch_size=1). Here, a larger batch_size would lead to a faster performance speed.

Pre-trained PhoNLP model for Vietnamese

import phonlp

# Automatically download the pre-trained PhoNLP model for Vietnamese
# and save it in a local machine folder
phonlp.download(save_dir='/absolute/path/to/pretrained_phonlp')

# Load the pre-trained PhoNLP model for Vietnamese
model = phonlp.load(save_dir='/absolute/path/to/pretrained_phonlp')

# Annotate a corpus where each line represents a word-segmented sentence
model.annotate(input_file='/absolute/path/to/input.txt', output_file='/absolute/path/to/output.txt')

# Annotate a word-segmented sentence
model.print_out(model.annotate(text="TΓ΄i Δ‘ang lΓ m_việc tαΊ‘i VinAI ."))

Using VnCoreNLP to perform word and sentence segmentation on raw Vietnamese texts

In case the input Vietnamese texts are raw, i.e. without word and sentence segmentation, a word segmenter must be applied to produce word-segmented sentences before feeding to the pre-trained PhoNLP model for Vietnamese. Users should use VnCoreNLP to perform word and sentence segmentation (as it produces the same Vietnamese tone normalization that was applied to the data of the POS tagging, NER and dependency parsing tasks).

Installation

pip3 install py_vncorenlp

Example usage

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local machine folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load VnCoreNLP for word and sentence segmentation
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')

# Perform word and sentence segmentation 
print(rdrsegmenter.word_segment("Γ”ng Nguyα»…n KhαΊ―c ChΓΊc  Δ‘ang lΓ m việc tαΊ‘i Đẑi học Quα»‘c gia HΓ  Nα»™i. BΓ  Lan, vợ Γ΄ng ChΓΊc, cΕ©ng lΓ m việc tαΊ‘i Δ‘Γ’y."))
# ['Γ”ng Nguyα»…n_KhαΊ―c_ChΓΊc Δ‘ang lΓ m_việc tαΊ‘i Đẑi_học Quα»‘c_gia HΓ _Nα»™i .', 'BΓ  Lan , vợ Γ΄ng ChΓΊc , cΕ©ng lΓ m_việc tαΊ‘i Δ‘Γ’y .']

More Repositories

1

PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)
Python
720
star
2

PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
658
star
3

BERTweet

BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
Python
573
star
4

WaveDiff

Official Pytorch Implementation of the paper: Wavelet Diffusion Models are fast and scalable Image Generators (CVPR'23)
Python
372
star
5

CPM

πŸ’„ Lipstick ain't enough: Beyond Color-Matching for In-the-Wild Makeup Transfer (CVPR 2021)
Python
364
star
6

XPhoneBERT

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech (INTERSPEECH 2023)
Python
292
star
7

Anti-DreamBooth

Anti-DreamBooth: Protecting users from personalized text-to-image synthesis (ICCV 2023)
Python
205
star
8

LFM

Official PyTorch implementation of the paper: Flow Matching in Latent Space
Python
184
star
9

blur-kernel-space-exploring

Exploring Image Deblurring via Blur Kernel Space (CVPR'21)
Python
137
star
10

dict-guided

Dictionary-guided Scene Text Recognition (CVPR-2021)
Python
126
star
11

VinAI_Translate

A Vietnamese-English Neural Machine Translation System (INTERSPEECH 2022)
123
star
12

MagNet

Progressive Semantic Segmentation (CVPR-2021)
Python
114
star
13

Warping-based_Backdoor_Attack-release

WaNet - Imperceptible Warping-based Backdoor Attack (ICLR 2021)
Python
111
star
14

HyperInverter

HyperInverter: Improving StyleGAN Inversion via Hypernetwork (CVPR 2022)
Python
111
star
15

BARTpho

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (INTERSPEECH 2022)
99
star
16

PhoWhisper

PhoWhisper: Automatic Speech Recognition for Vietnamese (2024)
96
star
17

ISBNet

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution (CVPR 2023)
Python
93
star
18

Dataset-Diffusion

Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation (NeurIPS2023)
Jupyter Notebook
87
star
19

JointIDSF

BERT-based joint intent detection and slot filling with intent-slot attention mechanism (INTERSPEECH 2021)
Python
84
star
20

3D-UCaps

3D-UCaps: 3D Capsules Unet for Volumetric Image Segmentation (MICCAI 2021)
Python
65
star
21

PhoNER_COVID19

COVID-19 Named Entity Recognition for Vietnamese (NAACL 2021)
63
star
22

PCC-pytorch

A pytorch implementation of the paper "Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control"
Python
59
star
23

Counting-DETR

Few-shot Object Counting and Detection (ECCV 2022)
Python
56
star
24

PSENet-Image-Enhancement

PSENet: Progressive Self-Enhancement Network for Unsupervised Extreme-Light Image Enhancement (WACV 2023)
Python
54
star
25

LeMul

Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images (ICCV 2021)
Python
51
star
26

DSW

Distributional Sliced-Wasserstein distance code
Python
47
star
27

PhoMT

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation (EMNLP 2021)
40
star
28

single_image_hdr

Single-Image HDR Reconstruction by Multi-Exposure Generation (WACV 2023)
Python
38
star
29

SwiftBrush

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation (CVPR 2024)
Python
37
star
30

tise-toolbox

TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation (ECCV 2022)
Python
33
star
31

Point-Unet

Point-Unet: A Context-aware Point-based Neural Network for Volumetric Segmentation (MICCAI 2021)
Python
32
star
32

COVID19Tweet

WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets
Python
30
star
33

CREPS

Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis (CVPR 2023)
Python
30
star
34

ViText2SQL

ViText2SQL: A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020 Findings)
28
star
35

input-aware-backdoor-attack-release

Input-aware Dynamic Backdoor Attack (NeurIPS 2020)
Python
27
star
36

QC-StyleGAN

QC-StyleGAN - Quality Controllable Image Generation and Manipulation (NeurIPS 2022)
Python
26
star
37

fsvc-ata

Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments (ECCV 2022)
Python
23
star
38

GeoFormer

Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter (ECCV 2022)
Python
23
star
39

PhoST

A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation (INTERSPEECH 2022)
19
star
40

MISCA

MISCA: A Joint Model for Multiple Intent Detection and Slot Filling with Intent-Slot Co-Attention (EMNLP 2023 - Findings)
Python
18
star
41

PC3-pytorch

Predictive Coding for Locally-Linear Control (ICML-2020)
Python
16
star
42

Open3DIS

Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance (CVPR 2024)
Python
16
star
43

EFHQ

Code and data for the CVPR24 paper "EFHQ: Multi-purpose ExtremePose-Face-HQ dataset" [CVPR'24]
Python
15
star
44

TPC-tensorflow

Temporal Predictive Coding For Model-Based Planning In Latent Space (ICML-2021)
Python
14
star
45

iFS-RCNN

iFS-RCNN: An Incremental Few-shot Instance Segmenter (CVPR 2022)
Python
14
star
46

GaPro

GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers (ICCV 2023)
Python
13
star
47

HyperCUT

HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering (CVPR'23)
Python
12
star
48

LP-OVOD

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024)
Python
11
star
49

selfsup_pcd

Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis (ACCV 2022)
Python
8
star
50

PointSWD

Point-set Distances for Learning Representations of 3D Point Clouds (ICCV 2021)
Python
7
star
51

PhoATIS_Disfluency

From Disfluency Detection to Intent Detection and Slot Filling (INTERSPEECH 2022)
7
star
52

JPIS

JPIS: A Joint Model for Profile-Based Intent Detection and Slot Filling with Slot-to-Intent Attention (ICASSP 2024)
Python
6
star
53

SA-DPM

Official PyTorch implementation of "On Inference Stability for Diffusion Models" (AAAI'24)
Python
5
star
54

PhoDisfluency

Disfluency Detection for Vietnamese (WNUT 2022)
4
star
55

DiverseDream

DiverseDream: A Technique to Generate Diverse 3D Objects from the Same Text Prompt (ECCV '24)
Python
3
star
56

robust-bayesian-recourse

Robust Bayesian Recourse: a robust model-agnostic algorithmic recourse method (UAI'22)
Python
2
star
57

RDUOT

Official code for ECCV 2024 paper β€œA high-quality robust diffusion framework for corrupted dataset”
Python
1
star
58

LAMPAT

LAMPAT: Low-rank Adaptation Multilingual Paraphrasing using Adversarial Training (AAAI'24)
Python
1
star