• Stars
    star
    573
  • Rank 77,865 (Top 2 %)
  • Language
    Python
  • License
    MIT License
  • Created over 4 years ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)

Table of contents

  1. Introduction
  2. Main results
  3. Using BERTweet with transformers
  4. Using BERTweet with fairseq

BERTweet: A pre-trained language model for English Tweets

BERTweet is the first public large-scale language model pre-trained for English Tweets. BERTweet is trained based on the RoBERTa pre-training procedure. The corpus used to pre-train BERTweet consists of 850M English Tweets (16B word tokens ~ 80GB), containing 845M Tweets streamed from 01/2012 to 08/2019 and 5M Tweets related to the COVID-19 pandemic. The general architecture and experimental results of BERTweet can be found in our paper:

@inproceedings{bertweet,
title     = {{BERTweet: A pre-trained language model for English Tweets}},
author    = {Dat Quoc Nguyen and Thanh Vu and Anh Tuan Nguyen},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
pages     = {9--14},
year      = {2020}
}

Please CITE our paper when BERTweet is used to help produce published results or is incorporated into other software.

Main results

postagging        ner

sentiment        irony

Using BERTweet with transformers

Installation

  • Install transformers with pip: pip install transformers, or install transformers from source.
    Note that we merged a slow tokenizer for BERTweet into the main transformers branch. The process of merging a fast tokenizer for BERTweet is in the discussion, as mentioned in this pull request. If users would like to utilize the fast tokenizer, the users might install transformers as follows:
git clone --single-branch --branch fast_tokenizers_BARTpho_PhoBERT_BERTweet https://github.com/datquocnguyen/transformers.git
cd transformers
pip3 install -e .
  • Install tokenizers with pip: pip3 install tokenizers

Pre-trained models

Model #params Arch. Max length Pre-training data
vinai/bertweet-base 135M base 128 850M English Tweets (cased)
vinai/bertweet-covid19-base-cased 135M base 128 23M COVID-19 English Tweets (cased)
vinai/bertweet-covid19-base-uncased 135M base 128 23M COVID-19 English Tweets (uncased)
vinai/bertweet-large 355M large 512 873M English Tweets (cased)
  • 09/2020: Two pre-trained models vinai/bertweet-covid19-base-cased and vinai/bertweet-covid19-base-uncased are resulted by further pre-training the pre-trained model vinai/bertweet-base on a corpus of 23M COVID-19 English Tweets.
  • 08/2021: Released vinai/bertweet-large.

Example usage

import torch
from transformers import AutoModel, AutoTokenizer 

bertweet = AutoModel.from_pretrained("vinai/bertweet-large")

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

# INPUT TWEET IS ALREADY NORMALIZED!
line = "DHEC confirms HTTPURL via @USER :crying_face:"

input_ids = torch.tensor([tokenizer.encode(line)])

with torch.no_grad():
    features = bertweet(input_ids)  # Models outputs are now tuples
    
## With TensorFlow 2.0+:
# from transformers import TFAutoModel
# bertweet = TFAutoModel.from_pretrained("vinai/bertweet-large")

Normalize raw input Tweets

Before applying BPE to the pre-training corpus of English Tweets, we tokenized these Tweets using TweetTokenizer from the NLTK toolkit and used the emoji package to translate emotion icons into text strings (here, each icon is referred to as a word token). We also normalized the Tweets by converting user mentions and web/url links into special tokens @USER and HTTPURL, respectively. Thus it is recommended to also apply the same pre-processing step for BERTweet-based downstream applications w.r.t. the raw input Tweets.

Given the raw input Tweets, to obtain the same pre-processing output, users could employ our TweetNormalizer module.

  • Installation: pip3 install nltk emoji==0.6.0
  • The emoji version must be either 0.5.4 or 0.6.0. Newer emoji versions have been updated to newer versions of the Emoji Charts, thus not consistent with the one used for pre-processing our pre-training Tweet corpus.
import torch
from transformers import AutoTokenizer
from TweetNormalizer import normalizeTweet

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-large")

line = normalizeTweet("DHEC confirms https://postandcourier.com/health/covid19/sc-has-first-two-presumptive-cases-of-coronavirus-dhec-confirms/article_bddfe4ae-5fd3-11ea-9ce4-5f495366cee6.html?utm_medium=social&utm_source=twitter&utm_campaign=user-share… via @postandcourier 😢")

input_ids = torch.tensor([tokenizer.encode(line)])

Using BERTweet with fairseq

Please see details at HERE!

License

MIT License

Copyright (c) 2020-2021 VinAI Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

More Repositories

1

PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)
Python
720
star
2

PhoBERT

PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
658
star
3

WaveDiff

Official Pytorch Implementation of the paper: Wavelet Diffusion Models are fast and scalable Image Generators (CVPR'23)
Python
372
star
4

CPM

💄 Lipstick ain't enough: Beyond Color-Matching for In-the-Wild Makeup Transfer (CVPR 2021)
Python
364
star
5

XPhoneBERT

XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech (INTERSPEECH 2023)
Python
292
star
6

Anti-DreamBooth

Anti-DreamBooth: Protecting users from personalized text-to-image synthesis (ICCV 2023)
Python
205
star
7

LFM

Official PyTorch implementation of the paper: Flow Matching in Latent Space
Python
184
star
8

blur-kernel-space-exploring

Exploring Image Deblurring via Blur Kernel Space (CVPR'21)
Python
137
star
9

PhoNLP

PhoNLP: A BERT-based multi-task learning model for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
Python
137
star
10

dict-guided

Dictionary-guided Scene Text Recognition (CVPR-2021)
Python
126
star
11

VinAI_Translate

A Vietnamese-English Neural Machine Translation System (INTERSPEECH 2022)
123
star
12

MagNet

Progressive Semantic Segmentation (CVPR-2021)
Python
114
star
13

Warping-based_Backdoor_Attack-release

WaNet - Imperceptible Warping-based Backdoor Attack (ICLR 2021)
Python
111
star
14

HyperInverter

HyperInverter: Improving StyleGAN Inversion via Hypernetwork (CVPR 2022)
Python
111
star
15

BARTpho

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (INTERSPEECH 2022)
99
star
16

PhoWhisper

PhoWhisper: Automatic Speech Recognition for Vietnamese (2024)
96
star
17

ISBNet

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution (CVPR 2023)
Python
93
star
18

Dataset-Diffusion

Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation (NeurIPS2023)
Jupyter Notebook
87
star
19

JointIDSF

BERT-based joint intent detection and slot filling with intent-slot attention mechanism (INTERSPEECH 2021)
Python
84
star
20

3D-UCaps

3D-UCaps: 3D Capsules Unet for Volumetric Image Segmentation (MICCAI 2021)
Python
65
star
21

PhoNER_COVID19

COVID-19 Named Entity Recognition for Vietnamese (NAACL 2021)
63
star
22

PCC-pytorch

A pytorch implementation of the paper "Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control"
Python
59
star
23

Counting-DETR

Few-shot Object Counting and Detection (ECCV 2022)
Python
56
star
24

PSENet-Image-Enhancement

PSENet: Progressive Self-Enhancement Network for Unsupervised Extreme-Light Image Enhancement (WACV 2023)
Python
54
star
25

LeMul

Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images (ICCV 2021)
Python
51
star
26

DSW

Distributional Sliced-Wasserstein distance code
Python
47
star
27

PhoMT

PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation (EMNLP 2021)
40
star
28

single_image_hdr

Single-Image HDR Reconstruction by Multi-Exposure Generation (WACV 2023)
Python
38
star
29

SwiftBrush

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation (CVPR 2024)
Python
37
star
30

tise-toolbox

TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation (ECCV 2022)
Python
33
star
31

Point-Unet

Point-Unet: A Context-aware Point-based Neural Network for Volumetric Segmentation (MICCAI 2021)
Python
32
star
32

COVID19Tweet

WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets
Python
30
star
33

CREPS

Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis (CVPR 2023)
Python
30
star
34

ViText2SQL

ViText2SQL: A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020 Findings)
28
star
35

input-aware-backdoor-attack-release

Input-aware Dynamic Backdoor Attack (NeurIPS 2020)
Python
27
star
36

QC-StyleGAN

QC-StyleGAN - Quality Controllable Image Generation and Manipulation (NeurIPS 2022)
Python
26
star
37

fsvc-ata

Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments (ECCV 2022)
Python
23
star
38

GeoFormer

Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter (ECCV 2022)
Python
23
star
39

PhoST

A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation (INTERSPEECH 2022)
19
star
40

MISCA

MISCA: A Joint Model for Multiple Intent Detection and Slot Filling with Intent-Slot Co-Attention (EMNLP 2023 - Findings)
Python
18
star
41

PC3-pytorch

Predictive Coding for Locally-Linear Control (ICML-2020)
Python
16
star
42

Open3DIS

Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance (CVPR 2024)
Python
16
star
43

EFHQ

Code and data for the CVPR24 paper "EFHQ: Multi-purpose ExtremePose-Face-HQ dataset" [CVPR'24]
Python
15
star
44

TPC-tensorflow

Temporal Predictive Coding For Model-Based Planning In Latent Space (ICML-2021)
Python
14
star
45

iFS-RCNN

iFS-RCNN: An Incremental Few-shot Instance Segmenter (CVPR 2022)
Python
14
star
46

GaPro

GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers (ICCV 2023)
Python
13
star
47

HyperCUT

HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering (CVPR'23)
Python
12
star
48

LP-OVOD

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024)
Python
11
star
49

selfsup_pcd

Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis (ACCV 2022)
Python
8
star
50

PointSWD

Point-set Distances for Learning Representations of 3D Point Clouds (ICCV 2021)
Python
7
star
51

PhoATIS_Disfluency

From Disfluency Detection to Intent Detection and Slot Filling (INTERSPEECH 2022)
7
star
52

JPIS

JPIS: A Joint Model for Profile-Based Intent Detection and Slot Filling with Slot-to-Intent Attention (ICASSP 2024)
Python
6
star
53

SA-DPM

Official PyTorch implementation of "On Inference Stability for Diffusion Models" (AAAI'24)
Python
5
star
54

PhoDisfluency

Disfluency Detection for Vietnamese (WNUT 2022)
4
star
55

DiverseDream

DiverseDream: A Technique to Generate Diverse 3D Objects from the Same Text Prompt (ECCV '24)
Python
3
star
56

robust-bayesian-recourse

Robust Bayesian Recourse: a robust model-agnostic algorithmic recourse method (UAI'22)
Python
2
star
57

RDUOT

Official code for ECCV 2024 paper “A high-quality robust diffusion framework for corrupted dataset”
Python
1
star
58

LAMPAT

LAMPAT: Low-rank Adaptation Multilingual Paraphrasing using Adversarial Training (AAAI'24)
Python
1
star