

BanglaBERT

This repository contains the official release of the model "BanglaBERT" and associated downstream fine-tuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla" published in Findings of the Association for Computational Linguistics: NAACL 2022.

Updates

  • We have released BanglaBERT (small). It can be fine-tuned with as little as 4 GB VRAM!
  • We have released a large variant of BanglaBERT! Have a look here.
  • The Bangla2B+ pretraining corpus is now available upon request! See here.

Table of Contents

  • Models
  • Datasets
  • Setup
  • Training & Evaluation
  • Benchmarks
  • Acknowledgements
  • License
  • Citation

Models

The pretrained model checkpoints are available at the Hugging Face model hub.

To use these models for the supported downstream tasks in this repository, see Training & Evaluation.

Note: These models were pretrained using a specific normalization pipeline (available here). All fine-tuning scripts in this repository use this normalization by default. If you adapt the pretrained model to a different task, make sure the text is normalized with this pipeline before tokenization to get the best results. A basic example is available at the model page.
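For reference, here is a minimal sketch of the normalize-then-tokenize flow. It assumes the normalize function exposed by the csebuetnlp normalizer module (see the normalizer repository) and the csebuetnlp/banglabert checkpoint on the Hugging Face hub:

from normalizer import normalize  # from the csebuetnlp/normalizer repository
from transformers import AutoModelForPreTraining, AutoTokenizer

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

raw_text = "..."  # raw Bangla input
normalized_text = normalize(raw_text)  # normalize before tokenizing
inputs = tokenizer(normalized_text, return_tensors="pt")
outputs = model(**inputs)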

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) and Bangla Question Answering (QA) datasets introduced in the paper.

Please fill out this Google Form to request access to the Bangla2B+ pretraining corpus.
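If you fetch the released task datasets from the Hugging Face hub, loading them could look like the sketch below; the dataset identifiers csebuetnlp/xnli_bn (NLI) and csebuetnlp/squad_bn (QA) are assumptions, so substitute whichever identifiers the release points to:

from datasets import load_dataset

# Assumed Hugging Face hub identifiers for the released datasets
nli_dataset = load_dataset("csebuetnlp/xnli_bn")  # Bangla Natural Language Inference
qa_dataset = load_dataset("csebuetnlp/squad_bn")  # Bangla Question Answering

print(nli_dataset["train"][0])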

Setup

To install the necessary requirements, use the following bash snippet:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.

Training & Evaluation

To use the pretrained model for fine-tuning / inference on different downstream tasks, see the following sections; a generic fine-tuning sketch follows the list:

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
  • Question Answering.
    • For tasks such as
      • Extractive Question Answering
      • Open-domain Question Answering
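
Purely as an illustration of the single-sequence classification setup, here is a minimal sketch using the Hugging Face transformers Trainer API rather than this repository's own scripts; the two-example in-memory dataset and the label count are placeholders:

from datasets import Dataset
from normalizer import normalize  # csebuetnlp/normalizer, as in the Models section
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
model = AutoModelForSequenceClassification.from_pretrained(
    "csebuetnlp/banglabert", num_labels=2  # placeholder label count
)

# Placeholder in-memory dataset; a real run would load the task data instead.
raw_dataset = Dataset.from_dict({"text": ["...", "..."], "label": [0, 1]})

def preprocess(batch):
    # Normalize before tokenizing, as the models were pretrained on normalized text.
    return tokenizer([normalize(t) for t in batch["text"]], truncation=True)

tokenized_dataset = raw_dataset.map(preprocess, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs", num_train_epochs=1),
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()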

Benchmarks

  • Zero-shot cross-lingual transfer learning

| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BLUB score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 27.05 | 62.22 | 39.27 | 59.01/64.18 | 50.35 |
| XLM-R (base) | 270M | 42.03 | 72.18 | 45.37 | 55.03/61.83 | 55.29 |
| XLM-R (large) | 550M | 49.49 | 78.13 | 56.48 | 71.13/77.70 | 66.59 |
| BanglishBERT | 110M | 48.39 | 75.26 | 55.56 | 72.87/78.63 | 66.14 |

  • Supervised fine-tuning

| Model | Params | SC (macro-F1) | NLI (accuracy) | NER (micro-F1) | QA (EM/F1) | BLUB score |
|---|---|---|---|---|---|---|
| mBERT | 180M | 67.59 | 75.13 | 68.97 | 67.12/72.64 | 70.29 |
| XLM-R (base) | 270M | 69.54 | 78.46 | 73.32 | 68.09/74.27 | 72.82 |
| XLM-R (large) | 550M | 70.97 | 82.40 | 78.39 | 73.15/79.06 | 76.79 |
| sahajBERT | 18M | 71.12 | 76.92 | 70.94 | 65.48/70.69 | 71.03 |
| BanglishBERT | 110M | 70.61 | 80.95 | 76.28 | 72.43/78.40 | 75.73 |
| BanglaBERT (small) | 13M | 69.29 | 76.75 | 73.41 | 63.30/69.65 | 70.38 |
| BanglaBERT | 110M | 72.89 | 82.80 | 77.78 | 72.63/79.34 | 77.09 |
| BanglaBERT (large) | 335M | 71.94 | 83.41 | 79.20 | 76.10/81.50 | 78.43 |

The benchmarking datasets are as follows:

  • SC: Sentiment Classification
  • NER: Named Entity Recognition
  • NLI: Natural Language Inference
  • QA: Question Answering

Acknowledgements

We would like to thank Intelligent Machines and the Google TFRC program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).


Citation

If you use any of the datasets, models, or code modules, please cite the following paper:

@inproceedings{bhattacharjee-etal-2022-banglabert,
    title = "{B}angla{BERT}: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in {B}angla",
    author = "Bhattacharjee, Abhik  and
      Hasan, Tahmid  and
      Ahmad, Wasi  and
      Mubasshir, Kazi Samin  and
      Islam, Md Saiful  and
      Iqbal, Anindya  and
      Rahman, M. Sohel  and
      Shahriyar, Rifat",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.98",
    pages = "1318--1327",
    abstract = "In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed {`}Bangla2B+{'}) by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at \url{https://github.com/csebuetnlp/banglabert} to advance Bangla NLP.",
}

More Repositories

1. xl-sum (Python, 252 stars): Contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages", published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
2. banglanmt (Python, 145 stars): Contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation", published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16–20, 2020.
3. BanglaNLG (Python, 81 stars): Contains the official release of the model "BanglaT5" and associated downstream fine-tuning code and datasets introduced in the paper titled "BanglaNLG: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla".
4. CrossSum (Python, 49 stars): Contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs", published in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), July 9–14, 2023.
5. normalizer (Python, 35 stars): A Python module providing an easy-to-use port of the text normalization used in the paper "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation", intended for normalizing / cleaning Bengali and English text.
6. banglaparaphrase (Python, 14 stars): Contains the code, data, and associated models of the paper titled "BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset", accepted in Proceedings of the Asia-Pacific Chapter of the Association for Computational Linguistics: AACL 2022.
7. IllusionVQA (Jupyter Notebook, 9 stars): Contains the data and code of the paper titled "IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models".
8. BanglaSocialBias (Jupyter Notebook, 5 stars): The official repository containing all code used to generate the results reported in the paper titled "Social Bias in Large Language Models for Bangla: An Empirical Study on Gender and Religious Bias".
9. BanglaEmotionBias (Jupyter Notebook, 3 stars): The official repository containing all code used to generate the results reported in the paper titled "An Empirical Study of Gendered Stereotypes in Emotional Attributes for Bangla in Multilingual Large Language Models", accepted at the 5th Workshop on Gender Bias in Natural Language Processing (hosted at ACL 2024).
10. csebuetnlp.github.io (CSS, 2 stars)
11. BanglaContextualBias (Jupyter Notebook, 2 stars): The official repository containing all code used to generate the results reported in the paper titled "An Empirical Study on the Characteristics of Bias upon Context Length Variation for Bangla", accepted in Findings of the Association for Computational Linguistics: ACL 2024.