TextBox 2.0 (妙笔)

“When Li Taibai (the poet Li Bai) was young, he dreamed that flowers blossomed from the tip of his writing brush; thereafter his talent flowed abundantly and his name was known throughout the land.” — Wang Renyu, Kaiyuan Tianbao Yishi (“Dreaming of Flowers Growing on the Brush Tip”)

TextBox 2.0: A Text Generation Library with Pre-trained Language Models

TextBox 2.0 is an up-to-date text generation library based on Python and PyTorch, focused on building a unified and standardized pipeline for applying pre-trained language models to text generation:

  • From a task perspective, we consider 13 common text generation tasks such as translation, story generation, and style transfer, and their corresponding 83 widely-used datasets.
  • From a model perspective, we incorporate 47 pre-trained language models/modules covering the categories of general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models (modules).
  • From a training perspective, we support 4 pre-training objectives and 4 efficient and robust training strategies, such as distributed data parallel and efficient generation.

Compared with the previous version of TextBox, this extension mainly focuses on building a unified, flexible, and standardized framework for better supporting PLM-based text generation models. There are three advantages of TextBox 2.0:

  • It is a significant innovation focusing on comprehensive tasks and PLMs.
  • It is designed to be unified in implementation and interface.
  • It can faithfully reproduce the results reported in existing work.

[Figure: the overall framework of TextBox 2.0]

Installation

Considering that a modified version of transformers will be installed, it is recommended to create a new conda environment:

conda create -n TextBox python=3.8

Then, you can clone our repository and install it with one click:

git clone https://github.com/RUCAIBox/TextBox.git && cd TextBox
bash install.sh

If you encounter the ROUGE-1.5.5.pl - XML::Parser dependency error when installing files2rouge, you can refer to this issue.

Quick Start

This is a script template to run TextBox 2.0 in an end-to-end pipeline:

python run_textbox.py --model=<model-name> --dataset=<dataset-name> --model_path=<hf-or-local-path>

Substitute --model=<xxx>, --dataset=<xxx>, and --model_path=<xxx> with your choices.

The choices of model and model_path can be found in Model. We provide detailed instructions for each model on that page.

The choices of dataset can be found in Dataset. You should download the dataset from https://huggingface.co/RUCAIBox and put it under the dataset folder, just like samsum. If you want to use your own dataset, please refer to here.
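
For reference, a downloaded dataset is a folder of plain-text split files. The tree below follows the samsum example; the exact file names are our assumption from the library's conventions, so check the dataset doc to confirm:

dataset/
└── samsum/
    ├── train.src    # source text of the training split, one example per line
    ├── train.tgt    # target text of the training split, line-aligned with train.src
    ├── valid.src    # validation split, same format
    ├── valid.tgt
    ├── test.src     # test split
    └── test.tgt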

The script below will run the Facebook BART-base model on the samsum dataset:

python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base

Training

Basic Training

For basic training, we provide a detailed tutorial (here) for setting commonly used parameters such as the optimizer, scheduler, validation frequency, early stopping, and so on.
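
As a quick sketch, these parameters can be overridden on the command line in the same key=value style as the quick-start script. The keys below (epochs, learning_rate, train_batch_size) are illustrative assumptions, not confirmed names; the tutorial lists the exact supported keys and their defaults:

python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base \
    --epochs=10 --learning_rate=3e-5 --train_batch_size=16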

Pre-training

TextBox 2.0 provides four pre-training objectives to help users pre-train a model from scratch, including language modeling, masked sequence-to-sequence modeling, denoising auto-encoding, and masked span prediction. See the pre-training doc for a detailed tutorial.
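
For illustration, a from-scratch pre-training run would plausibly select one of the four objectives through a dedicated option. The --pretrain_task flag and its denoising value below are assumptions made for the sake of the sketch; the pre-training doc gives the actual parameter names and supported objectives:

python run_textbox.py --model=BART --dataset=<pretraining-corpus> --pretrain_task=denoising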

Efficient Training

Four useful training methods are provided for improving the optimization of PLMs: distributed data parallel, efficient decoding, hyper-parameter optimization, and repeated experiments. Detailed instructions are provided here.
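
As a sketch of the distributed data parallel workflow, and assuming the launcher is HuggingFace accelerate (an assumption on our part; the instructions linked above are authoritative), a multi-GPU run would look like:

accelerate config    # one-time interactive setup describing your GPUs
accelerate launch run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base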

Model

To support the rapid progress of PLMs on text generation, TextBox 2.0 incorporates 47 models/modules, covering the categories of general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models (modules). See the model doc for detailed usage instructions for each model, along with pre-trained model parameters and generation parameters.

Dataset

We currently support 13 generation tasks (e.g., translation and story generation) and their corresponding 83 datasets. For each dataset, we also provide a description, basic statistics, training/validation/test samples, and a leaderboard. See more details here.

Evaluation

TextBox 2.0 supports 17 automatic metrics of 4 categories and several visualization tools to explore and analyze the generated texts in various dimensions. For evaluation details, see the evaluation doc.
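
As an illustrative sketch only, selecting metrics presumably follows the same key=value command-line style; the metrics key and the bracketed list value below are assumptions (your shell may also require quoting the list), so confirm the real parameter in the evaluation doc:

python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base \
    --metrics='[rouge,bleu]'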

Releases

Releases   Date         Features
v2.0.1     24/12/2022   TextBox 2.0
v2.0.0     20/08/2022   TextBox 2.0 Beta
v0.2.1     15/04/2021   TextBox
v0.1.5     01/11/2021   Basic TextBox

Contributing

Please let us know if you encounter a bug or have any suggestions by filing an issue.

We welcome all contributions from bug fixes to new features and extensions.

We expect all contributions to be discussed first in the issue tracker and then submitted through pull requests.

We thank @LucasTsui0725 for contributing the HRED model and several evaluation metrics.

We thank @wxDai for contributing PointerNet and more than 20 language models via the transformers API.

The Team

TextBox is developed and maintained by AI Box.

License

TextBox is released under the MIT License.

Reference

If you find TextBox 2.0 useful for your research or development, please cite the following papers:

@inproceedings{tang-etal-2022-textbox,
    title = "{T}ext{B}ox 2.0: A Text Generation Library with Pre-trained Language Models",
    author = "Tang, Tianyi  and  Li, Junyi  and  Chen, Zhipeng  and  Hu, Yiwen  and  Yu, Zhuohao  and  Dai, Wenxun  and  Zhao, Wayne Xin  and  Nie, Jian-yun  and  Wen, Ji-rong",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.42",
    pages = "435--444",
}


@inproceedings{textbox,
    title = "{T}ext{B}ox: A Unified, Modularized, and Extensible Framework for Text Generation",
    author = "Li, Junyi  and  Tang, Tianyi  and  He, Gaole  and  Jiang, Jinhao  and  Hu, Xiaoxuan  and  Xie, Puzhao  and  Chen, Zhipeng  and  Yu, Zhuohao  and  Zhao, Wayne Xin  and  Wen, Ji-Rong",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.4",
    doi = "10.18653/v1/2021.acl-demo.4",
    pages = "30--39",
}
