• Stars: 587
• Rank: 75,614 (Top 2%)
• Language: Python
• License: BSD 3-Clause "New...
• Created 7 months ago
• Updated 3 months ago

Repository Details

A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.

Bonito

Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library, built on top of the Hugging Face transformers and vllm libraries, for easily creating synthetic datasets with Bonito.

Installation

Create an environment and install the package using the following commands:

conda create -n bonito python=3.9
conda activate bonito
pip install -e .

Basic Usage

To generate a synthetic instruction tuning dataset with Bonito, use the following code:

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# Load a dataset of unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
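
The call above reads each passage from the column named by context_col. To run it on your own text, one option is to first shape that text into rows with the same column. The sketch below only illustrates the idea and is not part of the Bonito library: make_rows and its max_words parameter are hypothetical names, and the block builds plain Python dictionaries so that it runs without extra dependencies.

```python
from typing import Dict, List


def make_rows(passages: List[str], context_col: str = "input",
              max_words: int = 200) -> List[Dict[str, str]]:
    """Hypothetical helper: turn raw passages into rows keyed by context_col.

    Passages longer than max_words are split into consecutive chunks so
    each row stays reasonably short; empty passages are skipped.
    """
    rows = []
    for passage in passages:
        words = passage.split()
        for start in range(0, len(words), max_words):
            chunk = " ".join(words[start:start + max_words])
            rows.append({context_col: chunk})
    return rows


rows = make_rows(
    ["The receiving party shall keep all information confidential.", ""],
    context_col="input",
)
# Assumption: with the `datasets` package installed, these rows could be
# wrapped with datasets.Dataset.from_list(rows) before calling generate_tasks.
print(rows)
```

The 200-word chunk size is an arbitrary choice for this sketch. Pick a length that suits your passages and the model's context window.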

Supported Task Types

The supported task types, given as full name (short form), are: extractive question answering (exqa), multiple-choice question answering (mcqa), question generation (qg), question answering without choices (qa), yes-no question answering (ynqa), coreference resolution (coref), paraphrase generation (paraphrase), paraphrase identification (paraphrase_id), sentence completion (sent_comp), sentiment (sentiment), summarization (summarization), text generation (text_gen), topic classification (topic_class), word sense disambiguation (wsd), textual entailment (te), and natural language inference (nli).

You can use either the full name or the short form to specify the task_type in generate_tasks.
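
If your own scripts take the task type from user input, it can help to validate it before loading the model. The sketch below copies the list above into a lookup table. TASK_TYPES and to_short_form are hypothetical names for illustration, not the library's API, and the case-insensitive matching is an assumption of this sketch rather than documented Bonito behavior.

```python
# Full name -> short form, copied from the supported task types listed above.
TASK_TYPES = {
    "extractive question answering": "exqa",
    "multiple-choice question answering": "mcqa",
    "question generation": "qg",
    "question answering without choices": "qa",
    "yes-no question answering": "ynqa",
    "coreference resolution": "coref",
    "paraphrase generation": "paraphrase",
    "paraphrase identification": "paraphrase_id",
    "sentence completion": "sent_comp",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text generation": "text_gen",
    "topic classification": "topic_class",
    "word sense disambiguation": "wsd",
    "textual entailment": "te",
    "natural language inference": "nli",
}
SHORT_FORMS = set(TASK_TYPES.values())


def to_short_form(task_type: str) -> str:
    """Hypothetical helper: map a full name or short form to its short form."""
    key = task_type.strip().lower()  # assumption: ignore case and padding
    if key in SHORT_FORMS:
        return key
    if key in TASK_TYPES:
        return TASK_TYPES[key]
    raise ValueError(f"Unsupported task type: {task_type!r}")


print(to_short_form("natural language inference"))
print(to_short_form("exqa"))
```

Failing early on an unsupported name avoids loading the model only to hit an error at generation time.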

Tutorial

We have created a tutorial on using a quantized version of the model on a Google Colab T4 instance. The quantized version was graciously contributed by user alexandreteles. We also have an additional tutorial for trying out the Bonito model on an A100 GPU on Google Colab.

Citation

If you use Bonito in your research, please cite the following paper:

@article{bonito:arxiv24,
  Author = {Nihal V. Nayak and Yiyang Nan and Avi Trost and Stephen H. Bach},
  Title = {Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation},
  Volume = {arXiv:2402.18334 [cs.CL]},
  Year = {2024}}

More Repositories

1. zsl-kg (Python, 109 stars): Framework for zero-shot learning with knowledge graphs.
2. wiser (Python, 80 stars): Framework for weakly supervised deep sequence taggers, focused on named entity recognition.
3. csp (Python, 79 stars): Learning to compose soft prompts for compositional zero-shot learning.
4. menghini-neurips23-code (Python, 40 stars): Exploring prompt tuning with pseudolabels for multiple modalities, learning settings, and training strategies.
5. alfred (Python, 37 stars): A system for prompted weak supervision.
6. taglets (Python, 18 stars)
7. safranchik-aaai20-code (Python, 15 stars)
8. labelmodels (Python, 14 stars): Lightweight implementations of generative label models for weakly supervised machine learning.
9. nayak-aclfindings24-code (Python, 14 stars)
10. nplm (Python, 12 stars): A weak supervision framework for (partial) labeling functions.
11. efsl (Python, 11 stars): Extended Few-Shot Learning: Exploiting Existing Resources for Novel Tasks.
12. LexC-Gen (Python, 11 stars): Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
13. nayak-tmlr22-code (Python, 8 stars)
14. yu-aistats22-code (Jupyter Notebook, 6 stars)
15. amcl (Python, 5 stars): Adversarial Multi Class Labeling.
16. fudd (Python, 5 stars): Follow-Up Differential Descriptions: Language Models Resolve Ambiguities for Image Classification.
17. piriyakulkij-mlsys22-code (Python, 4 stars)
18. mazzetto-aistats21-code (Python, 3 stars)
19. mazzetto-neurips22-code (Python, 2 stars)
20. su-bigdata23-code (Python, 2 stars): Code Repository for IEEE BigData 23 Paper "Leveraging Large Language Models for Structure Learning in Prompted Weak Supervision".
21. mazzetto-icml21-code (Python, 1 star)
22. mazzetto-arxiv23-code (Python, 1 star): An Adaptive Method for Weak Supervision with Drifting Data.
23. LexC-Gen-Data-Archive (1 star): Data Repository for LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons.