InstructionZoo

A collection of open-source instruction-tuning datasets to train chat-based LLMs (ChatGPT, LLaMA, Alpaca).

This is an ongoing project. We will soon add tags to classify the following datasets and will continuously update our collection.

The template

## [owner/project-name](https://github.com/link/to/project)

* Size:
* Language:
* Summary:
* Generation Method:
* Paper:
* HuggingFace: (if applicable)
* Demo: (if applicable)
* License:

The English Instruction Datasets

tatsu-lab/Alpaca

gururise/Cleaned Alpaca

  • Size: 51,713 instructions
  • Language: EN
  • Summary: Cleaned Alpaca Dataset helps solve the following issues: Hallucinations, Merged Instructions, Empty outputs, Empty code examples, Instructions to generate images, N/A outputs, Inconsistent input field, Wrong answers, Nonsensical/Unclear instructions, and Extraneous escape and control characters.
  • HuggingFace: https://huggingface.co/datasets/yahma/alpaca-cleaned
  • License: CC BY NC 4.0

PhoebusSi/Alpaca-COT

  • Language: EN
  • Summary: Alpaca-COT is a dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca.
  • Generation Method: Use the templates provided by FLAN to convert the original datasets into various Chain-of-Thought forms, and then convert them into instruction-input-output triplets.
  • HuggingFace: https://huggingface.co/datasets/QingyiSi/Alpaca-CoT
  • License: Apache License

QingyiSi/Alpaca-CoT

  • Empty for now. Soon to update.

orhonovich/unnatural-instructions

  • Size: 240,000 instructions
  • Language: EN
  • Summary: Unnatural Instructions consist of a core dataset of 68,478 instruction-input-output triplets, and a full dataset.
  • Generation Method:
    • Step 1 (Core Dataset Generation): Collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth, following a strict instruction-input-output format.
    • Step 2 (Template Expansion): Prompt a language model to reformulate the tasks in the core dataset, and collect two alternative formulations for each generated task.
  • Paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
  • License:
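
Step 1 above can be sketched as a small prompt builder. This is only an illustration: the exact prompt wording, the `Example N` labels, and the seed examples are assumptions and are not taken from the Unnatural Instructions paper or repository.

```python
from typing import Dict, List

Example = Dict[str, str]


def build_elicitation_prompt(seeds: List[Example]) -> str:
    """Format three seed examples and leave a fourth slot open for the model."""
    if len(seeds) != 3:
        raise ValueError("expected exactly three seed examples")
    blocks = []
    for i, ex in enumerate(seeds, start=1):
        blocks.append(
            f"Example {i}\n"
            f"Instruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\n"
            f"Output: {ex['output']}\n"
        )
    # The open fourth slot is what the language model is asked to complete.
    blocks.append("Example 4\nInstruction:")
    return "\n".join(blocks)


# Made-up seed examples, for illustration only.
seeds = [
    {"instruction": "Translate the word into French.", "input": "cat", "output": "chat"},
    {"instruction": "Give the plural form.", "input": "mouse", "output": "mice"},
    {"instruction": "State the opposite.", "input": "hot", "output": "cold"},
]
prompt = build_elicitation_prompt(seeds)
print(prompt)
```

The returned string would be sent to whichever language model is used for generation; the model call itself is left out on purpose.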

bigscience/PromptSource

bigscience/P3

allenai/natural-instructions

  • Size: 61 tasks, 61 instructions
  • Language: EN
  • Summary: Natural Instructions v1 is a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances.
  • Generation Method:
    • Map existing datasets into the Instruction Schema.
    • Instruction Schema:
      • Part I - Title + Definition + Things-to-Avoid + Emphasis-and-Caution
      • Part II - Positive Example: Input + Output + Reason
      • Part III - Negative Example: Input + Output + Reason + Suggestions to be modified to be positive
      • Part IV - Prompt
  • Paper: Cross-Task Generalization via Natural Language Crowdsourcing Instructions
  • Demo: https://instructions.apps.allenai.org/
  • License:
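
The four-part Instruction Schema listed above could be modeled roughly as follows. The class and field names are hypothetical and chosen only to mirror the parts named in the list; they are not the field names of the released data files.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Demonstration:
    input: str
    output: str
    reason: str
    suggestion: str = ""  # filled in for negative examples only


@dataclass
class TaskInstruction:
    # Part I
    title: str
    definition: str
    things_to_avoid: str
    emphasis_and_caution: str
    # Part II and Part III
    positive_examples: List[Demonstration] = field(default_factory=list)
    negative_examples: List[Demonstration] = field(default_factory=list)
    # Part IV
    prompt: str = ""


# A made-up task, for illustration only.
task = TaskInstruction(
    title="Write a question about a passage",
    definition="Given a passage, write one question it answers.",
    things_to_avoid="Do not ask questions the passage cannot answer.",
    emphasis_and_caution="The question must be a single sentence.",
    positive_examples=[
        Demonstration("The sky is blue.", "What color is the sky?", "Answerable from the passage."),
    ],
    negative_examples=[
        Demonstration(
            "The sky is blue.",
            "Why is grass green?",
            "Not answerable from the passage.",
            suggestion="Ask about the sky instead.",
        ),
    ],
    prompt="Write a question for the passage below.",
)
```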

allenai/super-natural-instructions

google-research/FLAN 2021

  • Size: 62 tasks
  • Language: EN
  • Summary: FLAN 2021 aggregates 62 text datasets on Tensorflow Datasets into a single mixture. It is currently not public.
  • Generation Method: Map existing datasets into the Instruction Schema.
  • Paper: Finetuned Language Models Are Zero-Shot Learners
  • License:

google-research/FLAN 2022 Collection

LianjiaTech/BELLE 1.5M

LianjiaTech/BELLE 10M

XueFuzhao/InstructionWild

  • Size: 479 seed instructions, 52,191 Chinese instructions, 52,191 English instructions
  • Language: CH, EN
  • Summary: InstructionWild uses the same format as Alpaca for fast and easy usage. Its instructions have no input field.
  • Generation Method:
    • Pick 429 instructions out of 700 noisy instructions from Twitter.
    • Use a method similar to Alpaca's to generate the resulting instructions.
  • License:
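
As a rough illustration of the Alpaca-style format mentioned above, a record without an input can be written as below. The `instruction`/`input`/`output` keys follow the Alpaca format; keeping an empty `input` string (rather than dropping the key) is an assumption made here for compatibility with Alpaca-style loaders, and the example text is made up.

```python
import json
from typing import Dict


def to_alpaca_record(instruction: str, output: str) -> Dict[str, str]:
    """Build an Alpaca-style record whose input field is left empty."""
    return {"instruction": instruction, "input": "", "output": output}


record = to_alpaca_record(
    "Suggest a name for a bakery.",
    "Morning Crumb",
)
# One JSON object per line is a common way to store such records.
line = json.dumps(record, ensure_ascii=False)
print(line)
```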

ExMix

UnifiedSKG

MetaICL

openai/InstructionGPT

facebookresearch/metaseq/OPT-IML

  • Size: 1,667 tasks, 3,128 instructions
  • Language: EN
  • Summary: The OPT-IML dataset expands the Super-Natural-Instructions benchmark with task collections from multiple existing works on instruction-tuning, cross-task transfer studies, and area-specific task consolidation.
  • Generation Method:
    • Benchmarks included in OPT-IML are Super-Natural-Instructions, PromptSource, CrossFit, FLAN, ExMix, T5, UnifiedSKG, and Reasoning. Authors only kept partial tasks from CrossFit, ExMix and T5 due to the significant overlap.
    • To organize the Instruction schema, authors broadly classify the instructions in these benchmarks into two categories, dataset-level and instance-level.
  • Paper: OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
  • License:

THUDM/GLM-130B

laion/OIG

  • Size: 30 tasks, 43M instructions
  • Language: EN
  • Summary: OIG contains instructions that are created using data augmentation from a diverse collection of data sources, and formatted in a dialogue style (<human> … <bot> … pairs).
  • Generation Method:
    • OIG is created by various LAION community members, consisting of 30 datasets and 43M instructions, with the goal of reaching 1 trillion tokens.
    • The OIG dataset can be divided roughly into 75% academic datasets, such as P3, Natural Instructions and FLAN, and 25% datasets composed of various tasks, such as high school math, Python coding and poetry generation.
  • HuggingFace: https://huggingface.co/datasets/laion/OIG
  • Demo: https://github.com/LAION-AI/Open-Assistant
  • License:

baize/baize-chatbot

lightaime/camel

  • Size: 115K instructions
  • Language: EN
  • Summary: Camel dataset introduces a novel communicative agent framework named role-playing.
  • Generation Method:
    • The prompt engineering in Camel consists of three prompts, the task specifier prompt, the assistant system prompt, and the user system prompt. The scenarios in Camel include AI Society and Code.
    • The authors also create Data Generation Prompts to generate metadata with LLMs. 50 assistant roles and 50 user roles are generated for AI Society. 20 programming languages and 50 domains are generated for Code.
  • Paper: CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society
  • HuggingFace: https://huggingface.co/camel-ai
  • Demo: https://www.camel-ai.org/
  • License:

thunlp/UltraChat

  • Size: 657K instructions
  • Language: EN
  • Summary: UltraChat is a multi-round dialogue dataset powered by Turbo APIs, composed of three sectors, namely Questions about the World, Writing and Creation, and Assistance on Existent Materials.
  • Generation Method:
    • Two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response.
    • We instruct the user model with carefully designed prompts to mimic human user behavior and call the two APIs iteratively.
  • HuggingFace: https://huggingface.co/datasets/stingning/ultrachat
  • License:
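
The two-model loop described above can be sketched as follows. The two models are passed in as plain callables so the example runs without any API access; the stub functions, the `simulate_dialogue` name, and the turn representation are all hypothetical and stand in for the real Turbo API calls and prompts.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)
Model = Callable[[List[Turn]], str]


def simulate_dialogue(
    user_model: Model, assistant_model: Model, opening: str, rounds: int
) -> List[Turn]:
    """Alternate between a user-simulating model and a responding model."""
    history: List[Turn] = [("user", opening)]
    for r in range(rounds):
        history.append(("assistant", assistant_model(history)))
        # Ask the user model for a follow-up unless this was the last round.
        if r < rounds - 1:
            history.append(("user", user_model(history)))
    return history


# Stub models standing in for the two API-backed models.
def stub_user(history: List[Turn]) -> str:
    return f"Follow-up question after {len(history)} turns."


def stub_assistant(history: List[Turn]) -> str:
    return f"Answer to: {history[-1][1]}"


dialogue = simulate_dialogue(stub_user, stub_assistant, "What is a black hole?", rounds=2)
for speaker, text in dialogue:
    print(f"{speaker}: {text}")
```

Passing the full history to both callables keeps each side aware of the whole conversation, which is one simple way to obtain multi-round dialogues.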

databrickslabs/dolly

  • Size: 7 tasks, 15,000 instructions
  • Language: EN
  • Summary: Dolly is a human-generated corpus, whose categories are Creative Writing, Closed QA, Open QA, Summarization, Information Extraction, Classification and Brainstorming.
  • Generation Method:
    • Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • For instruction categories that require an annotator to consult a reference text, contributors selected passages from Wikipedia for particular subsets of instruction categories.
  • HuggingFace: https://huggingface.co/datasets/databricks/databricks-dolly-15k
  • License:

Instruction-Tuning-with-GPT-4/GPT-4-LLM

ShareGPT

  • Summary: ShareGPT is an open-source Chrome Extension for you to share your wildest ChatGPT conversations with one click.
  • Generation Method: Collect chats with ChatGPT from its users.
  • Demo: https://sharegpt.com/

stanfordnlp/SHP

  • Size: 18 tasks, 385K instructions
  • Language: EN
  • Summary: SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. It is used to train RLHF reward models and NLG evaluation models.
  • Generation Method:
    • The data is sourced from Reddit, which is a public forum organized into topic-specific fora called subreddits.
    • Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post.
  • Paper: Understanding Dataset Difficulty with V-Usable Information
  • HuggingFace: https://huggingface.co/datasets/stanfordnlp/SHP
  • License:
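
Since each example pairs one post with two comments and a preference, a record can be turned into a chosen/rejected pair for reward-model training roughly as below. The `PreferencePair` class and its field names are hypothetical and do not reflect the actual column names of the SHP dataset.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PreferencePair:
    post: str
    comment_a: str
    comment_b: str
    preferred: str  # "A" or "B"


def to_chosen_rejected(pair: PreferencePair) -> Dict[str, str]:
    """Map a preference pair to a prompt with a chosen and a rejected response."""
    if pair.preferred not in ("A", "B"):
        raise ValueError("preferred must be 'A' or 'B'")
    if pair.preferred == "A":
        chosen, rejected = pair.comment_a, pair.comment_b
    else:
        chosen, rejected = pair.comment_b, pair.comment_a
    return {"prompt": pair.post, "chosen": chosen, "rejected": rejected}


# Made-up example, for illustration only.
pair = PreferencePair(
    post="How do I keep bread from going stale?",
    comment_a="Store it in a bread box at room temperature.",
    comment_b="Just buy new bread.",
    preferred="A",
)
sample = to_chosen_rejected(pair)
print(sample)
```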

Anthropic/hh-rlhf

HuggingFaceH4/stack-exchange-preferences

Hello-SimpleAI/HC3

  • Size: 12 tasks, 37,175 instructions
  • Language: EN, CH
  • Summary: HC3 is a comparison corpus that consists of both human and ChatGPT answers to the same questions.
  • Generation Method:
    • Human Answers Collection: The first part is publicly available question-answering datasets, whose answers are given by experts or are highly upvoted. The second part is built by constructing question-answer pairs from wiki sources.
    • ChatGPT Answers Collection: Use ChatGPT to generate answers to the questions in the Human Answers Collection.
  • Paper: How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection
  • HuggingFace: https://huggingface.co/datasets/Hello-SimpleAI/HC3
  • License: CC-BY-SA

f/awesome-chatgpt-prompts

  • Empty for now. Soon to update.

The Chinese Instruction Datasets

FlagOpen/FlagInstruct

  • Size: 2K tasks, 191,191 instructions in total
  • Language: CH
  • Summary: Chinese Open Instruction Generalist (COIG) is a Chinese instruction dataset consisting of 5 sub-tasks.
  • Generation Method:
    • Task 1: Translated Instructions (67,798)
      • Translate the following datasets into Chinese: 1,616 task descriptions in Super-Natural-Instructions v2 along with a single instance for each of them; 175 seed tasks in Self-Instruct; 66,007 instructions from Unnatural Instructions.
    • Task 2: Exam Instructions (63,532)
      • Exams include the Chinese National College Entrance Examination (高考), Middle School Entrance Examinations (中考), and Civil Servant Examination (公务员考试).
      • Turn them into a Chain-of-Thought (CoT) corpus by extracting six informative elements from the original exam questions, including instruction, question context, question, answer, answer analysis, and coarse-grained subject.
    • Task 3: Human Value Alignment Instructions (34,471)
      • Select a set of samples that present shared human values in the Chinese-speaking world, and get 50 seed instructions and 3k resulting instructions.
      • Some additional sets of samples that present regional-culture or country-specific human values are also added.
    • Task 4: Counterfactual Correction Multi-round Chat (13,653)
      • The aim is to alleviate and resolve the pain points of hallucination and factual inconsistency in current LLMs.
      • Based on the CN-DBpedia knowledge graph dataset, CCMC has ~13,000 dialogues with an average of 5 rounds per dialogue, resulting in ~65,000 rounds of chat.
    • Task 5: Leetcode Instructions (11,737)
      • 2,589 programming questions from Leetcode.
  • Paper: Chinese Open Instruction Generalist: A Preliminary Release
  • HuggingFace: https://huggingface.co/datasets/BAAI/COIG
  • License: MIT License
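
The per-part counts listed above can be checked against the stated total with a few lines of arithmetic. The dictionary keys are shorthand labels chosen here; all numbers are copied from the entry above.

```python
# Instruction counts per COIG part, as listed in the entry above.
coig_sizes = {
    "translated": 67_798,
    "exam": 63_532,
    "human_value_alignment": 34_471,
    "counterfactual_correction_chat": 13_653,
    "leetcode": 11_737,
}
total = sum(coig_sizes.values())

# The translated part is itself the sum of its three sources.
translated_sources = 1_616 + 175 + 66_007

# Approximate chat rounds: ~13,000 dialogues at ~5 rounds each.
approx_chat_rounds = 13_000 * 5

print(total, translated_sources, approx_chat_rounds)
```

The sum of the five parts matches the 191,191 instructions given in the Size field, and the three translation sources add up to the 67,798 translated instructions.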

CLUEbenchmark/pCLUE

ydli-ai/CSL

  • Size: 4 tasks, 396,209 instructions
  • Language: CH
  • Summary: CSL is a large-scale Chinese scientific literature dataset.
  • Generation Method:
    • Obtain the papers' meta-information from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR), dated from 2010 to 2020.
    • Label papers with categories and disciplines, with the assistance of volunteers.
    • The data format in CSL is <T,A,K,c,d>, where T is the title, A is the abstract, K is a list of keywords, c is the category label and d is the discipline label.
  • Paper: CSL: A Large-scale Chinese Scientific Literature Dataset
  • License:
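
The <T,A,K,c,d> format described above maps naturally onto a small record type. In this sketch the `CSLRecord` class, the `to_title_pair` helper, and the sample record are hypothetical; the conversion into an instruction pair is just one possible illustration and is not claimed to be how CSL defines its tasks.

```python
from typing import Dict, List, NamedTuple


class CSLRecord(NamedTuple):
    title: str           # T
    abstract: str        # A
    keywords: List[str]  # K
    category: str        # c
    discipline: str      # d


def to_title_pair(record: CSLRecord) -> Dict[str, str]:
    """Hypothetical conversion: ask for a title given the abstract."""
    return {
        "instruction": "Write a title for the following abstract.",
        "input": record.abstract,
        "output": record.title,
    }


# Made-up record, for illustration only.
rec = CSLRecord(
    title="A Study of Example Methods",
    abstract="This paper studies example methods and reports example results.",
    keywords=["example", "methods"],
    category="Engineering",
    discipline="Computer Science",
)
pair = to_title_pair(rec)
print(pair["output"])
```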

YeungNLP/Firefly

  • Size: 23 tasks, 1.1M instructions
  • Language: CH
  • Summary: Firefly dataset is a high-quality Chinese instruction-tuning dataset.
  • Generation Method: For each task, human experts write many templates to ensure the quality and diversity of Firefly dataset.
  • HuggingFace: https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M
  • License:

TsinghuaAI/CUGE

ydli-ai/Chinese-ChatLLaMA

  • Language: Multilingual
  • License:

ZeroPrompt

PlexPt/awesome-chatgpt-prompts-zh

  • Empty for now. Soon to update.

Chinese Alpaca

carbonz0/alpaca-chinese-dataset

  • Size: 20,456 instructions
  • Language: CH
  • Generation Method: Translate Alpaca into Chinese by machine translation and then clean the result.

hikariming/alpaca_chinese_dataset

  • Size: 19,442 instructions
  • Language: CH
  • Generation Method: Translate Alpaca into Chinese with ChatGPT, then have humans check the translations.

ymcui/Chinese-LLaMA-Alpaca

  • Size: 51,458 instructions
  • Language: CH
  • Generation Method: Translate Alpaca into Chinese with ChatGPT, and discard some of the translations.

LC1332/Chinese-alpaca-lora

  • Size: 51,672 instructions
  • Language: CH
  • Generation Method: Translate the Stanford Alpaca dataset into Chinese with ChatGPT.

A-baoYang/alpaca-7b-chinese

  • Size: 20,465 instructions
  • Language: TC
  • Generation Method: Translate the Stanford Alpaca dataset into Traditional Chinese using OpenCC.

A-baoYang/alpaca-7b-chinese

  • Size: 124,469 instructions
  • Language: EN, TC
  • Generation Method: Combine the English instruction/input with Traditional Chinese output generated by ChatGPT.

ntunlplab/traditional-chinese-alpaca

  • Size: 52,002 instructions
  • Language: EN, TC
  • Generation Method: A Traditional-Chinese version of the Alpaca dataset, whose instruction part is left in English.

ntunlplab/traditional-chinese-alpaca

  • Size: 52,002 instructions
  • Language: EN, TC
  • Generation Method: A Traditional-Chinese version of the Alpaca dataset, where each instruction has both an English and a Traditional Chinese version.

The Multilingual Instruction Datasets

bigscience/xP3

  • Size: 83 tasks
  • Language: Multilingual (46 languages)
  • Summary:
    • xP3 is a mixture of 13 training tasks in 46 languages with English prompts.
    • Moreover, there is an xP3 Dataset Family, including the following two datasets:
      • xP3mt is a mixture of 13 training tasks in 46 languages with prompts in 20 languages;
      • xP3all consists of xP3 itself and evaluation datasets, adding an additional 3 tasks.
  • Generation Method: Build on the P3 task taxonomy and add 28 new multilingual datasets.
  • Paper: Crosslingual Generalization through Multitask Finetuning
  • HuggingFace: https://huggingface.co/datasets/bigscience/xP3
  • License:

JosephusCheung/GuanacoDataset

  • Size: 380,835 instructions in total
  • Language: CH, DE, EN, JA, TC
  • Summary: Guanaco dataset builds upon the 175 tasks from Alpaca, containing 3 versions with different sizes and methods.
  • Generation Method:
    • Original Version (48,967): Rewrite 175 Alpaca seed tasks in different languages, and add new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
    • Mixed Version (279,644): The original 175 tasks were translated into 4 versions and regenerated independently, excluding German.
    • Mini Version (52,224): A 52K instruction dataset, which is included in the Mixed Version.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main
  • License:

JosephusCheung/GuanacoDataset QA

  • Size: 205,999 instructions in total
  • Language: CH, DE, EN, JA
  • Summary: The Paper/General-QA dataset is a collection of questions and answers constructed for AI-generated papers or general texts in 4 languages. The purpose of this dataset is to generate paragraph-level answers to questions posed about lengthy documents such as PDFs.
  • Generation Method:
    • The question dataset contains 106,707 questions, and the answer dataset contains 99,292 answers.
    • Similar questions are combined to form a tree-like structure, and graph theory algorithms are used to process user questions, content summaries, and contextual logic.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main/additional
  • License:

The Code Instruction Datasets

sahil280114/codealpaca

  • Size: 20,023 instructions
  • Language: EN
  • Summary:
  • Generation Method: Self-Instruct with prompts that focus on code generation/editing/optimization tasks, using text-davinci-003.
  • HuggingFace:
  • License:
