lima |
1K |
English |
LIMA: Less Is More for Alignment |
CC BY-NC-SA |
im-feeling-curious |
3K |
English |
This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. |
- |
cc_sbu_align |
4K |
English |
MiniGPT-4 dataset |
BSD 3-Clause License |
SLF5K |
5K |
English |
The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. |
apache-2.0 |
blended_skill_talk |
7K |
English |
A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. |
- |
GSM-IC |
8K |
English |
Grade-School Math with Irrelevant Context (GSM-IC) |
- |
ChatAlpaca |
10K |
English |
The data currently contain a total of 10,000 conversations with 95,558 utterances. |
Apache-2.0 license |
PKU-SafeRLHF-10K |
10K |
English |
PKU-SafeRLHF-10K is the first dataset of its kind and contains 10K instances with safety preferences. |
- |
Dolly |
15K |
English |
databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. |
CC BY-SA 3.0 |
WebGPT |
20K |
English |
This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. |
- |
Code Alpaca |
20K |
English |
Code generation task involving 20,022 samples |
- |
LongForm |
28K |
English |
The LongForm dataset is created by leveraging English corpus examples with augmented instructions. |
The LongForm project is subject to an MIT License, with custom limitations for restrictions imposed by OpenAI (for the instruction generation part) and by the licenses of the language models used (OPT, LLaMA, and T5). |
HC3 |
37K |
English, Chinese |
37,175 instructions generated by ChatGPT and humans |
- |
RefGPT |
50K |
English, Chinese |
We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. |
- |
arxiv-math-instruct-50k |
50K |
English |
Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories |
- |
Traditional Chinese Alpaca Dataset |
52K |
Traditional Chinese |
Translated from Alpaca Data by ChatGPT API |
Apache-2.0 license |
Cabrita Dataset |
52K |
Portuguese |
Translated from Alpaca Data |
- |
Japanese Alpaca Dataset |
52K |
Japanese |
Translated from Alpaca Data by ChatGPT API |
CC By NC 4.0; OpenAI terms of use |
Alpaca Dataset |
52K |
English |
52K instructions generated from 175 seed instructions using the OpenAI API |
CC By NC 4.0; OpenAI terms of use |
Alpaca Data Cleaned |
52K |
English |
Revised version of Alpaca Dataset |
- |
Alpaca GPT-4 Data |
52K |
English |
Generated by GPT-4 using Alpaca prompts |
- |
Alpaca GPT-4 Data (Chinese) |
52K |
Chinese |
Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT |
- |
Dynosaur |
66K |
English |
Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. |
Apache-2.0 license |
Finance |
69K |
English |
68,912 financial related instructions |
- |
evol |
70K |
English |
This is the training data of WizardLM. |
- |
Vicuna Dataset |
75K |
English |
~100k ShareGPT conversations |
- |
InstructionTranslation |
80K |
Multi-lingual |
Translations were generated by M2M 12B, and output generations were limited to 512 tokens due to the VRAM limit (40 GB). |
MIT |
Self-Instruct |
82K |
English |
We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. |
- |
OASST1 |
89K |
Multi-lingual |
A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. |
apache-2.0 |
HH-RLHF |
91K |
English |
The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. |
MIT |
Guanaco Dataset |
98K |
English, Simplified Chinese, Traditional Chinese HK & TW, Japanese |
175 tasks from the Alpaca model |
GPLv3 |
InstructionWild |
104K |
English, Chinese |
429 seed instructions, expanded to 52K by following the Alpaca generation approach |
Research only; OpenAI terms of use |
Camel Dataset |
107K |
Multi-lingual |
Role-playing between AIs (Open AI API) |
- |
tapir-cleaned-116k |
116K |
English |
This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. |
cc-by-nc-4.0 |
Tapir-Cleaned |
117K |
English |
This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. |
CC BY-NC 4.0 |
WizardLM_evol_instruct_V2_196k |
143K |
English |
This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. |
- |
LLaVA Visual Instruct |
150K |
English |
LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. |
cc-by-nc-4.0 |
Prosocial Dialog |
166K |
English |
165,681 instructions produced by GPT-3 rewriting questions, combined with human feedback |
- |
COIG |
191K |
Chinese |
Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. |
apache-2.0 |
Unnatural Instructions |
241K |
English |
A large dataset of creative and diverse instructions, collected with virtually no human labor. |
MIT |
SHP |
385K |
English |
SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. |
Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
ultrachat |
404K |
English |
To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. |
cc-by-nc-4.0 |
ign_clean_instruct_dataset_500k |
509K |
English |
This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. |
apache-2.0 |
ELI5 |
559K |
English |
The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. |
- |
GPT4All Dataset |
806K |
Multi-lingual |
Subset of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset. Answered by the OpenAI API. |
- |
Instruct |
889K |
English |
888,969 English instructions, augmented using AllenAI NLP tools |
MIT |
MOSS |
1M |
Chinese |
Generated by gpt-3.5-turbo |
Apache-2.0, AGPL-3.0 licenses |
LaMini-Instruction |
3M |
English |
a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts |
cc-by-nc-4.0 |
Natural Instructions |
5M |
Multi-lingual |
5,040,134 instructions collected from diverse NLP tasks |
- |
BELLE |
10M |
Chinese |
The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. |
Research only; OpenAI terms of use |
Firefly |
1.6M |
Chinese |
1,649,398 Chinese instructions in 23 NLP tasks |
- |
OIG-43M Dataset |
43M |
Multi-lingual |
Released by Together, LAION, and Ontocord.ai. |
- |
xP3 |
79M |
Multi-lingual |
78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks |
- |
CodeParrot |
- |
Python |
The database was queried for all Python files with less than 1MB in size resulting in a 180GB dataset with over 20M files. |
- |
Alpaca-CoT Dataset |
- |
Multi-lingual |
Instruction Data Collection |
ODC-By |
stack-exchange-paired |
- |
English |
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. |
cc-by-sa-4.0 |
LangChainDatasets |
- |
English |
This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. |
- |
ParlAI |
- |
English |
100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. |
- |
GPTeacher |
- |
English |
A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer |
- |
silk-road/Wizard-LM-Chinese-instruct-evol |
- |
Chinese |
Wizard-LM-Chinese |
- |