• Stars: 688
• Rank: 65,271 (Top 2 %)
• Language: Python
• License: GNU General Publi...
• Created over 1 year ago
• Updated 5 months ago

Repository Details

Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

awesome-chatgpt-dataset

| Dataset Name | Size | Languages | Source | License |
| --- | --- | --- | --- | --- |
| lima | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |
| im-feeling-curious | 3K | English | This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. | - |
| cc_sbu_align | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
| SLF5K | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
| blended_skill_talk | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
| GSM-IC | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
| ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
| PKU-SafeRLHF-10K | 10K | English | PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
| Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
| WebGPT | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
| Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - |
| LongForm | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. | The LongForm project is subject to a MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5). |
| HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and humans | - |
| RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
| arxiv-math-instruct-50k | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
| Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | |
| Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Dataset | 52K | English | 175 seed instructions by OpenAI API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - |
| Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
| Dynosaur | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
| Finance | 69K | English | 68,912 finance-related instructions | - |
| evol | 70K | English | This is the training data of WizardLM. | - |
| Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - |
| InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |
| Self-Instruct | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
| OASST1 | 89K | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
| HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
| InstructionWild | 104K | English, Chinese | 429 seed instructions, following Alpaca's approach to generate 52K | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
| tapir-cleaned-116k | 116K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | cc-by-nc-4.0 |
| Tapir-Cleaned | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |
| WizardLM_evol_instruct_V2_196k | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |
| LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
| Prosocial Dialog | 166K | English | 165,681 instructions produced from GPT-3 rewrites of questions and human feedback | - |
| COIG | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
| Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
| SHP | 385K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
| ign_clean_instruct_dataset_500k | 509K | English | This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. | apache-2.0 |
| ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
| GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow Question, BigScience/p3 dataset. Answered by OpenAI API. | - |
| Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
| MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
| LaMini-Instruction | 3M | English | A total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
| Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
| BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
| Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
| OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
| xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
| CodeParrot | - | Python | The database was queried for all Python files with less than 1MB in size resulting in a 180GB dataset with over 20M files. | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
| stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
| LangChainDatasets | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
| ParlAI | - | English | 100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
| GPTeacher | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
| silk-road/Wizard-LM-Chinese-instruct-evol | - | Chinese | Wizard-LM-Chinese | - |
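When choosing a dataset from the list by size, it can help to turn the abbreviated size labels into numbers. The snippet below is a minimal sketch, not part of the repository: `parse_size` and `at_most` are hypothetical helper names, and it assumes sizes use only the `K` and `M` suffixes seen in the table, with `-` meaning the size is not listed.

```python
from typing import List, Optional, Tuple


def parse_size(label: str) -> Optional[int]:
    """Convert a size label such as '52K' or '3M' into an approximate count.

    Returns None for '-' or an empty label (size not listed in the table).
    """
    multipliers = {"K": 1_000, "M": 1_000_000}
    label = label.strip().upper()
    if label in ("", "-"):
        return None
    if label[-1] in multipliers:
        return int(round(float(label[:-1]) * multipliers[label[-1]]))
    return int(label)


def at_most(rows: List[Tuple[str, str]], limit: int) -> List[str]:
    """Return the names of datasets whose listed size is <= limit."""
    names = []
    for name, size_label in rows:
        size = parse_size(size_label)
        if size is not None and size <= limit:
            names.append(name)
    return names


# A few (name, size) pairs copied from the table; extend as needed.
rows = [
    ("lima", "1K"),
    ("Alpaca Dataset", "52K"),
    ("MOSS", "1M"),
    ("CodeParrot", "-"),
]

small = at_most(rows, 100_000)
print(small)
```

Entries without a listed size are skipped rather than treated as zero, so they never appear in a size-filtered result.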
