
The open-source implementation of ChatGPT, Alpaca, Vicuna and the RLHF pipeline. Implement a ChatGPT from scratch.
 

Chinese | English

Open-ChatGPT: An open-source implementation of ChatGPT


Introduction

Open-ChatGPT is an open-source library that lets you train a hyper-personalized, ChatGPT-like AI model on your own data with a minimal amount of compute.

Open-ChatGPT is a general system framework for enabling an end-to-end training experience for ChatGPT-like models. It can automatically take your favorite pre-trained large language model through the three stages of an OpenAI InstructGPT-style pipeline to produce your very own high-quality, ChatGPT-style model.
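
Concretely, the three InstructGPT-style stages are supervised fine-tuning (SFT), reward-model training, and PPO-based RLHF. The sketch below is schematic only; the function names and signatures are illustrative placeholders, not the library's actual API.

# Schematic of the three-stage InstructGPT-style pipeline
# (placeholder functions, not the repository's actual API).

def supervised_finetune(base_model, instruction_data):
    """Stage 1 (SFT): fine-tune the pretrained LM on instruction-response pairs."""
    ...

def train_reward_model(sft_model, comparison_data):
    """Stage 2 (RM): fit a reward model on human preference comparisons."""
    ...

def run_ppo_rlhf(sft_model, reward_model, prompt_data):
    """Stage 3 (RLHF): optimize the SFT policy against the reward model with PPO."""
    ...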

We have Impleamented RLHF (Reinforcement Learning with Human Feedback) powered by transformer library and DeepsSpeed. It supports distributed training and offloading, which can fit extremly large models.

If you like the project, please show your support by leaving a star ⭐.

News

  • [2023/05] 🔥 We implement Stanford Alpaca Lora.

  • [2023/05] 🔥 We implement Stanford Alpaca.

  • [2023/04] We released the RLHF (Reinforcement Learning from Human Feedback) pipeline.

  • [2023/03] We released the code for OpenChatGPT: an open-source library to train chatbots like ChatGPT.

Table of Contents

  • Introduction
  • News
  • Data Collection
  • Data Preprocessing
  • Install
  • Instruction Fine-tuning
  • Inference
  • Contributing
  • License
  • Acknowledgements
  • Citation

Data Collection

Instruction Datasets

A collection of open-source instruction-tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca).

Referring to this collection (@jianzhnie), we labeled each dataset according to the following rules:

(Lang)Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

(Task)Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

(Gen)Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
| Project | Datasets | Org | Nums | Lang | Task | Gen | Type | Src |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chain of Thought | cot_data, few_shot_data | Google | 74771 | EN/CN | MT | HG | instruct with cot reasoning | annotating CoT on existing data |
| GPT4all | nomic-ai/gpt4all-j-prompt-generations | nomic-ai | 806199 | EN | MT | COL | code, stories and dialogs | distillation from GPT-3.5-turbo |
| GPTeacher | GPT-4 General-Instruct, Roleplay-Instruct, Code-Instruct, Toolformer | teknium1 | 29013 | EN | MT | SI | general, roleplay, toolformer | GPT-4 & toolformer |
| Guanaco | JosephusCheung/GuanacoDataset | JosephusCheung | 534610 | ML | MT | SI | various linguistic tasks | text-davinci-003 |
| HC3 | Hello-SimpleAI/HC3 | Hello-SimpleAI, 万得资讯 | 37175 | EN/CN | TS | MIX | dialogue evaluation | human or ChatGPT |
| HC3-Chinese | Hello-SimpleAI/HC3-Chinese | Hello-SimpleAI, 万得资讯 | 13k | CN | TS | MIX | dialogue evaluation | human or ChatGPT |
| alpaca | tatsu-lab/alpaca | tatsu-lab | 52002 | EN | MT | SI | general instruct | text-davinci-003 |
| AlpacaDataCleaned | yahma/alpaca-cleaned | yahma | 52k | EN | MT | SI | general instruct | text-davinci-003 |
| Chinese-LLaMA-Alpaca | alpaca_data_zh_51k | ymcui (讯飞) | 51k | CN | MT | SI | general instruct | text-davinci-003 |
| Luotuo-Chinese-LLM 骆驼 | trans_chinese_alpaca_data | LC1332 (商汤) | 52k | CN | MT | SI | general instruct | text-davinci-003 |
| Natural Instructions | Allen AI 61 task, 1.5k task | Allen AI | 5040134 | ML | MT | COL | diverse nlp tasks | human annotated datasets collection |
| belle_cn | BelleGroup/train_1M_CN, BelleGroup/train_0.5M_CN | BelleGroup (链家) | 1079517 | CN | TS/MT | SI | general, mathematical reasoning, dialogue | |

Here we list only a small part of the instruction-tuning datasets. To find more, please check out jianzhnie/awesome-instruction-datasets: A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).

RLHF Datasets

Instruction tuning / Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. Following is a comprehensive list of datasets used for instruction tuning and RLHF in various LLMs, making it easier for researchers and developers to access and utilize these resources.

| Project | Org | Nums | Lang | Summary |
| --- | --- | --- | --- | --- |
| webgpt_comparisons | OpenAI | 19,578 | English | In the WebGPT paper, the authors trained a reward model from human feedback, then used it to train a long-form question-answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project: 19,578 comparisons in total. |
| SHP | stanfordnlp | 349K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP). |
| rlhf-reward-datasets | yitingxie | 76.3k | English | |
| Dahoas/full-hh-rlhf | Dahoas | 112k | English | Anthropic's HH dataset reformatted into prompt, chosen, rejected samples. |
| Dahoas/synthetic-instruct-gptj-pairwise | Dahoas | | English | |
| Dahoas/rm-static | Dahoas | 76.3k | English | Split of hh-static used for training reward models after supervised fine-tuning. |
| Anthropic/hh-rlhf | Anthropic | 22k | English | An iterated 'online' RLHF dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data. |
| Instruction-Tuning-with-GPT-4/GPT-4-LLM | Instruction-Tuning-with-GPT-4 | 52k | English | Ranked responses (note: data is evaluated by the GPT-4 model, NOT humans) to Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML), obtained by asking GPT-4 to rate the quality. The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses". |
| thu-coai/Safety-Prompts | thu-coai | 100k | Chinese | Chinese safety prompts for evaluating and improving the safety of large models, aligning model outputs with human values. |
| Chatgpt-Comparison-Detection project | | | | |

To find more datasets, please check out jianzhnie/awesome-instruction-datasets: A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca).

Data Preprocessing

We have developed a data preprocessing code that offers a unified interface for various large language models. It can be used to preprocess data for a variety of purposes, such as instruction tuning and RLHF modeling tasks. If you're interested in learning more, please check out the following links to our prompt dataset and data utilities:

Data Formatting

In our collection, all data has been formatted using the same template. Each sample follows this structure:

[
  {
    "instruction": "instruction string",
    "input": "input string",   # may be empty
    "output": "output string"
  }
]
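
For illustration, here is a minimal sketch of rendering one such sample into an Alpaca-style training prompt. The template wording follows the Stanford Alpaca convention; the helper function itself is ours, not part of the repository:

def format_prompt(sample: dict) -> str:
    """Render one {instruction, input, output} sample as an Alpaca-style prompt."""
    if sample.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n"
            f"### Response:\n{sample['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['output']}"
    )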

Install

git clone https://github.com/jianzhnie/open-chatgpt.git
cd open-chatgpt
pip install -r requirements.txt

PEFT

  • If you would like to use LoRA along with other parameter-efficient fine-tuning methods, please install peft as an additional dependency (see the install commands below).

DeepSpeed

  • If you want to accelerate LLM training using techniques such as pipeline parallelism, gradient checkpointing, and tensor fusion, please install DeepSpeed (see the install commands below).
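
Both optional dependencies are standard PyPI packages:

pip install peft
pip install deepspeed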

Instruction Fine-tuning

Fine-tuning Alpaca-7B

We fine-tune our models using standard Hugging Face training code. We fine-tune LLaMA-7B and LLaMA-13B with the following hyperparameters:

| Hyperparameter | LLaMA-7B | LLaMA-13B |
| --- | --- | --- |
| Batch size | 128 | 128 |
| Learning rate | 2e-5 | 1e-5 |
| Epochs | 3 | 5 |
| Max length | 512 | 512 |
| Weight decay | 0 | 0 |

You can use the following command to train Alpaca-7B on 4 x A100 (40GB) GPUs.

cd examples/alpaca/
python train_alpaca.py \
    --model_name_or_path  'decapoda-research/llama-7b-hf' \
    --data_path tatsu-lab/alpaca  \
    --output_dir work_dir/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1

Using DeepSpeed

If you run into out-of-memory (OOM) errors, consider the options below.

Naively, fine-tuning a 7B model in mixed precision with the Adam optimizer requires about 7 x 4 x 4 = 112 GB of VRAM, i.e. roughly 16 bytes per parameter for weights, gradients, and optimizer states. The commands given above enable parameter sharding, so no redundant model copy is stored on any GPU.
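
As a back-of-the-envelope check of that arithmetic (our own helper, not repository code):

def naive_finetune_vram_gb(n_params_billions: float) -> float:
    """Rough VRAM estimate for naive mixed-precision Adam fine-tuning.

    Per parameter: fp16 weights (2 B) + fp16 gradients (2 B) + fp32 master
    weights and Adam moments (12 B) ~= 16 bytes, i.e. the 7 x 4 x 4 rule of thumb.
    """
    return n_params_billions * 16

print(naive_finetune_vram_gb(7))   # 112.0 GB for LLaMA-7B
print(naive_finetune_vram_gb(13))  # 208.0 GB for LLaMA-13B

If you'd like to further reduce the memory footprint, here are some options: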

  • Turn on CPU offload for FSDP with --fsdp "full_shard auto_wrap offload". This saves VRAM at the cost of longer runtime.
  • In our experience, DeepSpeed stage-3 (with offload) can at times be more memory-efficient than FSDP with offload. Here's an example of using DeepSpeed stage-3 on 8 GPUs with both parameter and optimizer offload (a sketch of the referenced config follows this list):
pip install deepspeed
cd examples/alpaca/
torchrun --nproc_per_node=8 train_alpaca.py \
    --model_name_or_path  'decapoda-research/llama-7b-hf' \
    --data_path tatsu-lab/alpaca  \
    --output_dir work_dir/  \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --deepspeed "scripts/ds_config_zero3_auto.json"
  • LoRA fine-tunes low-rank slices of the query, key, and value projection matrices. This can reduce the total memory footprint from 112 GB to about 7 x 4 = 28 GB.
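
For reference, a ZeRO stage-3 configuration with both parameter and optimizer offload typically looks like the following. This is a sketch of common DeepSpeed settings; the repository's scripts/ds_config_zero3_auto.json may differ in its details:

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}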

Fine-tuning Alpaca-7B with LoRA

This part reproduces the Stanford Alpaca results using low-rank adaptation (LoRA).

To fine-tune cheaply and efficiently, we use Hugging Face's PEFT as well as Tim Dettmers' bitsandbytes.

The training script (train_alpaca_lora.py) contains a straightforward application of PEFT to the LLaMA model, along with some code for prompt construction and tokenization.

python train_alpaca_lora.py \
    --model_name_or_path  decapoda-research/llama-7b-hf  \
    --data_path tatsu-lab/alpaca  \
    --output_dir work_dir_lora/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1
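
At its core, applying LoRA through PEFT looks roughly like the sketch below. This is a minimal illustration rather than the repository's exact code, and the target module names assume LLaMA's attention projection naming (q_proj, k_proj, v_proj):

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model in half precision to keep memory low.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16)

# Attach low-rank adapters to the attention projections; the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update matrices
    lora_alpha=16,            # scaling factor applied to the update
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable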

Inference

The inference script (generate_server.py) loads the foundation model from the Hugging Face model hub and the LoRA weights from tloen/alpaca-lora-7b, then runs a Gradio interface for inference on a specified input. Users should treat this as example code for using the model and modify it as needed.

Example usage:

python generate_server.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --lora_model_name_or_path  tloen/alpaca-lora-7b
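
Internally, loading LoRA weights on top of a base model for inference looks roughly like this minimal sketch (not the exact server code):

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

# Layer the LoRA adapter weights over the frozen base model.
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")
model.eval()

prompt = "### Instruction:\nTell me a joke.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))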

Not Enough Memory

If you do not have enough memory, you can enable 8-bit compression by adding --load_8bit to the commands above. This can reduce memory usage by around half, with slightly degraded model quality. It is compatible with CPU, GPU, and Metal backends. Alpaca-7B with 8-bit compression can run on a single NVIDIA 3090/4080/T4/V100 (16GB) GPU.

python generate_server.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --lora_model_name_or_path  tloen/alpaca-lora-7b \
    --load_8bit
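
Under the hood, 8-bit loading typically goes through bitsandbytes via the Transformers loading API; a minimal sketch, not the exact server code:

from transformers import AutoModelForCausalLM

# Quantize linear-layer weights to int8 at load time (requires bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,   # roughly halves weight memory versus fp16
    device_map="auto",   # spread layers across available devices
)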

Contributing

Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.

License

Open-ChatGPT is released under the Apache 2.0 license.

Acknowledgements

We appreciate the work by many open-source contributors, especially:

Citation

Please cite this repository if you use its data or code.

@misc{open-chatgpt,
  author = {jianzhnie},
  title = {Open-ChatGPT, a chatbot based on Llama model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/jianzhnie/open-chatgpt}},
}
