• Stars
    star
    596
  • Rank 75,095 (Top 2 %)
  • Language
    Jupyter Notebook
  • License
    Apache License 2.0
  • Created over 1 year ago
  • Updated over 1 year ago

Repository Details

Luotuo (骆驼): A Chinese instruction-finetuned LLaMA. Developed by 陈启源 @ Central China Normal University & 李鲁鲁 @ SenseTime & 冷子昂 @ SenseTime

骆驼(Luotuo): Chinese-alpaca-lora

Luotuo is the pinyin (romanized pronunciation) of 骆驼, the Chinese word for camel.

Specifically, this repo is for vanilla Luotuo, a Chinese instruction-finetuned LLaMA that belongs to Project 骆驼 (Luotuo).

Project 骆驼 (Luotuo) was founded by 冷子昂 @ SenseTime, 陈启源 @ Central China Normal University (junior undergraduate), and 李鲁鲁 @ SenseTime.

  • This repo now only contains information about Vanilla-Luotuo, the Chinese finetune of LLaMA. Other LLM work will gradually move to Project 骆驼 (Luotuo).

  • Please visit our home page repo https://github.com/LC1332/Luotuo-Chinese-LLM for more information.

  • If you want a better Chinese language model, see the CamelBell (驼铃) project on the home page.

This is NOT an official product of SenseTime

We named the project Luotuo (Camel) because both the llama and the alpaca belong to Artiodactyla-Camelidae (偶蹄目-骆驼科).

News [ ... ]

[2023-4-4] For Luotuo, we are working on versions 1.0 and 1.3, training with more data and fixing the Chinese tokenizer issue. We will try to align the Chinese performance and run a fairer comparison between the LLaMA and GLM models.

[2023-3-30] We released a Chinese summarization model, CamelBell-C (驼铃-C). Try it via Open In Colab; see more results in the CamelBell repo.

[2023-3-27] We plan to train a ChatHarryPotter. We have just finished the preliminary experiment and have a ver. 0.1 model, but it did not meet our expectations (see this report). We are looking for a Harry Potter enthusiast who writes Python to join us.

[2023-3-25] Luotuo-1.0 is in training! Thanks to all sponsors!

A Quick Start

Name | Colab Link | Detail
CamelBell quick evaluation | Open In Colab | CamelBell (Tuoling) specific evaluation code
A quick evaluation | Open In Colab | Evaluation code with standard HuggingFace pipeline
Bot with Interface | Open In Colab | Interactive chat bot using Gradio
Training Code | To be released | Training code, runs on Colab
Data Translation | Open In Colab | Translates alpaca.json into Chinese

Trained Model

Model Name | Training Data and Setting
luotuo-lora-7b-0.1 | Trained on translated alpaca 52k data
luotuo-lora-7b-0.3 | Trained on translated alpaca 52k data + guanaco, 1 epoch
luotuo-lora-7b-0.9 | (Planned) cleaned alpaca 52k + full guanaco

luotuo-lora-7b-0.3 seems to improve significantly over 0.1, even after just 1 epoch.

However, 1 epoch took 7 hours to train and cost more than $10. We will continue tuning it later.

Qualitative Results (0.3)

Evaluation code is now in notebook/evaluation_code.ipynb
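
For readers who want to try the examples below outside the notebook, here is a minimal sketch of how an instruction can be wrapped into a prompt and how the answer can be pulled back out of the generated text. It assumes the Stanford Alpaca prompt format; whether the Luotuo notebook uses exactly this wording is an assumption, and `build_prompt` and `extract_response` are hypothetical helper names, not functions from this repo.

```python
# Prompt templates in the Stanford Alpaca style.
# Assumption: Luotuo's evaluation code uses the same or a similar format.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Wrap an instruction (and optional input) in the Alpaca-style template."""
    if input_text:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=input_text)
    return PROMPT_NO_INPUT.format(instruction=instruction)


def extract_response(generated: str) -> str:
    """Return the text after the last '### Response:' marker."""
    return generated.split("### Response:")[-1].strip()


if __name__ == "__main__":
    prompt = build_prompt("中国的首都在哪里?")
    # A real run would pass `prompt` to the model; here the answer is appended by hand.
    print(extract_response(prompt + "中国的首都是北京。"))
```

The model call itself is left out on purpose: it depends on the base LLaMA weights and the LoRA adapter, which this sketch does not assume access to.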

Good Examples

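Input: 中国的首都在哪里? (Where is the capital of China?)
Luotuo-Output: 中国的首都是北京。 (The capital of China is Beijing.)
Input: 麦当劳是哪个国家的? (Which country is McDonald's from?)
Luotuo-Output: 美国。 (The United States.)
Input: 2017年美国总统是谁 (Who was the US president in 2017?)
Luotuo-Output: 2017年美国总统是特朗普 (The US president in 2017 was Trump.)
Input: 华中师范大学在哪里 (Where is Central China Normal University?)
Luotuo-0.1-Output: 华中师范大学位于北京 (Central China Normal University is located in Beijing.)
Luotuo-0.3-Output: 华中师范大学在武汉市。 (Central China Normal University is in Wuhan.)

The last example failed in version 0.1.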

Bad Examples

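Input: 商汤科技是做什么的? (What does SenseTime do?)
Luotuo-0.1-Output: 商汤科技是一种技术,它用于创建互联网应用程序和服务。 (SenseTime is a technology used to create Internet applications and services.)
Luotuo-0.3-Output: 商汤科技是一种技术,它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的产品和服务,以增加产品和服务的吸引力。它可以用于创建新的 (SenseTime is a technology that can be used to create new products and services to increase their appeal. The clause "it can be used to create new products and services to increase their appeal" then repeats until the output is cut off.)
Input: 猫和狗的关系是什么 (What is the relationship between cats and dogs?)
Luotuo-Output: 猫和狗之间的关系是友好的。它们通常在一起休息或玩耍。猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。猫和狗之间的关系通常 (The relationship between cats and dogs is friendly. They usually rest or play together. The sentence "The relationship between cats and dogs is usually good, they usually rest or play together" then repeats until the output is cut off.)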
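
Both bad examples share one failure mode: the model loops on a single sentence until the output is cut off. A simple way to flag such outputs automatically is to count how often the same sentence occurs. The sketch below is one possible heuristic, not part of this repo; the function names and the threshold of 3 are illustrative assumptions.

```python
import re
from collections import Counter
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after Chinese or Western sentence-ending punctuation."""
    parts = re.split(r"(?<=[。!?.!?])", text)
    return [p.strip() for p in parts if p.strip()]


def max_sentence_repeats(text: str) -> int:
    """Return how many times the most frequent sentence occurs."""
    sentences = split_sentences(text)
    if not sentences:
        return 0
    return max(Counter(sentences).values())


def looks_degenerate(text: str, threshold: int = 3) -> bool:
    """Flag outputs in which any single sentence occurs `threshold` times or more."""
    return max_sentence_repeats(text) >= threshold


if __name__ == "__main__":
    good = "中国的首都是北京。"
    bad = (
        "猫和狗之间的关系是友好的。它们通常在一起休息或玩耍。"
        + "猫和狗之间的关系通常很好,它们通常在一起休息或玩耍。" * 5
        + "猫和狗之间的关系通常"
    )
    print(looks_degenerate(good), looks_degenerate(bad))
```

A check like this only detects the problem after the fact. At generation time, the HuggingFace `generate` method accepts options such as `repetition_penalty` and `no_repeat_ngram_size`, which may reduce looping; whether they help for this model has not been verified here.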

Training

We have tuned a Chinese LLaMA model based on LLaMA, Stanford Alpaca, Alpaca LoRA, cabrita, and Japanese-Alpaca-LoRA.

The training code is being cleaned up. If you are in a hurry, check the Japanese project and simply change the name of the JSON training data file.
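
For context, LoRA (the method behind Alpaca LoRA) freezes the pretrained weight matrix and trains only a low-rank update to it. The formula below follows the usual LoRA formulation rather than anything specific to this repo; the symbols are the conventional ones and do not come from the Luotuo training code.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained, so the number of trainable parameters per adapted matrix drops from d·k to r·(d + k).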

Data

This project is under construction.

The training code makes only a slight change to Japanese-Alpaca-LoRA.

A. The 0.1 version model was trained on translated data: alpaca_data.json translated into Chinese using the ChatGPT API. We paid around US $30-45 to translate the full dataset into Chinese. The translated data is available (trans_chinese_alpaca_data.json).
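
The translation step can be sketched as a loop over Alpaca-style records that translates each text field. This is an illustration only: the actual translation notebook is not reproduced here, `translate_records` is a hypothetical name, and `fake_translate` is a stand-in for a real ChatGPT API call so that the example runs offline. The field names `instruction`, `input`, and `output` follow the Alpaca data format.

```python
import json
from typing import Callable, Dict, List

# Text fields of an Alpaca-style record.
FIELDS = ("instruction", "input", "output")


def translate_records(
    records: List[Dict[str, str]], translate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Return copies of the records with every non-empty text field translated."""
    translated = []
    for record in records:
        new_record = dict(record)
        for field in FIELDS:
            text = record.get(field, "")
            # Leave empty fields alone so no API call is spent on them.
            new_record[field] = translate(text) if text else text
        translated.append(new_record)
    return translated


def fake_translate(text: str) -> str:
    """Placeholder for a real translation request (assumption: not the repo's code)."""
    return "[zh] " + text


if __name__ == "__main__":
    sample = [
        {"instruction": "Give three tips for staying healthy.", "input": "", "output": "Eat well."}
    ]
    print(json.dumps(translate_records(sample, fake_translate), ensure_ascii=False, indent=2))
```

Passing the translator in as a function keeps the loop independent of any particular API client and makes it easy to add retries or caching around the real call.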

B. We also plan to consider the data in Guanaco, hikariming's alpaca_chinese_dataset, and carbonz0's alpaca-chinese-dataset, and may add it in a later version.

We plan to upload two different models, A and B, because the provider of B claims that the cleaned data will bring significant improvement.
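
When several instruction datasets are combined, as in the "alpaca 52k + guanaco" setting, duplicate or empty instructions are worth filtering out first. The sketch below shows one way to do this; it is an assumption about a reasonable preprocessing step, not the procedure used for any released Luotuo model, and `merge_datasets` is a hypothetical name.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def merge_datasets(*datasets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Concatenate datasets, dropping empty instructions and repeated (instruction, input) pairs."""
    seen = set()
    merged = []
    for dataset in datasets:
        for record in dataset:
            key = (
                normalize(record.get("instruction", "")),
                normalize(record.get("input", "")),
            )
            if not key[0] or key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged


if __name__ == "__main__":
    a = [{"instruction": "中国的首都在哪里?", "input": "", "output": "北京"}]
    b = [
        {"instruction": " 中国的首都在哪里? ", "input": "", "output": "中国的首都是北京。"},
        {"instruction": "麦当劳是哪个国家的?", "input": "", "output": "美国。"},
    ]
    print(len(merge_datasets(a, b)))
```

Earlier datasets win when a pair occurs more than once, so the order of the arguments decides which answer is kept.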

Sponsorships(赞助)

Top 3 Sponsors

Time | Sponsor | Amount
2023/3/28 | 张** | 2000
2023/3/25 | 肖** | 520
2023/3/24 | *潇 | 518

The balance is now 5792. See sponsorship_and_balance.md for the detailed balance.

This was originally an exercise project for us, and we originally planned to train only until version 1.0. However, the enthusiasm of the community exceeded our expectations. If you are willing to sponsor our project, you can scan this QR code and add this Alipay account, leaving your name.

The flow of project funds will be made public. All funds will be used for data annotation, purchase of training compute, or distribution of subsequent peripheral products. Donations of data and compute will also be summarized in the sponsorship table. Backup links: QR code, Alipay account.

TODO and Be a Contributor

It seems that many follow-up tasks remain after the basic version is completed. Developers in the community have made many helpful suggestions, and I have put a longer TODO list in TODO_list.md.

Project under construction

  • translate alpaca json data into Chinese
  • finetuning with lora(model 0.1)
  • release 0.1 model (model A)
  • model to hugging face, GUI demo
  • train lora with more alpaca data(model 0.3)
  • (In progress) train lora with more alpaca data(model 0.9)
  • clean training code
  • write the second phase plan for Luotuo

We plan to use this Luotuo project as the git repository for the entire Chinese LLM project. After the completion of the original Luotuo: LLaMA-LoRA, it will be migrated to Luotuo-vanilla. The CamelBell, Loulan, Silk-Road and other derivative Chinese language model projects will gradually be added to the Luotuo project.

Citation

Please cite the repo if you use the data or code in this repo.

@misc{alpaca,
  author = {Ziang Leng and Qiyuan Chen and Cheng Li},
  title = {Luotuo: An Instruction-following Chinese Language model, LoRA tuning on LLaMA},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LC1332/Chinese-alpaca-lora}},
}

More Repositories

1

Luotuo-Chinese-LLM

骆驼(Luotuo): Open-source Chinese language models. Developed by 陈启源 @ Central China Normal University & 李鲁鲁 @ SenseTime & 冷子昂 @ SenseTime
Jupyter Notebook
3,627
star
2

Chat-Haruhi-Suzumiya

Chat凉宫春日 (Chat Haruhi Suzumiya), an open-source role-playing chatbot by Cheng Li, Ziang Leng, and others.
Jupyter Notebook
1,812
star
3

Luotuo-Text-Embedding

Luotuo Embedding (骆驼嵌入) is a text embedding model developed by 李鲁鲁, 冷子昂, 陈启源, 蒟蒻, and others.
Jupyter Notebook
257
star
4

CamelBell-Chinese-LoRA

CamelBell (驼铃) is a Chinese language tuning project based on LoRA. CamelBell belongs to Project Luotuo (骆驼), an open-source Chinese LLM project created by 冷子昂 @ SenseTime & 陈启源 @ Central China Normal University & 李鲁鲁 @ SenseTime
Jupyter Notebook
172
star
5

Zero-Haruhi

A plan to extend ChatHaruhi into a zero-shot role-playing model
Jupyter Notebook
90
star
6

Luotuo-QA

Luotuo QA (骆驼QA), a Chinese large language model for reading comprehension.
Jupyter Notebook
71
star
7

Haruhi-2-Dev

Just for debug
Jupyter Notebook
56
star
8

Luotuo-Silk-Magic-Book

The Silk Magic Book will record the magic prompts for some very large LLMs. The Silk Magic Book belongs to Project Luotuo (骆驼), which was created by 李鲁鲁, 冷子昂, 陈启源
Jupyter Notebook
38
star
9

Luotuo-Silk-Road

Silk Road will be the dataset zoo for Luotuo (骆驼). Luotuo is an open-source Chinese LLM project founded by 陈启源 @ Central China Normal University & 李鲁鲁 @ SenseTime & 冷子昂 @ SenseTime
Python
38
star
10

Luotuo-Fighter

Luotuo Fighter (骆驼大乱斗): Massive Game Content Generated by LLM
Jupyter Notebook
18
star
11

NovelLearner

Prompt experiments and results for writing
9
star
12

Luotuo-Paper-Reading

Luotuo Paper Reading (骆驼读论文): paper reading for Chinese large language models.
8
star
13

Loulan-Chinese-Text-Summarization

Loulan (楼兰) will be a Chinese text summarization project using a modern large language model trained with LoRA. Loulan belongs to Luotuo (骆驼), an open-source Chinese LLM project created by 陈启源 @ Central China Normal University & 李鲁鲁 @ SenseTime & 冷子昂 @ SenseTime
5
star
14

Embed-Adapter

Embedding adapter between the BGE family and OpenAI etc.
Jupyter Notebook
4
star
15

Suzumiya-Diffusion-Learning

SD (Stable Diffusion) learning for the Haruhi (春日) community
Jupyter Notebook
3
star
16

Needy-Haruhi

AIGC-Galgame via Dynamic Memory
Jupyter Notebook
1
star
17

CourseraScrollViewDemoBackUp

Tot coursera swift course demo
1
star
18

Hand_Detection

Pixel Level Hand Detection
1
star
19

VRTK-Vive-EmptyTemplate

Unity Setup for VRTK & SteamVR
C#
1
star
20

VRTKSuperHot

A VRTK4 + Vive version of superHot like game demo
C#
1
star
21

LcBasic

Some basic code that 李鲁鲁 uses day to day
Python
1
star
22

FruitNinjaVRTK

A VR fruitNinja built with VRTK4
C#
1
star
23

VRTKBoxing

VR BoxingDemo with VRTK4 + Vive
C#
1
star
24

Learn-Python-with-GPT

Teacher 李鲁鲁's Copilot-Python learning. Co-evolving with large language models such as ChatGPT.
Jupyter Notebook
1
star
25

simple-face-recognition

Face Recognition Baseline Revisited
Jupyter Notebook
1
star
26

courseraScrollView

scroll view demo
Swift
1
star
27

Lubao-KidLearn

Public showcase page for the Lubao early-education machine (鲁宝早教机)
1
star