# MLE-LLaMA: Multi-Language Enhanced LLaMA
This project aims to make LLaMA understand Chinese and generate fluent Chinese. Our motivation is that LLaMA has already learned strong English expression, and a small amount of alignment prompting can make it capture Chinese as well.
- Token vocabulary support for multi-language. We found that the LLaMA tokenizer naturally supports Chinese; a quick sanity check is sketched below.
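
  The following is a minimal sketch of that check, assuming the converted Hugging Face checkpoint lives under `ckpt` (see the fine-tuning steps below): encode a Chinese sentence and confirm it round-trips through the tokenizer.

  ```python
  # Minimal sketch: verify the stock LLaMA tokenizer handles Chinese text.
  # Assumes a converted Hugging Face checkpoint in the `ckpt` directory.
  from transformers import LlamaTokenizer

  tokenizer = LlamaTokenizer.from_pretrained("ckpt")

  text = "今天天气很好。"  # "The weather is nice today."
  ids = tokenizer(text)["input_ids"]
  print(ids)  # Chinese maps to valid token ids without extending the vocabulary
  print(tokenizer.decode(ids, skip_special_tokens=True))  # round-trips to the input
  ```
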
- Fine-tuning LLaMA scripts.
  1. Download the original checkpoints from Hugging Face and put them under the `ckpt` directory.
  2. `train.py`: full fine-tuning. The original script must be run on an 80GB A100, and additional techniques should be employed to reduce memory usage.
  3. `train_lora.py`: LoRA fine-tuning using PEFT, with the argument values below.

  | Argument | Value |
  | --- | --- |
  | batch size | 128 * 8 |
  | epochs | 3 |
  | cut length | 256 |
  | learning rate | 2e-5 |
  | speed | 1.02 s/it |
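
  As a rough sketch (not the project's `train_lora.py`), this is what a PEFT LoRA setup mirroring the table could look like; the rank, alpha, target modules, and output directory are assumptions.

  ```python
  import torch
  from transformers import LlamaForCausalLM, TrainingArguments
  from peft import LoraConfig, get_peft_model

  # Base model from the `ckpt` directory prepared in step 1.
  model = LlamaForCausalLM.from_pretrained("ckpt", torch_dtype=torch.float16)

  # Attach LoRA adapters; r/alpha/dropout/target_modules are assumed values.
  lora_config = LoraConfig(
      task_type="CAUSAL_LM",
      r=8,
      lora_alpha=16,
      lora_dropout=0.05,
      target_modules=["q_proj", "v_proj"],
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()  # only adapter weights require grad

  # Values from the table above: batch size 128 on each of 8 GPUs, 3 epochs,
  # learning rate 2e-5; the data pipeline (cutting sequences to 256 tokens)
  # and the Trainer call are omitted here.
  training_args = TrainingArguments(
      output_dir="lora-out",  # assumed path
      per_device_train_batch_size=128,
      num_train_epochs=3,
      learning_rate=2e-5,
      fp16=True,
  )
  ```
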
- Fine-grained English-Chinese alignment dataset. We collected high-quality English-Chinese pairs, which can be downloaded from Google Drive. We also found that BELLE provides checkpoints and a Chinese dataset; we strongly recommend referring to it.
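
  For illustration only, one way such a pair could be flattened into a training string is shown below; the field names and prompt template are hypothetical, not the released dataset's schema.

  ```python
  # Hypothetical record layout for one English-Chinese pair; the actual
  # dataset on Google Drive may use different fields and formatting.
  pair = {"en": "The weather is nice today.", "zh": "今天天气很好。"}

  def build_example(pair: dict) -> str:
      """Flatten a pair into a translation-style prompt for alignment training."""
      return (
          "Translate the following English text to Chinese.\n"
          f"English: {pair['en']}\n"
          f"Chinese: {pair['zh']}"
      )

  print(build_example(pair))
  ```
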
- Instruction tuning. We use Chinese Alpaca and the GuanacoDataset for instruction tuning.
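
  For reference, the prompt template below follows the Stanford Alpaca convention [2]; whether this exact template is used here is an assumption, and the sample record is invented.

  ```python
  # Stanford Alpaca prompt template (no-input variant, see [2]).
  ALPACA_TEMPLATE = (
      "Below is an instruction that describes a task. "
      "Write a response that appropriately completes the request.\n\n"
      "### Instruction:\n{instruction}\n\n"
      "### Response:\n{output}"
  )

  sample = {
      "instruction": "用中文介绍一下长城。",  # "Introduce the Great Wall in Chinese."
      "output": "长城是中国古代修建的大型军事防御工程。",
  }
  print(ALPACA_TEMPLATE.format(**sample))
  ```
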
- Open-source checkpoints, Gradio scripts, and cases. We found that the LLaMA model tends to generate long sentences.
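
  A minimal Gradio demo sketch, assuming a fine-tuned checkpoint under `ckpt`; the generation settings are assumptions, with `max_new_tokens` capped because of the long-sentence tendency noted above.

  ```python
  import gradio as gr
  import torch
  from transformers import LlamaForCausalLM, LlamaTokenizer

  tokenizer = LlamaTokenizer.from_pretrained("ckpt")
  model = LlamaForCausalLM.from_pretrained(
      "ckpt", torch_dtype=torch.float16, device_map="auto"
  )

  def generate(prompt: str) -> str:
      inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
      # Bound the response length, since the model tends to run long.
      output_ids = model.generate(
          **inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9
      )
      return tokenizer.decode(output_ids[0], skip_special_tokens=True)

  gr.Interface(fn=generate, inputs="text", outputs="text").launch()
  ```
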
## Reference
[1] https://github.com/facebookresearch/llama
[2] https://github.com/tatsu-lab/stanford_alpaca
[3] https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling