LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model [Paper]
[3/18] We release a new project named Mipha
Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [Paper] [Code]
Our Mipha-3B outperforms many existing 3B MLLMs, including Bunny-3B and MobileVLM-v2, while using much less training data. We also analyze the design space of small multimodal models and report several new findings. Check out our paper and give it a try!
[1/26] Our model weights are now available for download.
[1/15] Our model and training code are released.
[1/5] Our code is currently undergoing internal review and will be released shortly (expected next week).
- Clone this repository and navigate to the `llava-phi` folder:
```shell
git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi
```
- Install the package:
```shell
conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
Download the model weights at huggingface.
The training curves can be found at wandb.
LLaVA-Phi training consists of two stages: (1) feature alignment stage: use the LLaVA-1.5 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: use 150K GPT-generated multimodal instruction-following data, plus around 515K VQA data from academic-oriented tasks, to teach the model to follow multimodal instructions.
We use a similar set of hyperparameters to LLaVA-1.5 in both the pretraining and finetuning phases; they are listed below.
- Pretraining
| Model | Global batch size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-Phi | 256 | 1e-3 | 1 | 2048 | 0 |
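When reproducing pretraining on different hardware, keep in mind that the global batch size of 256 equals per-device batch size × number of GPUs × gradient-accumulation steps. The decomposition below (32 per device on 8 GPUs, no accumulation) is only an illustrative assumption, not necessarily the authors' exact launch configuration:

```shell
# Hypothetical split of the pretraining global batch size (256).
PER_DEVICE_BATCH=32   # per-GPU batch size (assumed)
NUM_GPUS=8            # number of GPUs (assumed)
GRAD_ACCUM=1          # gradient-accumulation steps (assumed)
GLOBAL_BATCH=$(( PER_DEVICE_BATCH * NUM_GPUS * GRAD_ACCUM ))
echo "global batch size: ${GLOBAL_BATCH}"   # global batch size: 256
```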
- Finetuning
| Model | Global batch size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-Phi | 128 | 2e-5 | 1 | 2048 | 0 |
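The same identity applies to the finetuning global batch size of 128: with fewer GPUs you can keep it constant by raising the gradient-accumulation steps. Both splits below are illustrative assumptions, not settings taken from the released scripts:

```shell
# Two hypothetical ways to reach the finetuning global batch size (128).
# 8 GPUs: 16 per device, no gradient accumulation.
BATCH_8GPU=$(( 16 * 8 * 1 ))
# 4 GPUs: 16 per device, 2 gradient-accumulation steps.
BATCH_4GPU=$(( 16 * 4 * 2 ))
echo "${BATCH_8GPU} ${BATCH_4GPU}"   # 128 128
```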
Our base model is phi-2. You should download the weights from here and change the `--model_name_or_path` in `get_base_model.sh`.
Our vision encoder is CLIP ViT-L/14 at 336px. You should download the weights from here.
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions from here.
Then, you should integrate phi-2 and ViT-L/14 336px into a single model by running the following script:
```shell
bash ./script/llava_phi/get_base_model.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./base_checkpoints_llava_phi
```
Then launch pretraining and copy the image-processor config into the pretrained checkpoint:

```shell
bash ./scripts/llava_phi/pretrain.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./checkpoints/llavaPhi-v0-3b-pretrain
```
Please refer here to prepare the instruction tuning data.
Training script with DeepSpeed ZeRO-3: `finetune.sh`.

```shell
bash ./scripts/llava_phi/finetune.sh
cp ./openai/clip-vit-large-patch14-336/preprocessor_config.json ./checkpoints/llavaPhi-v0-3b-finetune
```
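A stage is only usable downstream if the CLIP preprocessor config was actually copied next to the checkpoint. The small helper below is our own sanity check (not part of the repo), with the checkpoint path taken from the commands above:

```shell
# Helper (ours, not part of the repo): report whether a checkpoint
# directory contains the copied CLIP preprocessor config.
check_ckpt() {
  if [ -f "$1/preprocessor_config.json" ]; then
    echo present
  else
    echo missing
  fi
}

# Example: check the finetuned checkpoint produced by finetune.sh.
check_ckpt ./checkpoints/llavaPhi-v0-3b-finetune
```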
To ensure reproducibility, we evaluate the models with greedy decoding.
See Evaluation.md.
If you find LLaVA-Phi useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{zhu2024llavaphi,
      title={LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model},
      author={Yichen Zhu and Minjie Zhu and Ning Liu and Zhicai Ou and Xiaofeng Mou and Jian Tang},
      year={2024},
      eprint={2401.02330},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{zhu2024comprehensive,
      title={A Comprehensive Overhaul of Multimodal Assistant with Small Language Models},
      author={Zhu, Minjie and Zhu, Yichen and Liu, Xin and Liu, Ning and Xu, Zhiyuan and Shen, Chaomin and Peng, Yaxin and Ou, Zhicai and Feng, Feifei and Tang, Jian},
      journal={arXiv preprint arXiv:2403.06199},
      year={2024}
}
```
We build our project based on:
- LLaVA: an amazing open-source project for vision-language assistants.
- LLaMA-Factory: we use this codebase to finetune the Phi model.