Honeybee: Locality-enhanced Projector for Multimodal LLM
This is an official PyTorch implementation of Honeybee: Locality-enhanced Projector for Multimodal LLM by Junbum Cha*, Wooyoung Kang*, Jonghwan Mun*, and Byungseok Roh. [paper]
Catalog
Coming soon:
- arXiv
- Inference code
- Checkpoints
- Training code
Selected Examples
Environment
- PyTorch 2.0.1
pip install -r requirements.txt
# additional requirements for demo
pip install -r requirements_demo.txt
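To quickly verify the setup, a sanity check like the following can be run (a minimal sketch; it only confirms the PyTorch version and GPU visibility):
# check the installed PyTorch version and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"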
Model Zoo
We use MMB, MME, SEED-Bench, and LLaVA-Bench (in-the-wild) for model evaluation.
MMB, SEED-I, and LLaVA-w denote the MMB dev split, the image-only subset of SEED-Bench, and LLaVA-Bench (in-the-wild), respectively.
- Comparison with other SoTA methods (Table 6)
Model | Checkpoint | MMB | MME | SEED-I | LLaVA-w |
---|---|---|---|---|---|
Honeybee-C-7B-M144 | download | 70.1 | 1891.3 | 64.5 | 67.1 |
Honeybee-D-7B-M144 | download | 70.8 | 1835.5 | 63.8 | 66.3 |
Honeybee-C-13B-M256 | download | 73.2 | 1944.0 | 68.2 | 75.7 |
Honeybee-D-13B-M256 | download | 73.5 | 1950.0 | 66.6 | 72.9 |
- Pushing the limits of Honeybee (Table 7)
Model | Checkpoint | MMB | MME | SEED-I | LLaVA-w | ScienceQA |
---|---|---|---|---|---|---|
Honeybee-C-7B-M256 | download | 71.0 | 1951.3 | 65.5 | 70.6 | 93.2 |
Honeybee-C-13B-M576 | download | 73.6 | 1976.5 | 68.6 | 77.5 | 94.4 |
Evaluation
Data and Checkpoints Preparation
Please follow the official guidelines to prepare benchmark datasets: MMB, MME, SEED-Bench, ScienceQA, and OwlEval. Then, organize the data and checkpoints as follows:
data
├── MMBench
│   ├── mmbench_dev_20230712.tsv    # MMBench dev split
│   └── mmbench_test_20230712.tsv   # MMBench test split
│
├── MME
│   ├── OCR                         # Directory for OCR subtask
│   ├── ...
│   └── text_translation
│
├── SEED-Bench
│   ├── SEED-Bench-image            # Directory for image files
│   └── SEED-Bench.json             # Annotation file
│
├── ScienceQA
│   ├── llava_test_QCM-LEPA.json    # Test split annotation file
│   ├── text                        # Directory for meta data
│   │   ├── pid_splits.json
│   │   └── problems.json
│   └── images                      # Directory for image files
│       └── test
│
└── OwlEval
    ├── questions.jsonl             # Question annotations
    └── images                      # Directory for image files
checkpoints
├── 7B-C-Abs-M144
├── 7B-C-Abs-M256
├── 7B-D-Abs-M144
├── 13B-C-Abs-M256
├── 13B-C-Abs-M576
└── 13B-D-Abs-M256
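Before launching evaluation, a quick check like the one below can confirm the layout matches the trees above (a minimal sketch; paths are relative to the repository root):
# confirm that the expected data and checkpoint directories exist
for p in data/MMBench data/MME data/SEED-Bench data/ScienceQA data/OwlEval checkpoints; do
  [ -e "$p" ] && echo "ok      $p" || echo "MISSING $p"
done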
Evaluation
torchrun --nproc_per_node=auto --standalone eval_tasks.py \
--ckpt_path checkpoints/7B-C-Abs-M144/last \
--config \
configs/tasks/mme.yaml \
configs/tasks/mmb.yaml \
configs/tasks/seed.yaml \
configs/tasks/sqa.yaml
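The --config option above accepts multiple task configs; assuming it also works with a single config, one benchmark (e.g., MME) can be evaluated in isolation:
# evaluate a single benchmark only (hedged example based on the command above)
torchrun --nproc_per_node=auto --standalone eval_tasks.py \
    --ckpt_path checkpoints/7B-C-Abs-M144/last \
    --config configs/tasks/mme.yaml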
Strict reproduction of official results
We utilized batch inference in our evaluation to accelerate experiments. Batch inference does not significantly change the average scores, but individual scores may vary slightly (about ±0.1–0.2). To strictly reproduce the official results, 8 devices (GPUs) are required; the number of devices influences batch construction, which affects the final scores.
We used the default batch size specified in each task config, except for the largest model (Honeybee-C-13B-M576), where we used a batch size of 8 due to memory constraints.
Inference and Demo
Example code for the inference is provided in inference_example.ipynb.
The example images in ./examples are adopted from mPLUG-Owl.
We also provide a Gradio demo:
python -m serve.web_server --bf16 --port {PORT} --base-model checkpoints/7B-C-Abs-M144/last
Citation
@article{cha2023honeybee,
  title={Honeybee: Locality-enhanced Projector for Multimodal LLM},
  author={Junbum Cha and Wooyoung Kang and Jonghwan Mun and Byungseok Roh},
  journal={arXiv preprint arXiv:2312.06742},
  year={2023}
}
License
The source code is licensed under the Apache 2.0 License.
The pretrained weights are licensed under the CC-BY-NC 4.0 License.
Acknowledgement: this project is developed based on mPLUG-Owl, which is also released under the Apache 2.0 License.
Disclaimer
Kakao Brain "Honeybee" is the name of the Multimodal Large Language Model (MLLM) open source project, not the customer service brand.