

LLaVAR

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

Project Page

arXiv Link

Demo


@misc{zhang2023llavar,
    title={LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding},
    author={Yanzhe Zhang and Ruiyi Zhang and Jiuxiang Gu and Yufan Zhou and Nedim Lipka and Diyi Yang and Tong Sun},
    year={2023},
    eprint={2306.17107},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

[UPDATE 07/21] Release the metadata of the LAION images we used: pretrain/finetune.

[UPDATE 07/12] Release the OCR evaluation results/script on the MME benchmark. LLaVAR increases the OCR score of LLaVA from 50 to 80.

[UPDATE 07/05] Data available on Huggingface 🤗.

[UPDATE 07/05] Model Weight Delta on Huggingface 🤗.

[UPDATE 06/29] Initial Release.

The main difference between our code and LLaVA's code is that we modified the training/testing/serving files to support Vicuna v1.1, which uses '</s>' as the separator instead of '###'.
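
As a rough illustration of that difference (a minimal sketch, not the actual template code, which lives in LLaVA's conversation module), the two separator conventions behave roughly like this:

def build_prompt(system, turns, vicuna_v1_1=True):
    # turns: list of (role, message) pairs, e.g. [("USER", "Hi"), ("ASSISTANT", "Hello!")]
    if vicuna_v1_1:
        seps = [" ", "</s>"]   # Vicuna v1.1: a space after user turns, "</s>" closes each assistant reply
    else:
        seps = ["###", "###"]  # older v0-style templates: "###" separates every turn
    prompt = system + seps[0]
    for i, (role, message) in enumerate(turns):
        prompt += f"{role}: {message}{seps[i % 2]}"
    return prompt

print(build_prompt("A chat between a curious user and an AI assistant.",
                   [("USER", "Hi"), ("ASSISTANT", "Hello!")]))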

Environment Setup

Please prepare the environment and merge the model weight delta by following LLaVA.

Model Weight Delta: Google Drive, Huggingface

This should be merged with LLaMA-13B.

After merging, please add "v1" to your folder name and make sure the conversation mode "llava_v1" is used.
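
Conceptually, applying a weight delta just adds the released delta tensors to the base LLaMA-13B weights. The sketch below illustrates the idea with placeholder single-file checkpoints; real checkpoints are sharded and include added tokens, so use LLaVA's official apply_delta script for the actual merge.

import torch

# Placeholder paths; real checkpoints are sharded across multiple files.
base = torch.load("/path/to/llama-13b/pytorch_model.bin", map_location="cpu")
delta = torch.load("/path/to/llavar-delta/pytorch_model.bin", map_location="cpu")

merged = {}
for name, tensor in delta.items():
    if name in base and base[name].shape == tensor.shape:
        merged[name] = base[name] + tensor   # delta is stored as (finetuned - base)
    else:
        merged[name] = tensor                # e.g. resized embeddings or new projector weights

torch.save(merged, "/path/to/llavar-13b-v1/pytorch_model.bin")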

Training Data (Huggingface)

Our image data has already been transformed into LLaVA's pretraining/finetuning format (the images carry "fake" file names in the CC3M and COCO naming style). You can download them and merge them into the LLaVA training sets.

Our instruction files, on the other hand, already include LLaVA's instructions, so they can be used as-is.
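
For reference, each record follows LLaVA's conversation JSON layout; the sketch below (with made-up values and placeholder paths) shows the expected shape and a quick check that every referenced image exists after the image folders are merged.

import json, os

example_record = {
    "id": "000000001",
    "image": "000000001.jpg",  # "fake" CC3M/COCO-style file name
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does the text in this image say?"},
        {"from": "gpt", "value": "It says ..."},
    ],
}

data = json.load(open("/path/to/chat_llavar.json"))
image_folder = "/path/to/cc3m"
missing = [r["image"] for r in data
           if not os.path.exists(os.path.join(image_folder, r["image"]))]
print(f"{len(data)} records, {len(missing)} missing images")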

Pretraining Images: Google Drive

Pretraining Instructions (595K + 422K): Google Drive

Finetuning Images: Google Drive

Finetuning Instructions (158K + 16K): Google Drive

Finetuning Instructions (158K + 20K): Google Drive

Evaluation Data (Huggingface)

We collect 50 instruction-following questions and answers on 50 text-rich images from LAION, which can be leveraged for GPT-4-based instruction-following evaluation (see the sketch after the file list below).

Evaluation Images: Google Drive

GPT-4 Evaluation Contexts (595K + 422K): File

GPT-4 Evaluation Rules: File

Questions: File

GPT-4 Answers: File
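
These files plug together in the usual LLaVA-style GPT-4 review loop: each question is paired with its image's context, the evaluation rules, and two candidate answers, and GPT-4 is asked to score both. A minimal sketch follows; the file names and jsonl field names (such as "text") are assumptions, and the GPT-4 call itself is left as a stub.

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

contexts = load_jsonl("contexts.jsonl")        # placeholder file names
questions = load_jsonl("questions.jsonl")
answers_ref = load_jsonl("gpt4_answers.jsonl")
answers_model = load_jsonl("model_answers.jsonl")
rules = open("rules.txt").read()

for ctx, q, a1, a2 in zip(contexts, questions, answers_ref, answers_model):
    review_prompt = (f"[Context]\n{ctx['text']}\n\n[Question]\n{q['text']}\n\n"
                     f"[Assistant 1]\n{a1['text']}\n\n[Assistant 2]\n{a2['text']}\n\n"
                     f"[System]\n{rules}\n")
    # scores = ask_gpt4(review_prompt)  # send to GPT-4 and parse the two scores (stub)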

Training Script

You should merge our pretraining images into the cc3m folder.
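
For example (placeholder paths; the same pattern applies to the finetuning images and the COCO folder further below):

import pathlib, shutil

src = pathlib.Path("/path/to/llavar_pretrain_images")  # placeholder paths
dst = pathlib.Path("/path/to/cc3m")
for img in src.iterdir():
    shutil.copy2(img, dst / img.name)

With the images in place, the pretraining command is: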

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/chat_llavar.json \
    --image_folder /path/to/cc3m \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 4000 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb

You should merge our finetuning images into the coco2017 folder.

torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
    /path/to/LLaVA/llava/train/train_mem.py \
    --model_name_or_path /path/to/models/vicuna_13b_v1_1 \
    --data_path /path/to/llava_instruct_150k_llavar_16k.json \
    --image_folder /path/to/coco/images/train2017 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /path/to/mm_proj/llava-13b-pretrain.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end True \
    --bf16 True \
    --output_dir /path/to/checkpoint \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 8000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --image_aspect_ratio 'pad' \
    --report_to wandb

Evaluation Script

Instruction-following on COCO images.

python /path/to/LLaVA/llava/eval/model_vqa.py \
    --model-name /path/to/checkpoint \
    --question-file \
    /path/to/LLaVA/playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
    --image-folder \
    /path/to/coco2014/val2014 \
    --answers-file \
    /path/to/qa90-answer-file.jsonl \
    --conv-mode "llava_v1"

Instruction-following on a given image URL.

python -m llava.eval.run_llava \
    --model-name /path/to/checkpoint \
    --image-file "https://cdn.shopify.com/s/files/1/0057/3728/3618/products/a-man-called-otto_ezrjr0pm_480x.progressive.jpg" \
    --query "Who starred in the movie?"

For text-based VQA (from MultimodalOCR): after cloning their repo and preparing the data, put ./MultimodalOCR/Eval_LLaVAR.py into /your/path/to/MultimodalOCR/models/LLaVA/ and add our model to /your/path/to/MultimodalOCR/eval.py for evaluation.
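
For a quick sanity check outside their harness, a generic containment-style scorer (not MultimodalOCR's actual evaluator; the file name and fields below are assumptions) can be as simple as:

import json

# Assumed prediction file: one JSON object per line with "answer" and "prediction" fields.
with open("predictions.jsonl") as f:
    preds = [json.loads(line) for line in f]

hits = sum(str(p["answer"]).lower() in str(p["prediction"]).lower() for p in preds)
print(f"accuracy: {hits / len(preds):.3f}")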

Acknowledgement

The code base is mainly from the LLaVA project. Our evaluation is also built on the MultimodalOCR project.

For a better language decoder, you can also keep an eye on recent Vicuna model updates.

@article{liu2023llava,
    author      = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    title       = {Visual Instruction Tuning},
    publisher   = {arXiv:2304.08485},
    year        = {2023}
}

@misc{liu2023hidden,
    title={On the Hidden Mystery of OCR in Large Multimodal Models},
    author={Yuliang Liu and Zhang Li and Hongliang Li and Wenwen Yu and Yang Liu and Biao Yang and Mingxin Huang and Dezhi Peng and Mingyu Liu and Mingrui Chen and Chunyuan Li and Xucheng Yin and Cheng-lin Liu and Lianwen Jin and Xiang Bai},
    year={2023},
    eprint={2305.07895},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
    url = {https://lmsys.org/blog/2023-03-30-vicuna/},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}
