• Stars: 172
• Rank: 221,201 (Top 5%)
• Language: Python
• Created: over 1 year ago
• Updated: over 1 year ago

Repository Details

Open LLaMA Eyes to See the World

This project aims to optimize the LLaMA model for visual information understanding, in the spirit of GPT-4, and to further explore the potential of large language models.

Generally, we use a CLIP vision encoder to extract image features, which are then projected into the text embedding dimensionality by an MLP-based or Transformer-based connection network. The visual representation (wrapped in the additional special tokens [boi] and [eoi]) is then concatenated with the text representation and learned in an autoregressive manner. The framework is similar to Kosmos-1 and PaLM-E.
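
To make the architecture concrete, here is a minimal PyTorch sketch of the projection-and-concatenation step, assuming an MLP-based connection network and pooled CLIP features. VisualConnector, its default dimensions, and the learnable boundary embeddings are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Hypothetical MLP-based connection network: projects pooled CLIP
    features into the LLaMA embedding space and wraps them with the
    special [boi] / [eoi] boundary tokens."""

    def __init__(self, clip_dim=1024, llama_dim=4096, num_tokens=10):
        super().__init__()
        self.num_tokens = num_tokens
        # Map one pooled CLIP feature to `num_tokens` LLaMA-sized embeddings
        # (matching the image sequence length of 10 in the tables below).
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim * num_tokens),
        )
        # Learnable embeddings for the [boi] and [eoi] special tokens.
        self.boi = nn.Parameter(torch.randn(llama_dim))
        self.eoi = nn.Parameter(torch.randn(llama_dim))

    def forward(self, clip_feats, text_embeds):
        # clip_feats: (batch, clip_dim); text_embeds: (batch, seq, llama_dim)
        b = clip_feats.size(0)
        img = self.proj(clip_feats).view(b, self.num_tokens, -1)
        boi = self.boi.expand(b, 1, -1)
        eoi = self.eoi.expand(b, 1, -1)
        # [boi] <image tokens> [eoi] <text tokens>, learned autoregressively.
        return torch.cat([boi, img, eoi, text_embeds], dim=1)
```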

  • Code adjustments to support multi-modal generation. Download the CLIP and LLaMA models from Hugging Face; we have verified that the scripts are also compatible with other LLaMA model sizes. Use the script preprocess.py to prepare the data.

  • Supervised training stage: freeze the LLaMA and CLIP encoder models and optimize only the connection network. In this stage, we use the COCO, CC-3M, and COYO-700M datasets with the training script train.py. The table below lists the training hyper-parameters used in our experiments on an A100 GPU (80 GB); a minimal sketch of this stage appears after this list. We also evaluate image captioning performance on the COCO test set.

    Argument                 Value
    batch size               1 * 8 * 8
    epochs                   3
    cut length               256
    learning rate            4e-3
    image sequence length    10
  • Instruction tuning stage: fine-tune the full model on a mix of VQA and language-only instruction datasets. We use the LoRA strategy to optimize the entire model with the fine-tuning script finetune.py (see the LoRA sketch after this list).

    Argument                 Value
    batch size               1024
    epochs                   3
    cut length               256
    learning rate            2e-5
    image sequence length    10
  • Open-source the trained checkpoint on Hugging Face and a Gradio interface for multi-modal generation.
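
For the supervised training stage, a minimal sketch of the freezing setup might look as follows, assuming the Hugging Face transformers library and the hypothetical VisualConnector from the sketch above; the checkpoint IDs are examples, not the project's pinned models.

```python
import torch
from transformers import CLIPVisionModel, LlamaForCausalLM

# Example checkpoints; substitute the models you downloaded.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llama = LlamaForCausalLM.from_pretrained("huggyllama/llama-7b")

# Supervised stage: freeze LLaMA and the CLIP encoder so that only the
# connection network receives gradients.
for p in vision.parameters():
    p.requires_grad = False
for p in llama.parameters():
    p.requires_grad = False

connector = VisualConnector()  # hypothetical module from the earlier sketch
optimizer = torch.optim.AdamW(connector.parameters(), lr=4e-3)  # per the table
```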
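
The instruction tuning stage's LoRA setup could then continue from the previous sketch using the peft library; the rank, alpha, target modules, and dropout below are illustrative assumptions rather than the settings in finetune.py.

```python
import torch
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyper-parameters; finetune.py may use different values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# Wrap the LLaMA model from the previous sketch with LoRA adapters.
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()
optimizer = torch.optim.AdamW(llama.parameters(), lr=2e-5)  # per the table
```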

Reference

[1] https://github.com/facebookresearch/llama

[2] https://github.com/tloen/alpaca-lora

More Repositories

1. MLE-LLaMA: Multi-language Enhanced LLaMA (Python, 301 stars)
2. IEA: Image Editing Anything (Python, 107 stars)
3. DiS: Scalable Diffusion Models with State Space Backbone (Python, 101 stars)
4. Video-Stable-Diffusion: Generate consistent videos with stable diffusion models (Python, 45 stars)
5. Gradient-Free-Textual-Inversion: Gradient-Free Textual Inversion for Personalized Text-to-Image Generation (Python, 33 stars)
6. Stable-Edit: Text-based real image editing with stable diffusion models (Python, 25 stars)
7. Perceiver-Music-Generation: Music generation with the Perceiver AR model (Python, 24 stars)
8. DeeCap: Dynamic Early Exit for Image Captioning (Python, 16 stars)
9. Vespa: Video Diffusion State Space Models (Python, 15 stars)
10. Visual-ChatGLM: Open ChatGLM Eyes to See the World (Python, 13 stars)
11. PNAIC: Partially Non-Autoregressive Image Captioning (Python, 10 stars)
12. AIO: All In One: General Multimodal Large Language Model (Python, 9 stars)
13. Future-Caption: Efficient modeling of future context for image captioning (Python, 8 stars)
14. Meta-Ensemble: Meta-Ensemble Parameter Learning (Python, 8 stars)
15. Image-Caption-Pytorch: PyTorch implementation of an image captioning baseline model (Python, 8 stars)
16. UAIC: Uncertainty-aware image caption generation (Python, 7 stars)
17. Dialogue-System: Multi-modal dialogue system (Python, 5 stars)
18. Latent-Dynamics: Exploring latent dynamics for visual storytelling (Python, 4 stars)
19. MaskGMT: Masked generative music transformer (Python, 4 stars)
20. Matrix-Analysis-and-Application: References and coding homework for the matrix analysis and application course at UCAS (Python, 3 stars)
21. Cleaned-Webvid: Strategies to clean the WebVid-10M dataset (Python, 3 stars)
22. Diverse-Image-Caption: Promoting Coherence and Diversity in Image Captioning (Python, 3 stars)
23. Visual-MOSS: Making the MOSS model understand visual information (Python, 3 stars)
24. ACSG: Actor-Critic Sequence Generation for Relative Difference Captioning (2 stars)
25. LQMA: Language Quantized Masked AutoEncoders (Python, 2 stars)
26. DSC: Descriptive synthetic captions in DALL-E 3 (2 stars)
27. feizc (2 stars)
28. MAIC: Memory augmented image captioning (Python, 2 stars)
29. SAIC: Semi-Autoregressive Image Captioning (2 stars)
30. arXiv-MM: Multimodal dataset for arXiv (Python, 2 stars)
31. DiffuCap: Controllable Image Captioning with Diffusion Model (2 stars)
32. Union: Unifying Language-Image Pre-training via Single-Tower Transformer (Python, 2 stars)
33. AAT: Attention-Aligned Transformer for Image Captioning (Python, 2 stars)
34. CLIP-MAE: When CLIP meets MAE and beyond (Python, 2 stars)
35. Chinese-Image-Caption: An image captioner for Chinese (Python, 2 stars)
36. ViD: Text-to-Image Diffusion Models as Refined Visual Learners (Python, 1 star)
37. Meta-ViT: Meta-ensemble parameter learning for Vision Transformer (Python, 1 star)
38. ClipCap: Incorporating CLIP features into Transformer-based image captioning (Python, 1 star)
39. CLKA: Cross Lingual Knowledge Alignment for Stable Diffusion Models (Python, 1 star)
40. Diffusion-Model: A tutorial on diffusion models for text-guided image generation (Python, 1 star)
41. LLaMA-XL: LLaMA model Beyond Length Limitation (1 star)
42. GameTag: Official implementation of the GameTag algorithm (Python, 1 star)
43. MoE-MLLM: Mixture-of-Experts for Multimodal Large Language Models (Python, 1 star)