• Stars: 172
• Rank: 221,201 (Top 5%)
• Language: Python
• Created: over 1 year ago
• Updated: over 1 year ago

Repository Details

Open LLaMA Eyes to See the World

This project aims to optimize the LLaMA model for visual information understanding, in the spirit of GPT-4, and to further explore the potential of large language models.

Generally, we use a CLIP vision encoder to extract image features, which are then projected into the text embedding dimensionality by an MLP-based or Transformer-based connection network. The visual representation (wrapped in the additional special tokens [boi] and [eoi]) is then concatenated with the text representation and learned in an autoregressive manner. The framework is similar to Kosmos-1 and PaLM-E.
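
To make the architecture concrete, here is a minimal PyTorch sketch of the projection-and-concatenation step, assuming an MLP-based connection network and pooled CLIP features. VisualConnector, its default dimensions, and the learnable boundary embeddings are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Hypothetical MLP-based connection network: projects pooled CLIP
    features into the LLaMA embedding space and wraps them with the
    special [boi] / [eoi] boundary tokens."""

    def __init__(self, clip_dim=1024, llama_dim=4096, num_tokens=10):
        super().__init__()
        self.num_tokens = num_tokens
        # Map one pooled CLIP feature to `num_tokens` LLaMA-sized embeddings
        # (matching the image sequence length of 10 in the tables below).
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim * num_tokens),
        )
        # Learnable embeddings for the [boi] and [eoi] special tokens.
        self.boi = nn.Parameter(torch.randn(llama_dim))
        self.eoi = nn.Parameter(torch.randn(llama_dim))

    def forward(self, clip_feats, text_embeds):
        # clip_feats: (batch, clip_dim); text_embeds: (batch, seq, llama_dim)
        b = clip_feats.size(0)
        img = self.proj(clip_feats).view(b, self.num_tokens, -1)
        boi = self.boi.expand(b, 1, -1)
        eoi = self.eoi.expand(b, 1, -1)
        # [boi] <image tokens> [eoi] <text tokens>, learned autoregressively.
        return torch.cat([boi, img, eoi, text_embeds], dim=1)
```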

  • Code adjustments to support multi-modal generation. Download the CLIP and LLaMA models from Hugging Face; we have verified that the scripts are also compatible with other LLaMA model sizes. Use the script preprocess.py to prepare the data.

  • Supervised training stage: freeze the LLaMA and CLIP encoder models and optimize only the connection network. In this stage, we use the COCO, CC-3M, and COYO-700M datasets with the training script train.py. The table below lists the training hyper-parameters used in our experiments on an A100 GPU (80 GB); a minimal sketch of this stage appears after this list. We also evaluate image captioning performance on the COCO test set.

    Argument                 Value
    batch size               1 * 8 * 8
    epochs                   3
    cut length               256
    learning rate            4e-3
    image sequence length    10
  • Instruction tuning stage: fine-tune the full model on a mix of VQA and language-only instruction datasets. We use the LoRA strategy to optimize the entire model with the fine-tuning script finetune.py (see the LoRA sketch after this list).

    Argument                 Value
    batch size               1024
    epochs                   3
    cut length               256
    learning rate            2e-5
    image sequence length    10
  • Open-source the trained checkpoint on Hugging Face and a Gradio interface for multi-modal generation.
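
For the supervised training stage, a minimal sketch of the freezing setup might look as follows, assuming the Hugging Face transformers library and the hypothetical VisualConnector from the sketch above; the checkpoint IDs are examples, not the project's pinned models.

```python
import torch
from transformers import CLIPVisionModel, LlamaForCausalLM

# Example checkpoints; substitute the models you downloaded.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
llama = LlamaForCausalLM.from_pretrained("huggyllama/llama-7b")

# Supervised stage: freeze LLaMA and the CLIP encoder so that only the
# connection network receives gradients.
for p in vision.parameters():
    p.requires_grad = False
for p in llama.parameters():
    p.requires_grad = False

connector = VisualConnector()  # hypothetical module from the earlier sketch
optimizer = torch.optim.AdamW(connector.parameters(), lr=4e-3)  # per the table
```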
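
The instruction tuning stage's LoRA setup could then continue from the previous sketch using the peft library; the rank, alpha, target modules, and dropout below are illustrative assumptions rather than the settings in finetune.py.

```python
import torch
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyper-parameters; finetune.py may use different values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# Wrap the LLaMA model from the previous sketch with LoRA adapters.
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()
optimizer = torch.optim.AdamW(llama.parameters(), lr=2e-5)  # per the table
```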

Reference

[1] https://github.com/facebookresearch/llama

[2] https://github.com/tloen/alpaca-lora

More Repositories

1. MLE-LLaMA: Multi-language Enhanced LLaMA (Python, 301 stars)
2. IEA: Image Editing Anything (Python, 107 stars)
3. DiS: Scalable Diffusion Models with State Space Backbone (Python, 101 stars)
4. Video-Stable-Diffusion: Generate consistent videos with stable diffusion models (Python, 45 stars)
5. Gradient-Free-Textual-Inversion: Gradient-Free Textual Inversion for Personalized Text-to-Image Generation (Python, 33 stars)
6. Stable-Edit: Text-based real image editing with stable diffusion models (Python, 25 stars)
7. Perceiver-Music-Generation: Music generation with the Perceiver AR model (Python, 24 stars)
8. DeeCap: Dynamic Early Exit for Image Captioning (Python, 16 stars)
9. Vespa: Video Diffusion State Space Models (Python, 15 stars)
10. Visual-ChatGLM: Open ChatGLM Eyes to See the World (Python, 13 stars)
11. PNAIC: Partially Non-Autoregressive Image Captioning (Python, 10 stars)
12. AIO: All In One: General Multimodal Large Language Model (Python, 9 stars)
13. Future-Caption: Efficient modeling of future context for image captioning (Python, 8 stars)
14. Meta-Ensemble: Meta-Ensemble Parameter Learning (Python, 8 stars)
15. Image-Caption-Pytorch: PyTorch implementation of an image captioning baseline model (Python, 8 stars)
16. UAIC: Uncertainty-aware image caption generation (Python, 7 stars)
17. Dialogue-System: Multi-modal dialogue system (Python, 5 stars)
18. Latent-Dynamics: Exploring latent dynamics for visual storytelling (Python, 4 stars)
19. MaskGMT: Masked generative music transformer (Python, 4 stars)
20. Matrix-Analysis-and-Application: References and coding homework for the matrix analysis and application course at UCAS (Python, 3 stars)
21. Cleaned-Webvid: Strategies to clean the WebVid-10M dataset (Python, 3 stars)
22. Diverse-Image-Caption: Promoting Coherence and Diversity in Image Captioning (Python, 3 stars)
23. Visual-MOSS: Making the MOSS model understand visual information (Python, 3 stars)
24. ACSG: Actor-Critic Sequence Generation for Relative Difference Captioning (2 stars)
25. LQMA: Language Quantized Masked AutoEncoders (Python, 2 stars)
26. DSC: Descriptive synthetic captions in DALL-E 3 (2 stars)
27. feizc (2 stars)
28. MAIC: Memory augmented image captioning (Python, 2 stars)
29. SAIC: Semi-Autoregressive Image Captioning (2 stars)
30. arXiv-MM: Multimodal dataset for arXiv (Python, 2 stars)
31. DiffuCap: Controllable Image Captioning with Diffusion Model (2 stars)
32. Union: Unifying Language-Image Pre-training via Single-Tower Transformer (Python, 2 stars)
33. AAT: Attention-Aligned Transformer for Image Captioning (Python, 2 stars)
34. CLIP-MAE: When CLIP meets MAE and beyond (Python, 2 stars)
35. Chinese-Image-Caption: An image captioner for Chinese (Python, 2 stars)
36. ViD: Text-to-Image Diffusion Models as Refined Visual Learners (Python, 1 star)
37. Meta-ViT: Meta-ensemble parameter learning for Vision Transformer (Python, 1 star)
38. ClipCap: Incorporating CLIP features into Transformer-based image captioning (Python, 1 star)
39. CLKA: Cross Lingual Knowledge Alignment for Stable Diffusion Models (Python, 1 star)
40. Diffusion-Model: A tutorial on diffusion models for text-guided image generation (Python, 1 star)
41. LLaMA-XL: LLaMA model Beyond Length Limitation (1 star)
42. GameTag: Official implementation of the GameTag algorithm (Python, 1 star)
43. MoE-MLLM: Mixture-of-Experts for Multimodal Large Language Models (Python, 1 star)