🏷️ Recognize Anything & Tag2Text

Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.

The Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.
Tag2Text is a vision-language model guided by tagging that supports captioning, retrieval, and tagging.

Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the Grounded-SAM project.

💡 Highlight of RAM

RAM is a strong image tagging model, which can recognize any common category with high accuracy.

Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
- RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
- RAM even surpasses fully supervised methods (ML-Decoder).
- RAM exhibits competitive performance with the Google tagging API.
Reproducible and affordable. RAM has a low reproduction cost, since it relies on open-source, annotation-free datasets.
Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.

(Green indicates fully supervised learning; blue indicates zero-shot performance.)

Built on the Tag2Text framework, RAM significantly improves tagging ability.

Accuracy. RAM uses a data engine to generate additional annotations and clean incorrect ones, achieving higher accuracy than Tag2Text.
Scope. RAM expands the number of fixed tags from 3,400+ to 6,400+ (4,500+ distinct semantic tags after merging synonyms), covering more valuable categories. Moreover, RAM has open-set capability, so it can recognize tags not seen during training.

🌅 Highlight of Tag2Text

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

Tagging. Tag2Text recognizes 3,400+ categories commonly used by humans, without manual annotations.
Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.

🧰 Checkpoints

	Name	Backbone	Data	Illustration	Checkpoint
1	RAM-14M	Swin-Large	COCO, VG, SBU, CC-3M, CC-12M	Provide strong image tagging ability.	Download link
2	Tag2Text-14M	Swin-Base	COCO, VG, SBU, CC-3M, CC-12M	Support comprehensive captioning and tagging.	Download link

🏃 Model Inference

Setting Up

Install recognize-anything as a package:

pip install git+https://github.com/xinyu1205/recognize-anything.git

The RAM and Tag2Text models can then be imported in other projects:

from ram.models import ram, tag2text

RAM Inference

Get the English and Chinese outputs for an image:

python inference_ram.py --image images/demo/demo1.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth

RAM Inference on Unseen Categories (Open-Set)

First, customize the recognition categories in build_openset_label_embedding, then get the tags of the image:

python inference_ram_openset.py --image images/openset_example.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth

Tag2Text Inference

Get the tagging and captioning results:

python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth

Or get the tagging results together with a caption guided by specified tags (optional):

python inference_tag2text.py --image images/demo/demo1.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"

Batch Inference and Evaluation

We release two datasets: OpenImages-common (214 seen classes) and OpenImages-rare (200 unseen classes). Copy or symlink the test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.
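The symlink step can be scripted. The following is a minimal sketch, assuming the OpenImages v6 test images sit in a flat directory of .jpg files; `link_images` and the source path in the usage comment are hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def link_images(src_dir, dst_dir, pattern="*.jpg"):
    """Symlink every file matching `pattern` in src_dir into dst_dir.

    Hypothetical helper, not part of this repository. Entries that already
    exist in dst_dir are left untouched. Returns the number of links created.
    """
    src = Path(src_dir).resolve()
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    created = 0
    for image in sorted(src.glob(pattern)):
        target = dst / image.name
        if target.exists() or target.is_symlink():
            continue  # skip files that were already copied or linked
        os.symlink(image, target)
        created += 1
    return created


# Example usage (the source path is a placeholder):
# link_images("/path/to/openimages_v6/test", "datasets/openimages_common_214/imgs")
# link_images("/path/to/openimages_v6/test", "datasets/openimages_rare_200/imgs")
```

Linking the whole test split into both directories is the simplest option; copying only the images each dataset needs would save directory entries but requires reading the dataset's image list first.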

To evaluate RAM on OpenImages-common:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/ram

To evaluate RAM's open-set capability on OpenImages-rare:

python batch_inference.py \
  --model-type ram \
  --checkpoint pretrained/ram_swin_large_14m.pth \
  --open-set \
  --dataset openimages_rare_200 \
  --output-dir outputs/ram_openset

To evaluate Tag2Text on OpenImages-common:

python batch_inference.py \
  --model-type tag2text \
  --checkpoint pretrained/tag2text_swin_14m.pth \
  --dataset openimages_common_214 \
  --output-dir outputs/tag2text

Please refer to batch_inference.py for more options. To reproduce the precision/recall (P/R) in Table 3 of our paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
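To make the role of the threshold concrete, here is a minimal sketch of how a score threshold turns per-tag confidences into predicted tags, and how precision and recall follow from them. It is illustrative only: the function names, the toy scores, and the example thresholds are assumptions, and batch_inference.py may aggregate its metrics differently (for example, by averaging per class).

```python
def predict_tags(scores, threshold):
    """Return the set of tags whose score is at least `threshold`."""
    return {tag for tag, score in scores.items() if score >= threshold}


def precision_recall(predicted, ground_truth):
    """Compute precision and recall for one image's predicted and true tag sets."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Toy confidences for a single image (made up for illustration).
scores = {"dog": 0.95, "grass": 0.92, "ball": 0.70, "cat": 0.55}
ground_truth = {"dog", "grass", "ball"}

for threshold in (0.5, 0.9):
    predicted = predict_tags(scores, threshold)
    p, r = precision_recall(predicted, ground_truth)
    print(f"threshold={threshold}: tags={sorted(predicted)} P={p:.2f} R={r:.2f}")
```

In this toy case the lower threshold keeps the wrong tag "cat" (lower precision, full recall), while the higher threshold drops both "cat" and the correct tag "ball" (full precision, lower recall).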

To run batch inference on custom images, you can set up your own datasets following the structure of the two given datasets.

🏌️ Model Training/Finetuning

Tag2Text

At present, we can open-source only the forward function of Tag2Text. To train or finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of BLIP and make the following modifications:

Replace the "models/blip.py" file with the current "tag2text.py" model file;
Extend the original dataloader to load tags as an additional input.
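The dataloader modification can be sketched as a thin wrapper around an existing map-style dataset. This is an illustration under stated assumptions, not the authors' training code: `TaggedDataset`, the `(image, caption)` sample layout, the `tags_per_sample` input, and the multi-hot encoding are all hypothetical choices to adapt to the BLIP dataloader actually in use.

```python
class TaggedDataset:
    """Wrap a map-style dataset so each sample also carries a multi-hot tag vector.

    Hypothetical helper. `base[i]` is an (image, caption) pair,
    `tags_per_sample[i]` lists the tag strings for sample i, and
    `tag_list` is the fixed tag vocabulary.
    """

    def __init__(self, base, tags_per_sample, tag_list):
        if len(base) != len(tags_per_sample):
            raise ValueError("base dataset and tag annotations differ in length")
        self.base = base
        self.tags_per_sample = tags_per_sample
        self.tag_to_index = {tag: i for i, tag in enumerate(tag_list)}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        image, caption = self.base[index]
        multi_hot = [0.0] * len(self.tag_to_index)
        for tag in self.tags_per_sample[index]:
            position = self.tag_to_index.get(tag)
            if position is not None:  # ignore tags outside the vocabulary
                multi_hot[position] = 1.0
        return image, caption, multi_hot
```

In a PyTorch pipeline, the multi-hot list would typically be converted to a tensor (for example with torch.tensor) before batching, so that each batch carries a tag matrix alongside the images and captions.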

RAM

The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal process.

✒️ Citation

If you find our work useful for your research, please consider citing:

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}

♥️ Acknowledgements

This work was done with the help of the amazing codebase of BLIP. Thanks very much!

We want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.

We also want to thank Ask-Anything and Prompt-can-anything for integrating RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.
