Recognize Anything & Tag2Text
Official PyTorch Implementation of Recognize Anything: A Strong Image Tagging Model and Tag2Text: Guiding Vision-Language Model via Image Tagging.
- The Recognize Anything Model (RAM) is an image tagging model that can recognize any common category with high accuracy.
- Tag2Text is a vision-language model guided by tagging, which supports captioning, retrieval, and tagging.
Both Tag2Text and RAM exhibit strong recognition ability. We have combined Tag2Text and RAM with localization models (Grounding-DINO and SAM) to build a strong visual semantic analysis pipeline in the Grounded-SAM project.
Highlight of RAM
RAM is a strong image tagging model that can recognize any common category with high accuracy.
- Strong and general. RAM exhibits exceptional image tagging capabilities with powerful zero-shot generalization:
  - RAM showcases impressive zero-shot performance, significantly outperforming CLIP and BLIP.
  - RAM even surpasses fully supervised models (ML-Decoder).
  - RAM is competitive with the Google tagging API.
- Reproducible and affordable. RAM has a low reproduction cost, relying on open-source and annotation-free datasets.
- Flexible and versatile. RAM offers remarkable flexibility, catering to various application scenarios.
(Comparison figure legend: green denotes fully supervised learning; blue denotes zero-shot performance.)
RAM significantly improves the tagging ability of the Tag2Text framework.
- Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, achieving higher accuracy than Tag2Text.
- Scope. RAM expands the number of fixed tags from 3,400+ to 6,400+ (reduced to 4,500+ distinct semantic tags after merging synonyms), covering more valuable categories. Moreover, RAM is equipped with open-set capability and can recognize tags not seen during training.
Highlight of Tag2Text
Tag2Text is an efficient and controllable vision-language model with tagging guidance.
- Tagging. Tag2Text recognizes 3,400+ commonly used categories without manual annotations.
- Captioning. Tag2Text integrates tag information into text generation as guiding elements, resulting in more controllable and comprehensive descriptions.
- Retrieval. Tag2Text provides tags as additional visible alignment indicators for image-text retrieval.
TODO
- Release Tag2Text demo.
- Release checkpoints.
- Release inference code.
- Release RAM demo and checkpoints.
- Release training codes.
- Release training datasets.
Checkpoints
|   | Name | Backbone | Data | Illustration | Checkpoint |
|---|------|----------|------|--------------|------------|
| 1 | RAM-14M | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Provides strong image tagging ability. | Download link |
| 2 | Tag2Text-14M | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Supports comprehensive captioning and tagging. | Download link |
Model Inference
Setting Up
- Install recognize-anything as a package:
pip install git+https://github.com/xinyu1205/recognize-anything.git
Then the RAM and Tag2Text models can be imported in other projects:
from ram.models import ram, tag2text
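As a quick sanity check, here is a minimal Python sketch of tagging a single image through the package. The `get_transform`/`inference_ram` helpers and the `ram(...)` constructor arguments are assumptions based on how `inference_ram.py` is typically wired; consult that script for the exact API.

```python
import torch
from PIL import Image

from ram.models import ram
# Assumed helper imports; if they are not exported at the package root,
# the equivalent logic lives in inference_ram.py.
from ram import get_transform, inference_ram

device = "cuda" if torch.cuda.is_available() else "cpu"

# image_size=384 and vit="swin_l" mirror the RAM-14M checkpoint (assumed defaults).
transform = get_transform(image_size=384)
model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

# inference_ram is assumed to return the English and Chinese tag strings.
tags_en, tags_zh = inference_ram(image, model)
print("English tags:", tags_en)
print("Chinese tags:", tags_zh)
```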
RAM Inference
Get the English and Chinese outputs of the images:
python inference_ram.py --image images/demo/demo1.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
RAM Inference on Unseen Categories (Open-Set)
First, customize the recognition categories in build_openset_label_embedding, then get the tags of the images:
python inference_ram_openset.py --image images/openset_example.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
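Below is a hedged sketch of what the open-set customization could look like in Python, assuming build_openset_label_embedding can be imported from the package and accepts a category list; if it cannot, edit the category list inside that function directly, as described above. The model attribute names mirror what inference_ram_openset.py appears to do and are assumptions.

```python
import numpy as np
import torch
from torch import nn

from ram.models import ram
# Assumed import path for the open-set helper; it may live elsewhere in the package.
from ram.utils import build_openset_label_embedding

# Hypothetical custom categories that are not in the fixed tag list.
custom_categories = ["fire hydrant cap", "solar panel array", "bicycle bell"]

model = ram(pretrained="pretrained/ram_swin_large_14m.pth",
            image_size=384, vit="swin_l")

# Build text embeddings for the custom categories and swap them into the model.
# The attribute names below follow inference_ram_openset.py and are assumptions.
label_embed, categories = build_openset_label_embedding(custom_categories)
model.tag_list = np.array(categories)
model.label_embed = nn.Parameter(label_embed.float())
model.num_class = len(categories)
# One global threshold for unseen tags; the value 0.5 is illustrative.
model.class_threshold = torch.ones(model.num_class) * 0.5
model.eval()
```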
Tag2Text Inference
Get the tagging and captioning results:
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth
Or get the tagging and specified captioning results (optional):
python inference_tag2text.py --image images/demo/demo1.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
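For Python usage, a minimal sketch of Tag2Text captioning with user-specified tags is shown below; the inference_tag2text helper and its input_tag keyword are assumptions, so check inference_tag2text.py for the exact call.

```python
import torch
from PIL import Image

from ram.models import tag2text
# Assumed helper imports, mirroring the RAM sketch above.
from ram import get_transform, inference_tag2text

device = "cuda" if torch.cuda.is_available() else "cpu"
transform = get_transform(image_size=384)

# vit="swin_b" matches the Tag2Text-14M checkpoint backbone (constructor args assumed).
model = tag2text(pretrained="pretrained/tag2text_swin_14m.pth",
                 image_size=384, vit="swin_b").eval().to(device)

image = transform(Image.open("images/demo/demo1.jpg")).unsqueeze(0).to(device)

# Guide the caption with user-specified tags; the keyword name is an assumption.
result = inference_tag2text(image, model, input_tag="cloud,sky")
print(result)  # identified tags and the generated caption
```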
Batch Inference and Evaluation
We release two datasets: OpenImages-common (214 seen classes) and OpenImages-rare (200 unseen classes). Copy or sym-link test images of OpenImages v6 to datasets/openimages_common_214/imgs/ and datasets/openimages_rare_200/imgs/.
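For example, a small sketch (the location of your local OpenImages v6 test split is illustrative) that symlinks the test images into the two expected folders:

```python
from pathlib import Path

# Illustrative location of a local copy of the OpenImages v6 test split.
openimages_test = Path("/data/openimages_v6/test")

for dataset in ("openimages_common_214", "openimages_rare_200"):
    img_dir = Path("datasets") / dataset / "imgs"
    img_dir.mkdir(parents=True, exist_ok=True)
    for src in openimages_test.glob("*.jpg"):
        dst = img_dir / src.name
        if not dst.exists():
            dst.symlink_to(src.resolve())
```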
To evaluate RAM on OpenImages-common:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/ram
To evaluate the open-set capability of RAM on OpenImages-rare:
python batch_inference.py \
--model-type ram \
--checkpoint pretrained/ram_swin_large_14m.pth \
--open-set \
--dataset openimages_rare_200 \
--output-dir outputs/ram_openset
To evaluate Tag2Text on OpenImages-common:
python batch_inference.py \
--model-type tag2text \
--checkpoint pretrained/tag2text_swin_14m.pth \
--dataset openimages_common_214 \
--output-dir outputs/tag2text
Please refer to batch_inference.py for more options. To get the P/R values reported in Table 3 of our paper, pass --threshold=0.86 for RAM and --threshold=0.68 for Tag2Text.
To run batch inference on custom images, you can set up your own dataset following the two given datasets.
Model Training/Finetuning
Tag2Text
At present, we can only open-source the forward function of Tag2Text. To train/finetune Tag2Text on a custom dataset, you can refer to the complete training codebase of BLIP and make the following modifications:
- Replace the "models/blip.py" file with the current "tag2text.py" model file;
- Load additional tags on top of the original dataloader (see the sketch below).
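As a rough illustration of the second point, here is a hedged sketch of a BLIP-style dataset that returns parsed tags alongside each image-caption pair. The annotation schema, field names, and multi-hot tag format are assumptions; align them with the inputs expected by tag2text.py.

```python
import json

import torch
from PIL import Image
from torch.utils.data import Dataset


class TaggedCaptionDataset(Dataset):
    """Image-caption pairs plus multi-hot tag labels (illustrative schema)."""

    def __init__(self, ann_file, tag_list_file, transform):
        # Assumed annotation format: {"image": path, "caption": str, "tags": [str, ...]}
        self.annotations = json.load(open(ann_file))
        self.tag_list = [line.strip() for line in open(tag_list_file)]
        self.tag_index = {tag: i for i, tag in enumerate(self.tag_list)}
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        ann = self.annotations[idx]
        image = self.transform(Image.open(ann["image"]).convert("RGB"))

        # Multi-hot tag vector used as the extra tagging supervision.
        tags = torch.zeros(len(self.tag_list))
        for tag in ann["tags"]:
            if tag in self.tag_index:
                tags[self.tag_index[tag]] = 1.0

        return image, ann["caption"], tags
```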
RAM
The training code of RAM cannot be open-sourced for now, as it is still going through the company's internal process.
Citation
If you find our work useful for your research, please consider citing:
@article{zhang2023recognize,
title={Recognize Anything: A Strong Image Tagging Model},
author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
journal={arXiv preprint arXiv:2306.03514},
year={2023}
}
@article{huang2023tag2text,
title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
journal={arXiv preprint arXiv:2303.05657},
year={2023}
}
Acknowledgements
This work is done with the help of the amazing codebase of BLIP, thanks very much!
We want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying RAM/Tag2Text with Grounded-SAM.
We also want to thank Ask-Anything and Prompt-can-anything for combining RAM/Tag2Text, which greatly expands the application boundaries of RAM/Tag2Text.