Segment Anything with CLIP
[HuggingFace Space] | [COLAB] | [Demo Video]
Meta released a new foundation model for segmentation tasks. It aims to solve downstream segmentation tasks through prompt engineering, with prompts such as foreground/background points, bounding boxes, masks, and free-form text. However, the text prompt has not been released yet.
As a workaround, I took the following steps:
- Get all object proposals generated by SAM (Segment Anything Model).
- Crop the object regions by bounding boxes.
- Get cropped images' features and a query feature from CLIP.
- Calculate the similarity between image features and the query feature.
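Steps 1 and 2 depend on turning each proposal's bounding box into a crop region. The following is a minimal sketch, not the code used in this repository. It assumes each proposal is a dict with a `bbox` entry in `[x, y, w, h]` order, which is how SAM's `SamAutomaticMaskGenerator` reports boxes. The helper names `xywh_to_crop_box` and `crop_boxes`, and the sample proposals, are hypothetical.

```python
def xywh_to_crop_box(bbox, image_width, image_height):
    """Convert an [x, y, w, h] box to (left, upper, right, lower), clamped to the image."""
    x, y, w, h = bbox
    left = max(0, int(x))
    upper = max(0, int(y))
    right = min(image_width, int(x + w))
    lower = min(image_height, int(y + h))
    return (left, upper, right, lower)


def crop_boxes(proposals, image_width, image_height):
    """Return one crop box per proposal, skipping boxes that have no area."""
    boxes = []
    for proposal in proposals:
        box = xywh_to_crop_box(proposal["bbox"], image_width, image_height)
        left, upper, right, lower = box
        if right > left and lower > upper:
            boxes.append(box)
    return boxes


# Hypothetical proposals shaped like the mask generator's output (only "bbox" is used).
proposals = [
    {"bbox": [10, 20, 30, 40]},
    {"bbox": [90, 90, 50, 50]},  # extends past a 100x100 image, so it gets clamped
    {"bbox": [5, 5, 0, 10]},     # zero width, so it gets skipped
]
boxes = crop_boxes(proposals, image_width=100, image_height=100)
print(boxes)
```

Each resulting tuple follows the `(left, upper, right, lower)` order that PIL's `Image.crop` expects, so it can be passed to `image.crop(box)` directly.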
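# How to get the similarity.
# `model` and `preprocess` are assumed to come from clip.load(...); `crop` is one cropped region and `texts` is the list of query strings.
preprocessed_img = preprocess(crop).unsqueeze(0)  # add a batch dimension
tokens = clip.tokenize(texts)
logits_per_image, _ = model(preprocessed_img, tokens)  # image-to-text logits
similarity = logits_per_image.softmax(-1)  # normalize over the query texts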
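To make the similarity step more concrete, here is a rough pure-Python sketch of the arithmetic behind CLIP-style logits followed by a softmax: scaled cosine similarity between one image feature and each text feature, normalized across the texts. The function names, the toy 3-dimensional features, and `scale=100.0` (standing in for CLIP's learned logit scale) are assumptions for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def query_probabilities(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each text."""
    image_unit = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_unit = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)  # assumed scale, see the note above
    return softmax(logits)


# Made-up features: the image is closest to the first text.
image_feature = [0.9, 0.1, 0.0]
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = query_probabilities(image_feature, text_features)
print(probs)
```

Because the softmax runs across the query texts, a list containing a single text always yields a probability of 1.0. The score only becomes informative when at least one contrasting text is included.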
How to run locally
Anaconda is required before starting the setup.
make env
conda activate segment-anything-with-clip
make setup
# This starts the Gradio server.
make run
Follow-up Works
- Fast Segment Everything: A re-implementation of the Everything algorithm in an iterative manner, which is better suited to CPU-only environments. It shows results comparable to the original Everything with roughly 1/5 the number of inferences (e.g. 1024 vs. 200), and it takes under 10 seconds to search for masks on a "CPU upgrade" instance (8 vCPU, 32GB RAM) of a Hugging Face Space.
- Fast Segment Everything with Text Prompt: This example, based on Fast-Segment-Everything, provides a text prompt that generates an attention map for the area you want to focus on.
- Fast Segment Everything with Image Prompt: This example, based on Fast-Segment-Everything, provides an image prompt that generates an attention map for the area you want to focus on.
- Fast Segment Everything with Drawing Prompt: This example, based on Fast-Segment-Everything, provides a drawing prompt that generates an attention map for the area you want to focus on.