
natural-language-joint-query-search

Search photos on Unsplash based on OpenAI's CLIP model, with support for joint image+text queries and attention visualization.

In this project, we support multiple types of query search, including text-to-image, image-to-image, text+text-to-image, and image+text-to-image. To help analyze the retrieved images, we also support visualization of text attention. Attention visualization for images will be supported soon!

Colab Demo

Search photos on Unsplash, with support for joint image+text query search.

Open In Colab

Attention visualization of CLIP.

Open In Colab

Usage

We follow the same environment as the CLIP project:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
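
To double-check the environment, an optional quick sanity check:

import torch

print(torch.__version__)          # expect 1.7.1
print(torch.cuda.is_available())  # True if a CUDA 11.0-compatible GPU and driver are present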

To visualize the attention of CLIP, we slightly modify the CLIP code as mentioned here, so you don't have to install CLIP via the official command. An open-source visualization tool is used in this project; you need to clone it into this repo:

$ git clone https://github.com/shashwattrivedi/Attention_visualizer.git

Download the pre-extracted image IDs and features of the Unsplash dataset from Google Drive, or just run the following commands, and put them under the unsplash-dataset directory. Details can be found in the natural-language-image-search project.

from pathlib import Path

# Create a folder for the precomputed features
!mkdir unsplash-dataset

# Download from Github Releases
if not Path('unsplash-dataset/photo_ids.csv').exists():
  !wget https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0/photo_ids.csv -O unsplash-dataset/photo_ids.csv

if not Path('unsplash-dataset/features.npy').exists():
  !wget https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0/features.npy -O unsplash-dataset/features.npy
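
If you are running outside a notebook, where the ! shell commands above are unavailable, a minimal pure-Python equivalent (assuming the same release URLs) is:

from pathlib import Path
from urllib.request import urlretrieve

# Same release assets as the wget commands above
BASE = "https://github.com/haltakov/natural-language-image-search/releases/download/1.0.0"
Path("unsplash-dataset").mkdir(exist_ok=True)

for name in ["photo_ids.csv", "features.npy"]:
    target = Path("unsplash-dataset") / name
    if not target.exists():
        urlretrieve(f"{BASE}/{name}", str(target))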

An example of joint query search:

import torch
import numpy as np
import pandas as pd
from PIL import Image

from CLIP.clip import clip

def encode_search_query(search_query):
    # Encode the query with the (modified) CLIP text encoder and L2-normalize it
    with torch.no_grad():
        text_encoded, weight = model.encode_text(clip.tokenize(search_query).to(device))
        text_encoded /= text_encoded.norm(dim=-1, keepdim=True)
        return text_encoded.cpu().numpy()

def find_best_matches(text_features, photo_features, photo_ids, results_count):
    # Features are normalized, so the dot product is the cosine similarity;
    # sort in descending order and keep the top results_count photo IDs
    similarities = (photo_features @ text_features.T).squeeze(1)
    best_photo_idx = (-similarities).argsort()
    return [photo_ids[i] for i in best_photo_idx[:results_count]]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

photo_ids = pd.read_csv("unsplash-dataset/photo_ids.csv")
photo_ids = list(photo_ids['photo_id'])
photo_features = np.load("unsplash-dataset/features.npy")

# text to image
search_query = "Tokyo Tower at night."
text_features = encode_search_query(search_query)
best_photo_ids = find_best_matches(text_features, photo_features, photo_ids, 5)
for photo_id in best_photo_ids:
  print("https://unsplash.com/photos/{}/download".format(photo_id))

# image to image
source_image = "images/borna-hrzina-8IPrifbjo-0-unsplash.jpg"
with torch.no_grad():
  image_feature = model.encode_image(preprocess(Image.open(source_image)).unsqueeze(0).to(device))
  image_feature = (image_feature / image_feature.norm(dim=-1, keepdim=True)).cpu().numpy()
best_photo_ids = find_best_matches(image_feature, photo_features, photo_ids, 5)
for photo_id in best_photo_ids:
  print("https://unsplash.com/photos/{}/download".format(photo_id))

# text+text to image
search_query = "red flower"
search_query_extra = "blue sky"
text_features = encode_search_query(search_query)
text_features_extra = encode_search_query(search_query_extra)
mixed_features = text_features + text_features_extra
best_photo_ids = find_best_matches(mixed_features, photo_features, photo_ids, 5)
for photo_id in best_photo_ids:
  print("https://unsplash.com/photos/{}/download".format(photo_id))

# image+text to image
search_image = "images/borna-hrzina-8IPrifbjo-0-unsplash.jpg"
search_text = "cars"
with torch.no_grad():
  image_feature = model.encode_image(preprocess(Image.open(search_image)).unsqueeze(0).to(device))
  image_feature = (image_feature / image_feature.norm(dim=-1, keepdim=True)).cpu().numpy()
text_feature = encode_search_query(search_text)
modified_feature = image_feature + text_feature
best_photo_ids = find_best_matches(modified_feature, photo_features, photo_ids, 5)
for photo_id in best_photo_ids:
  print("https://unsplash.com/photos/{}/download".format(photo_id))

An example of CLIP attention visualization. It shows which keywords CLIP uses to retrieve the results. For convenience, all punctuation is removed.

import torch
import numpy as np
import pandas as pd
from PIL import Image

from CLIP.clip import clip
from CLIP.clip import model

from Attention_visualizer.attention_visualizer import *

def find_best_matches(text_features, photo_features, photo_ids, results_count):
    # Same ranking helper as in the joint query search example
    similarities = (photo_features @ text_features.T).squeeze(1)
    best_photo_idx = (-similarities).argsort()
    return [photo_ids[i] for i in best_photo_idx[:results_count]]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

photo_ids = pd.read_csv("unsplash-dataset/photo_ids.csv")
photo_ids = list(photo_ids['photo_id'])
photo_features = np.load("unsplash-dataset/features.npy")

search_query = "A red flower is under the blue sky and there is a bee on the flower"

with torch.no_grad():
    text_token = clip.tokenize(search_query).to(device)
    text_encoded, weight = model.encode_text(text_token)
    text_encoded /= text_encoded.norm(dim=-1, keepdim=True)

text_features = text_encoded.cpu().numpy()
best_photo_ids = find_best_matches(text_features, photo_features, photo_ids, 5)

for photo_id in best_photo_ids:
  print("https://unsplash.com/photos/{}/download".format(photo_id))

# Attention from the last layer at the end-of-text token; the slicing keeps
# only the word positions (dropping the start-of-text and end-of-text tokens)
sentence = search_query.split(" ")
attention_weights = list(weight[-1][0][1+len(sentence)].cpu().numpy())[:2+len(sentence)][1:][:-1]
attention_weights = [float(item) for item in attention_weights]
display_attention(sentence, attention_weights)
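
If you only need the raw numbers rather than the rendered visualization, a minimal sketch using the sentence and attention_weights computed above:

# Print each word with its attention weight, strongest first
for word, w in sorted(zip(sentence, attention_weights), key=lambda pair: -pair[1]):
    print(f"{word}: {w:.4f}")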

You can also run these examples on Colab via joint-query-search and clip-attention.

Examples

Text-to-Image

"Tokyo tower at night."

Search results for "Tokyo tower at night."

"People come and go on the street."

Search results for "People come and go on the street."

Image-to-Image

A normal street view. (The left side is the source image)

Search results for a street view image

Text+Text-to-Image

"Flower" + "Blue sky"

Search results for "flower" and "blue sky"

"Flower" + "Bee"

Search results for "flower" and "bee"

Image+Text-to-Image

A normal street view + "cars"

Search results for an empty street with query "cars"

Visualization

"A woman holding an umbrella standing next to a man in a rainy day"

Search results for "A woman holding an umbrella standing next to a man in a rainy day"

"umbrella", "standing" and "rainy" receive the most of attention.

"A red flower is under the blue sky and there is a bee on the flower"

Search results for "A red flower is under the blue sky and there is a bee on the flower"

"flower", "sky" and "bee" receive the most of attention.

Acknowledgements

Search photos on Unsplash using natural language descriptions. The search is powered by OpenAI's CLIP model and the Unsplash Dataset. This project is mostly based on natural-language-image-search.

This project was also inspired by several related projects.
