• Stars: 149
• Rank: 248,619 (Top 5%)
• Language: Python
• License: MIT License
• Created: about 3 years ago
• Updated: over 2 years ago

Repository Details

Concept Modeling: Topic Modeling on Images and Text

Concept

Concept is a technique that leverages CLIP and BERTopic-based techniques to perform Concept Modeling on images.

Since topics are part of conversations and text, they do not represent the context of images well. Therefore, these clusters of images are referred to as 'Concepts' instead of the traditional 'Topics'.

Thus, Concept Modeling takes inspiration from topic modeling techniques to cluster images, find common concepts and model them both visually using images and textually using topic representations.
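The idea of describing image clusters with words can be illustrated with a small, self-contained sketch. It is a simplification and not the library's actual implementation: the 2-D vectors below are made-up stand-ins for CLIP embeddings, the cluster assignments are given rather than learned, and the helper names (`cosine`, `centroid`, `label_clusters`) are hypothetical. Each cluster is labeled with the vocabulary words whose embeddings lie closest to the cluster centroid.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def label_clusters(image_vecs, cluster_ids, word_vecs, top_n=2):
    """Map each cluster id to the top_n words closest to its centroid."""
    # Group the image vectors by their cluster id.
    clusters = {}
    for vec, cid in zip(image_vecs, cluster_ids):
        clusters.setdefault(cid, []).append(vec)

    labels = {}
    for cid, vecs in clusters.items():
        center = centroid(vecs)
        # Rank all words by similarity to the cluster centroid.
        ranked = sorted(
            word_vecs,
            key=lambda word: cosine(center, word_vecs[word]),
            reverse=True,
        )
        labels[cid] = ranked[:top_n]
    return labels


# Made-up 2-D embeddings standing in for a shared image/text space.
image_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
cluster_ids = [0, 0, 1, 1]
word_vecs = {
    "beach": [1.0, 0.0],
    "sand": [0.9, 0.3],
    "forest": [0.0, 1.0],
    "tree": [0.2, 0.9],
}

labels = label_clusters(image_vecs, cluster_ids, word_vecs)
print(labels)
```

Because images and words live in the same embedding space, the same similarity measure can compare an image cluster with a word, which is what makes a textual representation of a visual cluster possible.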

Installation

Installation, which includes sentence-transformers, can be done from PyPI:

pip install concept

Quick Start

First, we need to download and extract the 25,000 Unsplash images used in the sentence-transformers example:

import os
import glob
import zipfile
from tqdm import tqdm
from sentence_transformers import util

# 25k images from Unsplash
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)
    
    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):  # Download the dataset if it does not exist
        util.http_get('http://sbert.net/datasets/' + photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

# Collect the paths of all extracted JPEG images
img_names = glob.glob(os.path.join(img_folder, '*.jpg'))

Next, we only need to pass images to Concept:

from concept import ConceptModel
concept_model = ConceptModel()
concepts = concept_model.fit_transform(img_names)
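Before visualizing anything, it can help to check how many images ended up in each concept. The sketch below makes two assumptions that the text above does not confirm: that `concepts` is a sequence with one integer label per image, and that `-1` marks images left unassigned, as in HDBSCAN-style clustering. The helper `concept_sizes` is hypothetical and is shown on a toy list of labels.

```python
from collections import Counter


def concept_sizes(concepts, outlier_label=-1):
    """Return ([(concept, size), ...] sorted by size, number of outliers).

    Assumes one integer label per image and that `outlier_label`
    marks images that were not assigned to any concept.
    """
    counts = Counter(concepts)
    n_outliers = counts.pop(outlier_label, 0)
    return counts.most_common(), n_outliers


# Toy labels standing in for the output of fit_transform.
example_concepts = [0, 1, 0, -1, 2, 0, 1, -1]
sizes, n_outliers = concept_sizes(example_concepts)
print(sizes, n_outliers)
```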

The resulting concepts can be visualized by calling concept_model.visualize_concepts().

However, to get the full experience, we need to label the concept clusters with topics. To do this, we need to create a vocabulary. We are going to feed our model 50,000 nouns from the English vocabulary:

import random
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet as wn

all_nouns = [word for synset in wn.all_synsets('n') for word in synset.lemma_names() if "_" not in word]
selected_nouns = random.sample(all_nouns, 50_000)
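Two details of the sampling step are worth noting: the same lemma name can appear in more than one synset, so the list may contain duplicates, and an unseeded `random.sample` gives a different vocabulary on every run. A minimal sketch of a deduplicated, reproducible variant follows. The helper `sample_vocabulary` and its `seed` parameter are hypothetical and not part of Concept or NLTK, and the toy word list stands in for `all_nouns`.

```python
import random


def sample_vocabulary(words, k, seed=42):
    """Sample up to k distinct words reproducibly.

    Duplicates are removed first; sorting makes the result independent
    of set iteration order, so the same seed gives the same sample.
    """
    unique_words = sorted(set(words))
    k = min(k, len(unique_words))  # avoid ValueError when k is too large
    return random.Random(seed).sample(unique_words, k)


# Toy list standing in for the WordNet nouns.
toy_nouns = ["dog", "cat", "dog", "tree", "cat"]
vocabulary = sample_vocabulary(toy_nouns, k=2)
print(vocabulary)
```

With the real data, this would be called as `sample_vocabulary(all_nouns, 50_000)` in place of the plain `random.sample` call.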

Then, we can pass in the resulting selected_nouns to Concept:

from concept import ConceptModel

concept_model = ConceptModel()
concepts = concept_model.fit_transform(img_names, docs=selected_nouns)

Again, the resulting concepts can be visualized by calling concept_model.visualize_concepts(). This time, however, the generated topics are shown alongside the images.

NOTE: Use ConceptModel(embedding_model="clip-ViT-B-32-multilingual-v1") to select a model that supports 50+ languages.

Search Concepts

We can quickly search for specific concepts by embedding a search term and finding the cluster embeddings that best represent them. As an example, let us search for the term beach and see what we can find. To do this, we simply run the following:

>>> concept_model.find_concepts("beach")
[(100, 0.277577825349102),
 (53, 0.27431058773894657),
 (95, 0.25973751319723837),
 (77, 0.2560122597417548),
 (97, 0.25361988261846297)]

Each tuple contains two values: the first is the concept cluster and the second is its similarity to the search term. The five most similar concepts are returned.
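The ranking behind such a search can be sketched in a few lines. This is an illustration under stated assumptions, not the code that `find_concepts` actually runs: the 2-D vectors are made-up stand-ins for the embedded search term and the cluster embeddings, and `find_similar` is a hypothetical helper. It scores every cluster by cosine similarity to the query and keeps the best matches as `(cluster, similarity)` tuples.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(query_vec, cluster_vecs, top_n=5):
    """Return the top_n (cluster_id, similarity) pairs, best match first."""
    scores = [(cid, cosine(query_vec, vec)) for cid, vec in cluster_vecs.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up cluster embeddings keyed by cluster id.
cluster_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
query_vec = [0.9, 0.1]  # stand-in for the embedded search term

results = find_similar(query_vec, cluster_vecs, top_n=2)
print(results)
```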

Now, let us visualize those concepts to see how well the search function works:

concept_model.visualize_concepts(concepts=[100, 53, 95, 77, 97])

More Repositories

1. BERTopic (Python, 4,444 stars): Leveraging BERT and c-TF-IDF to create easily interpretable topics.
2. KeyBERT (Python, 2,474 stars): Minimal keyword extraction with BERT
3. PolyFuzz (Python, 649 stars): Fuzzy string matching, grouping, and evaluation.
4. soan (Python, 124 stars): Social Analysis based on Whatsapp data
5. cTFIDF (Python, 67 stars): Creating class-based TF-IDF matrices
6. ML-API (Jupyter Notebook, 63 stars): Guide on creating an API for serving your ML model
7. Projects (Jupyter Notebook, 63 stars): Data Science Portfolio
8. ReinLife (Python, 56 stars): Creating Artificial Life with Reinforcement Learning
9. CustomerSegmentation (Jupyter Notebook, 56 stars): Analysis for Customer Segmentation
10. streamlit_guide (Python, 47 stars): A guide on creating and deploying your Streamlit application to Heroku
11. feature-engineering (Jupyter Notebook, 47 stars): Tips for Advanced Feature Engineering
12. BERTopic_evaluation (Python, 40 stars): Code and experiments for *BERTopic: Neural topic modeling with a class-based TF-IDF procedure*
13. boardgame (Jupyter Notebook, 20 stars): Heroku app to explore boardgame data
14. UnitTesting (Python, 18 stars): Guide for applying Unit Testing in data-driven projects
15. Sprite-Generator (Jupyter Notebook, 15 stars): Python procedural sprite generator
16. VLAC (Jupyter Notebook, 10 stars): Vectors of Locally Aggregated Concepts
17. Reviewer (Jupyter Notebook, 7 stars): Tool for extracting and analyzing IMDB reviews
18. InterpretableML (Jupyter Notebook, 7 stars): My analyses for interpretable Machine Learning
19. validation (Jupyter Notebook, 6 stars): Overview of validation techniques
20. ReinforcementLearning (Jupyter Notebook, 4 stars): Train SOTA RL-algorithms using Stable Baselines and Gym
21. cars_dashboard (Python, 3 stars): Dashboard for the cars dataset
22. MaartenGr (Python, 3 stars)
23. PotholeDetection (Jupyter Notebook, 2 stars): Detection of Potholes in Images
24. fitbit (Jupyter Notebook, 1 star): Analysis of my FitBit data
25. DisneyTournament (Jupyter Notebook, 1 star): Statistically Generated Disney Tournament Bracket
26. BoardGames (Jupyter Notebook, 1 star): Analysis of board game matches