
ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

🚨 Announcements

  • (1/29/23) We have merged a new index updater feature and support for additional Hugging Face models! These are in beta so please give us feedback as you try them out.
  • (1/24/23) If you're looking for the DSP framework for composing ColBERTv2 and LLMs, it's at: https://github.com/stanfordnlp/dsp

ColBERT (v2)

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23).
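For intuition, here is a minimal PyTorch sketch of the MaxSim scoring step for a single query-passage pair (illustrative only; the tensor shapes and the function name are assumptions, not the repository's actual implementation):

import torch

def late_interaction_score(query_embs, passage_embs):
    # query_embs:   (num_query_tokens, dim) L2-normalized token embeddings
    # passage_embs: (num_passage_tokens, dim) L2-normalized token embeddings
    sim = query_embs @ passage_embs.T            # cosine similarity of every token pair
    max_per_query_token = sim.max(dim=1).values  # MaxSim: best passage token per query token
    return max_per_query_token.sum()             # sum over query tokens gives the passage score

q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
p = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(late_interaction_score(q, p))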


ColBERTv1

The ColBERTv1 code from the SIGIR'20 paper is in the colbertv1 branch. See the Branches section below for more information on other branches.

Installation

ColBERT requires Python 3.7+ and PyTorch 1.9+ and uses the Hugging Face Transformers library.

We strongly recommend creating a conda environment using the commands below. (If you don't have conda, follow the official conda installation guide.)

We have also included a new environment file for CPU-only environments (conda_env_cpu.yml). If you are testing CPU execution on a machine that includes GPUs, you may need to specify CUDA_VISIBLE_DEVICES="" as part of your command. Note that a GPU is still required for training and indexing.

conda env create -f conda_env[_cpu].yml
conda activate colbert
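As a quick post-install sanity check, you can verify that PyTorch sees (or, for the CPU-only setup, does not see) a GPU. This is a minimal sketch; hiding GPUs from inside Python is just one way to apply the CUDA_VISIBLE_DEVICES note above:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide GPUs to test CPU-only execution

import torch
import colbert

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expected False with the line above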

If you face any problems, please open a new issue and we'll help you promptly!

Overview

Using ColBERT on a dataset typically involves the following steps.

Step 0: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., collection.tsv) will contain all passages and another (e.g., queries.tsv) will contain a set of queries for searching the collection.

Step 1: Download the pre-trained ColBERTv2 checkpoint. This checkpoint has been trained on the MS MARCO Passage Ranking task. You can also optionally train your own ColBERT model.

Step 2: Index your collection. Once you have a trained ColBERT model, you need to index your collection to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

Step 3: Search the collection with your queries. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task.

API Usage Notebook

NEW: We have an experimental notebook on Google Colab that you can use with free GPUs. Indexing 10,000 passages on the free Colab T4 GPU takes six minutes.

The Jupyter notebook docs/intro.ipynb illustrates the key features of ColBERT with the new Python API.

It includes how to download the ColBERTv2 model checkpoint trained on MS MARCO Passage Ranking and how to download our new LoTTE benchmark.

Data

This repository works directly with a simple tab-separated file format to store queries, passages, and top-k ranked lists.

  • Queries: each line is qid \t query text.
  • Collection: each line is pid \t passage text.
  • Top-k Ranking: each line is qid \t pid \t rank.

This works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (triples.train.small.tar.gz), the official top-1000 ranked lists for the dev set queries (top1000.dev), and the dev set relevant passages (qrels.dev.small.tsv). For indexing the full collection, you will also need the list of passages (collection.tar.gz).
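For example, a toy collection and query file in this format could be produced as follows (file names and contents are illustrative):

passages = [
    "ColBERT is a late-interaction retrieval model.",
    "MS MARCO is a passage ranking benchmark.",
]
queries = ["what is colbert retrieval"]

with open("collection.tsv", "w") as f:
    for pid, passage in enumerate(passages):
        f.write(f"{pid}\t{passage}\n")   # pid \t passage text

with open("queries.tsv", "w") as f:
    for qid, query in enumerate(queries):
        f.write(f"{qid}\t{query}\n")     # qid \t query text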

Indexing

For fast retrieval, indexing precomputes the ColBERT representations of passages.

Example usage:

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

if __name__=='__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            nbits=2,
            root="/path/to/experiments",
        )
        indexer = Indexer(checkpoint="/path/to/checkpoint", config=config)
        indexer.index(name="msmarco.nbits=2", collection="/path/to/MSMARCO/collection.tsv")

Retrieval

We typically recommend using ColBERT for end-to-end retrieval, where it directly retrieves the top-k passages for each query from the full collection:

from colbert.data import Queries
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

if __name__=='__main__':
    with Run().context(RunConfig(nranks=1, experiment="msmarco")):

        config = ColBERTConfig(
            root="/path/to/experiments",
        )
        searcher = Searcher(index="msmarco.nbits=2", config=config)
        queries = Queries("/path/to/MSMARCO/queries.dev.small.tsv")
        ranking = searcher.search_all(queries, k=100)
        ranking.save("msmarco.nbits=2.ranking.tsv")

You can optionally specify the ncells, centroid_score_threshold, and ndocs search hyperparameters to trade off between speed and result quality. Defaults for different values of k are listed in colbert/searcher.py.
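For example, assuming these settings are passed through ColBERTConfig, as the defaults in colbert/searcher.py suggest (the values below are placeholders, not tuned recommendations):

config = ColBERTConfig(
    root="/path/to/experiments",
    ncells=4,                       # candidate centroids probed per query token
    centroid_score_threshold=0.45,  # prune low-scoring centroids early
    ndocs=1024,                     # candidates kept for exact re-scoring
)
searcher = Searcher(index="msmarco.nbits=2", config=config)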

We can evaluate the MSMARCO rankings using the following command:

python -m utility.evaluate.msmarco_passages --ranking "/path/to/msmarco.nbits=2.ranking.tsv" --qrels "/path/to/MSMARCO/qrels.dev.small.tsv"

Training

We provide a pre-trained model checkpoint, but we also detail below how to train one from scratch. Note that this example demonstrates the ColBERTv1 style of training, whereas the provided checkpoint was trained with ColBERTv2.

Training requires a JSONL triples file with a [qid, pid+, pid-] list per line. The query IDs and passage IDs correspond to the specified queries.tsv and collection.tsv files respectively.
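A minimal sketch of producing such a file (the IDs below are placeholders; they must refer to entries in your queries.tsv and collection.tsv):

import json

triples = [
    [0, 12, 84],   # [qid, positive pid, negative pid]
    [1, 7, 301],
]

with open("triples.train.jsonl", "w") as f:
    for qid, pos_pid, neg_pid in triples:
        f.write(json.dumps([qid, pos_pid, neg_pid]) + "\n")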

Example usage (training on 4 GPUs):

from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

if __name__=='__main__':
    with Run().context(RunConfig(nranks=4, experiment="msmarco")):

        config = ColBERTConfig(
            bsize=32,
            root="/path/to/experiments",
        )
        trainer = Trainer(
            triples="/path/to/MSMARCO/triples.train.small.tsv",
            queries="/path/to/MSMARCO/queries.train.small.tsv",
            collection="/path/to/MSMARCO/collection.tsv",
            config=config,
        )

        checkpoint_path = trainer.train()

        print(f"Saved checkpoint to {checkpoint_path}...")

Running a lightweight ColBERTv2 server

We provide a script to run a lightweight server that serves the top-k (up to 100) results in ranked order for a given search query, in JSON format. This script can be used to power DSP programs.

To run the server, update the environment variables INDEX_ROOT and INDEX_NAME in the .env file to point to the appropriate ColBERT index. Then run the following command:

python server.py

A sample query:

http://localhost:8893/api/search?query=Who won the 2022 FIFA world cup&k=25
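For instance, a minimal sketch of calling the server from Python (the port and parameters follow the example URL above; the exact shape of the JSON response is not shown here and may differ):

import requests

resp = requests.get(
    "http://localhost:8893/api/search",
    params={"query": "Who won the 2022 FIFA world cup", "k": 25},
)
print(resp.json())  # top-k results in ranked order, as JSON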

Branches

Supported branches

  • main: Stable branch with ColBERTv2 + PLAID.
  • colbertv1: Legacy branch for ColBERTv1.

Deprecated branches

  • new_api: Base ColBERTv2 implementation.
  • cpu_inference: ColBERTv2 implementation with CPU search support.
  • fast_search: ColBERTv2 implementation with PLAID.
  • binarization: ColBERT with a baseline binarization-based compression strategy (as opposed to ColBERTv2's residual compression, which we found to be more robust).
