
SPLADE

SPLADE: sparse neural search (SIGIR21, SIGIR22)

This repository contains the code to perform training, indexing and retrieval for SPLADE models. It also includes everything needed to launch evaluation on the BEIR benchmark.

TL;DR: SPLADE is a neural retrieval model which learns query/document sparse expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted indexes, explicit lexical match, interpretability... They also seem to be better at generalizing on out-of-domain data (BEIR benchmark).
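
For intuition, here is a minimal sketch of that representation using plain Hugging Face transformers rather than this repository's own model classes; the log-saturation and max-pooling aggregation follow the formula from the papers, and the checkpoint is one of those listed below.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

tokens = tokenizer("sparse neural search", return_tensors="pt")
with torch.no_grad():
    logits = model(**tokens).logits  # (1, seq_len, vocab_size)

# log-saturate the MLM logits, mask padding, then max-pool over input tokens
weights, _ = torch.max(
    torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
    dim=1,
)
# the non-zero entries form the "bag-of-expanded-words"
nonzero = weights[0].nonzero().squeeze(1).tolist()
expansion = {tokenizer.decode([i]): round(weights[0, i].item(), 2) for i in nonzero}
print(sorted(expansion.items(), key=lambda kv: -kv[1])[:20])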

Benefiting from recent advances in training neural retrievers, our v2 models rely on hard-negative mining, distillation, and better Pre-trained Language Model initialization to further increase their effectiveness, on both in-domain (MS MARCO) and out-of-domain (BEIR benchmark) evaluation.

Finally, by introducing several modifications (query-specific regularization, disjoint encoders, etc.), we are able to improve efficiency, achieving latency on par with BM25 under the same computing constraints.

Weights for models trained under various settings can be found on the Naver Labs Europe website, as well as on Hugging Face. Please bear in mind that SPLADE is more a class of models than a single model per se: depending on the regularization magnitude, we can obtain different models (from very sparse to models doing intense query/doc expansion) with different properties and performance.

splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.


Getting started 🚀

Requirements

We recommend starting from a fresh environment and installing the packages from conda_splade_env.yml.

conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml

Usage

Playing with the model

inference_splade.ipynb allows you to load and perform inference with a trained model, in order to inspect the predicted "bag-of-expanded-words". We provide weights for six main models:

model | MRR@10 (MS MARCO dev)
----- | ---------------------
naver/splade_v2_max (v2, HF) | 34.0
naver/splade_v2_distil (v2, HF) | 36.8
naver/splade-cocondenser-selfdistil (SPLADE++, HF) | 37.6
naver/splade-cocondenser-ensembledistil (SPLADE++, HF) | 38.3
naver/efficient-splade-V-large-doc + naver/efficient-splade-V-large-query (efficient SPLADE, HF) | 38.8
naver/efficient-splade-VI-BT-large-doc + naver/efficient-splade-VI-BT-large-query (efficient SPLADE, HF) | 38.0

We also uploaded various models here. Feel free to try them out!

High-level overview of the code structure

  • This repository lets you train (train.py), index (index.py), and retrieve (retrieve.py) with SPLADE models, or perform every step at once with all.py.
  • To manage experiments, we rely on hydra. Please refer to conf/README.md for a complete guide on how we configured experiments.

Data

  • To train models, we rely on MS MARCO data.
  • We also further rely on distillation and hard-negative mining, using available datasets (Margin MSE Distillation, Sentence Transformers Hard Negatives) or datasets we built ourselves (e.g. negatives mined from SPLADE).
  • Most of the data formats are pretty standard; for validation, we rely on an approximate validation set, following a setting similar to TAS-B.

To simplify setup, we have made all our data folders available for download here. This link includes queries, documents and hard-negative data, allowing training under the EnsembleDistil setting (see the v2bis paper). For other settings (Simple, DistilMSE, SelfDistil), you also have to download:

After downloading, you can simply untar in the root directory, and the data will be placed in the right folder:

tar -xzvf file.tar.gz

Quick start

In order to perform all steps (here on toy data, i.e. config_default.yaml), go to the root directory and run:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
  config.checkpoint_dir=experiments/debug/checkpoint \
  config.index_dir=experiments/debug/index \
  config.out_dir=experiments/debug/out

Additional examples

We provide additional examples that can be plugged in the above code. See conf/README.md for details on how to change experiment settings.

  • you can similarly run training with python3 -m splade.train (and likewise for indexing or retrieval)
  • to create Anserini-readable files (after training), run SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100
  • config files for various settings (distillation etc.) are available in /conf. For instance, to run the SelfDistil setting:
    • change to SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml
    • to further change parameters (e.g. the regularization lambdas) outside the config, run: python3 -m splade.all config.regularizer.FLOPS.lambda_q=0.06 config.regularizer.FLOPS.lambda_d=0.02 (a sketch of the FLOPS regularizer follows this list)
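
As referenced above, here is a hedged sketch of the FLOPS regularizer those lambdas scale, following its definition in the SPLADE papers (the repository has its own implementation in its regularizer module, which may differ in details):

import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch, vocab_size) non-negative SPLADE representations.
    # FLOPS = sum_j (mean_i w_ij)^2 penalizes vocabulary dimensions that are
    # active on average across the batch, which drives representations sparse.
    return torch.sum(torch.mean(reps, dim=0) ** 2)

# the config's lambdas weight this term separately for queries and documents:
# loss = ranking_loss + lambda_q * flops_loss(q_reps) + lambda_d * flops_loss(d_reps)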

We provide several base configurations corresponding to the experiments in the v2bis and "efficiency" papers. Please note that these are suited to our hardware setting, i.e. 4 Tesla V100 GPUs with 32GB memory. In order to train models with e.g. a single GPU, you need to decrease the batch size for training and evaluation. Also note that, since the range of the loss might change with a different batch size, the corresponding regularization lambdas might need to be adapted. However, we provide a mono-GPU configuration, config_splade++_cocondenser_ensembledistil_monogpu.yaml, for which we obtain 37.2 MRR@10 when training on a single 16GB GPU.

Evaluating a pre-trained model

Indexing (and retrieval) can be done either with our (numba-based) implementation of an inverted index, or with Anserini. Let's perform these steps using an available model (naver/splade-cocondenser-ensembledistil); a toy sketch of how sparse scoring over an inverted index works follows the commands below.

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"
python3 -m splade.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index
python3 -m splade.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out
# pretrained_no_yamlconfig indicates that we solely rely on a HF-valid model path
  • To change the data, simply override the hydra retrieve_evaluate package, e.g. add retrieve_evaluate=msmarco as argument of splade.retrieve.
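
As promised above, a toy illustration of how retrieval over sparse SPLADE vectors uses an inverted index (the repository's numba-based implementation is more involved; this only shows the scoring principle, with made-up weights):

from collections import defaultdict

# toy corpus of sparse vectors: doc_id -> {term: weight}
docs = {
    0: {"neural": 1.8, "search": 2.4},
    1: {"sparse": 1.1, "search": 0.9},
}

# build the inverted index: term -> posting list of (doc_id, weight)
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def score(query_vec):
    # dot product accumulated over the posting lists of query terms only,
    # so documents sharing no term with the query are never touched
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score({"search": 2.0, "neural": 1.0}))  # [(0, 6.6), (1, 1.8)]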

You can similarly build the files that will be ingested by Anserini:

python3 -m splade.create_anserini \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  +quantization_factor_document=100 \
  +quantization_factor_query=100

This creates the JSON collection (docs_anserini.jsonl) as well as the queries (queries_anserini.tsv) needed by Anserini. You then just need to follow the regression for SPLADE here in order to index and retrieve.
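
For intuition, a hedged sketch of what the quantization factors above do (assumed behavior, not the repository's exact code): float term weights are scaled and rounded to integers so they can be stored as term impacts in an Anserini index.

def quantize(weights: dict[str, float], factor: int = 100) -> dict[str, int]:
    # scale and round the weights, then drop terms that quantize to zero
    quantized = {term: round(w * factor) for term, w in weights.items()}
    return {term: w for term, w in quantized.items() if w > 0}

print(quantize({"neural": 1.83, "search": 2.41, "the": 0.002}))
# {'neural': 183, 'search': 241}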

BEIR eval

You can also run evaluation on BEIR, for instance:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_FULLPATH="/path/to/checkpoint/dir/config.yaml"
for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
      +beir.dataset=$dataset \
      +beir.dataset_path=data/beir \
      config.index_retrieve_batch_size=100
done

PISA evaluation

We provide in efficient_splade_pisa/README.md the steps to evaluate efficient SPLADE models with PISA.


Cite 📜

Please cite our work as:

  • (v1) SIGIR21 short paper
@inbook{10.1145/3404835.3463098,
author = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463098},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2288–2292},
numpages = {5}
}
  • (v2) arxiv
@misc{https://doi.org/10.48550/arxiv.2109.10086,
  doi = {10.48550/ARXIV.2109.10086},
  url = {https://arxiv.org/abs/2109.10086},
  author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
  keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
  • (v2bis) SPLADE++, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531857,
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531857},
doi = {10.1145/3477495.3531857},
abstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2353–2359},
numpages = {7},
keywords = {neural networks, indexing, sparse representations, regularization},
location = {Madrid, Spain},
series = {SIGIR '22}
}
  • efficient SPLADE, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531833,
author = {Lassance, Carlos and Clinchant, St\'{e}phane},
title = {An Efficiency Study for SPLADE Models},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531833},
doi = {10.1145/3477495.3531833},
abstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2220–2226},
numpages = {7},
keywords = {splade, sparse representations, latency, information retrieval},
location = {Madrid, Spain},
series = {SIGIR '22}
}

Contact 📭

Feel free to contact us via Twitter or by mail at [email protected]!

License

SPLADE Copyright (c) 2021-present NAVER Corp.

SPLADE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. (see license)

You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/ .
