
SPLADE: sparse neural search (SIGIR21, SIGIR22)

SPLADE


This repository contains the code to perform training, indexing and retrieval for SPLADE models. It also includes everything needed to launch evaluation on the BEIR benchmark.

TL;DR: SPLADE is a neural retrieval model which learns query/document sparse expansion via the BERT MLM head and sparse regularization. Sparse representations offer several advantages over dense approaches: efficient use of inverted indexes, explicit lexical match, interpretability... They also seem to generalize better on out-of-domain data (BEIR benchmark).
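
For intuition, here is a minimal sketch of how such a sparse "bag-of-expanded-words" can be computed from a Hugging Face MLM checkpoint. This is illustrative only: it follows the log-saturation + max-pooling formulation of the papers, and inference_splade.ipynb (see below) is the reference implementation.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

tokens = tokenizer("sparse neural search", return_tensors="pt")
with torch.no_grad():
    logits = model(**tokens).logits  # (1, seq_len, vocab_size)

# w_j = max_i log(1 + ReLU(logits_ij)): one non-negative weight per vocab term,
# masked so that padding tokens do not contribute
weights = torch.max(
    torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
    dim=1,
).values.squeeze(0)

# inspect the top expanded terms
nonzero = weights.nonzero().squeeze(1)
expansion = {tokenizer.decode([i.item()]): weights[i].item() for i in nonzero}
print(sorted(expansion.items(), key=lambda kv: -kv[1])[:10])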

Benefiting from recent advances in training neural retrievers, our v2 models rely on hard-negative mining, distillation, and better Pre-trained Language Model initialization to further increase effectiveness, both in-domain (MS MARCO) and out-of-domain (BEIR benchmark).

Finally, by introducing several modifications (query-specific regularization, disjoint encoders, etc.), we are able to improve efficiency, achieving latency on par with BM25 under the same computing constraints.

Weights for models trained under various settings can be found on the Naver Labs Europe website, as well as on Hugging Face. Please bear in mind that SPLADE is more a class of models than a single model: depending on the regularization magnitude, we can obtain different models (from very sparse ones to models doing intense query/doc expansion) with different properties and performance.

splade: a spork that is sharp along one edge or both edges, enabling it to be used as a knife, a fork and a spoon.


Getting started 🚀

Requirements

We recommend starting from a fresh environment and installing the packages from conda_splade_env.yml.

conda create -n splade_env python=3.9
conda activate splade_env
conda env create -f conda_splade_env.yml

Usage

Playing with the model

inference_splade.ipynb lets you load a trained model and perform inference, in order to inspect the predicted "bag-of-expanded-words". We provide weights for six main models:

model | MRR@10 (MS MARCO dev)
naver/splade_v2_max (v2, HF) | 34.0
naver/splade_v2_distil (v2, HF) | 36.8
naver/splade-cocondenser-selfdistil (SPLADE++, HF) | 37.6
naver/splade-cocondenser-ensembledistil (SPLADE++, HF) | 38.3
naver/efficient-splade-V-large-doc + naver/efficient-splade-V-large-query (efficient SPLADE, HF) | 38.8
naver/efficient-splade-VI-BT-large-doc + naver/efficient-splade-VI-BT-large-query (efficient SPLADE, HF) | 38.0

We also uploaded various models here. Feel free to try them out!
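
Note that the efficient SPLADE models in the table use disjoint document and query encoders, so both checkpoints are needed at inference time. A minimal sketch of scoring with such a pair (same pooling as above; illustrative only):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def splade_rep(model_id, text):
    # encode text into a sparse |V|-dimensional SPLADE representation
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
    return torch.max(
        torch.log1p(torch.relu(logits)) * tokens["attention_mask"].unsqueeze(-1),
        dim=1,
    ).values.squeeze(0)

# queries go through the (faster) query encoder, documents through the doc encoder
q = splade_rep("naver/efficient-splade-V-large-query", "what is sparse retrieval")
d = splade_rep("naver/efficient-splade-V-large-doc", "SPLADE is a sparse neural retrieval model.")
print(torch.dot(q, d))  # ranking score = sparse dot product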

High level overview of the code structure

  • This repository lets you train (train.py), index (index.py), and retrieve (retrieve.py) with SPLADE models, or perform every step at once with all.py.
  • To manage experiments, we rely on hydra. Please refer to conf/README.md for a complete guide on how we configured experiments.

Data

  • To train models, we rely on MS MARCO data.
  • We also further rely on distillation and hard-negative mining, using available datasets (Margin-MSE Distillation, Sentence Transformers Hard Negatives) or datasets we built ourselves (e.g. negatives mined from SPLADE); a sketch of the Margin-MSE objective follows this list.
  • Most of the data formats are pretty standard; for validation, we rely on an approximate validation set, following a setting similar to TAS-B.
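
As a reference for the distillation setting mentioned above, the Margin-MSE objective can be sketched as follows (a minimal PyTorch version; tensor names are ours):

import torch
import torch.nn.functional as F

def margin_mse_loss(q_reps, pos_reps, neg_reps, teacher_pos, teacher_neg):
    # q_reps, pos_reps, neg_reps: (batch, vocab_size) student representations
    # teacher_pos, teacher_neg: (batch,) cross-encoder teacher scores
    # The student is trained to match the teacher's score *margin*
    # between the positive and the hard-negative document.
    student_margin = (q_reps * pos_reps).sum(-1) - (q_reps * neg_reps).sum(-1)
    return F.mse_loss(student_margin, teacher_pos - teacher_neg)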

To simplify setup, we made all our data folders available; they can be downloaded here. This link includes queries, documents and hard-negative data, allowing training under the EnsembleDistil setting (see the v2bis paper). For the other settings (Simple, DistilMSE, SelfDistil), you also have to download:

After downloading, untar the archive in the root directory, and the data will be placed in the right folders.

tar -xzvf file.tar.gz

Quick start

To perform all steps (here on toy data, i.e. config_default.yaml), go to the root directory and run:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_default.yaml"
python3 -m splade.all \
  config.checkpoint_dir=experiments/debug/checkpoint \
  config.index_dir=experiments/debug/index \
  config.out_dir=experiments/debug/out

Additional examples

We provide additional examples that can be plugged in the above code. See conf/README.md for details on how to change experiment settings.

  • You can similarly run training with python3 -m splade.train (and likewise for indexing or retrieval).
  • To create Anserini-readable files (after training), run SPLADE_CONFIG_FULLPATH=/path/to/checkpoint/dir/config.yaml python3 -m splade.create_anserini +quantization_factor_document=100 +quantization_factor_query=100
  • Config files for various settings (distillation etc.) are available in /conf. For instance, to run the SelfDistil setting:
    • set SPLADE_CONFIG_NAME=config_splade++_selfdistil.yaml
    • to further change parameters (e.g. the regularization lambdas) outside the config, run: python3 -m splade.all config.regularizer.FLOPS.lambda_q=0.06 config.regularizer.FLOPS.lambda_d=0.02 (a sketch of this FLOPS regularizer follows this list)
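
For reference, the FLOPS regularizer that these lambdas weight can be sketched as follows (a minimal version of the formulation in the papers; the actual implementation lives in the repo):

import torch

def flops_loss(reps: torch.Tensor) -> torch.Tensor:
    # reps: (batch_size, vocab_size) non-negative representations.
    # FLOPS = sum_j (mean_i w_ij)^2 penalizes terms that are active on
    # average across the batch, pushing representations toward sparsity.
    return torch.sum(torch.mean(reps, dim=0) ** 2)

# schematically: loss = ranking_loss
#                     + lambda_q * flops_loss(query_reps)
#                     + lambda_d * flops_loss(doc_reps)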

We provide several base configurations corresponding to the experiments in the v2bis and "efficiency" papers. Please note that these are suited to our hardware setting, i.e. 4 Tesla V100 GPUs with 32GB memory. To train models with, e.g., a single GPU, you need to decrease the batch size for training and evaluation. Also note that, since the range of the loss may change with a different batch size, the corresponding regularization lambdas might need to be adapted. We nonetheless provide a mono-GPU configuration, config_splade++_cocondenser_ensembledistil_monogpu.yaml, for which we obtain 37.2 MRR@10 when training on a single 16GB GPU.

Evaluating a pre-trained model

Indexing (and retrieval) can be done either with our (numba-based) implementation of an inverted index, or with Anserini. Let's perform these steps using an available model (naver/splade-cocondenser-ensembledistil); a toy sketch of such an inverted index follows the commands below.

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_NAME="config_splade++_cocondenser_ensembledistil"
python3 -m splade.index \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index
python3 -m splade.retrieve \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  config.out_dir=experiments/pre-trained/out
# pretrained_no_yamlconfig indicates that we solely rely on a HF-valid model path
  • To change the data, simply override the hydra retrieve_evaluate package, e.g. add retrieve_evaluate=msmarco as argument of splade.retrieve.
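
To give an idea of what happens under the hood, here is a toy, pure-Python inverted index over such sparse representations (the repo's numba-based implementation is the optimized equivalent; the term and document ids here are hypothetical):

from collections import defaultdict

index = defaultdict(list)  # term -> postings list of (doc_id, weight)

def add_document(doc_id, rep):
    # rep: {term: weight} sparse document representation
    for term, weight in rep.items():
        index[term].append((doc_id, weight))

def retrieve(query_rep, k=10):
    # score(q, d) = sum over shared terms of q_weight * d_weight
    scores = defaultdict(float)
    for term, q_weight in query_rep.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

add_document("d1", {"aurora": 2.1, "borealis": 1.8, "lights": 0.9})
add_document("d2", {"lights": 1.2, "city": 1.5})
print(retrieve({"aurora": 1.3, "lights": 0.4}))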

You can similarly build the files that will be ingested by Anserini:

python3 -m splade.create_anserini \
  init_dict.model_type_or_dir=naver/splade-cocondenser-ensembledistil \
  config.pretrained_no_yamlconfig=true \
  config.index_dir=experiments/pre-trained/index \
  +quantization_factor_document=100 \
  +quantization_factor_query=100

This will create the JSON collection (docs_anserini.jsonl) as well as the queries (queries_anserini.tsv) needed by Anserini. You then just need to follow the regression for SPLADE here in order to index and retrieve.
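
The quantization factors used above scale the float term weights into the integer impacts Anserini expects; schematically (assumed rounding behavior, not the exact implementation):

def quantize(rep, factor=100):
    # rep: {term: float_weight} -> {term: int_impact}, dropping zeroed terms
    quantized = {term: int(round(w * factor)) for term, w in rep.items()}
    return {term: w for term, w in quantized.items() if w > 0}

print(quantize({"aurora": 2.137, "lights": 0.003}))  # {'aurora': 214}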

BEIR eval

You can also run evaluation on BEIR, for instance:

conda activate splade_env
export PYTHONPATH=$PYTHONPATH:$(pwd)
export SPLADE_CONFIG_FULLPATH="/path/to/checkpoint/dir/config.yaml"
for dataset in arguana fiqa nfcorpus quora scidocs scifact trec-covid webis-touche2020 climate-fever dbpedia-entity fever hotpotqa nq
do
    python3 -m splade.beir_eval \
      +beir.dataset=$dataset \
      +beir.dataset_path=data/beir \
      config.index_retrieve_batch_size=100
done

PISA evaluation

We provide in efficient_splade_pisa/README.md the steps to evaluate efficient SPLADE models with PISA.


Cite 📜

Please cite our work as:

  • (v1) SIGIR21 short paper
@inproceedings{10.1145/3404835.3463098,
author = {Formal, Thibault and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463098},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2288–2292},
numpages = {5}
}
  • (v2) arxiv
@misc{https://doi.org/10.48550/arxiv.2109.10086,
  doi = {10.48550/ARXIV.2109.10086},
  url = {https://arxiv.org/abs/2109.10086},
  author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, Stéphane},
  keywords = {Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval},
  publisher = {arXiv},
  year = {2021},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
  • (v2bis) SPLADE++, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531857,
author = {Formal, Thibault and Lassance, Carlos and Piwowarski, Benjamin and Clinchant, St\'{e}phane},
title = {From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531857},
doi = {10.1145/3477495.3531857},
abstract = {Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, a lesser effort has been put in the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to which extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining as well as the Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, on in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2353–2359},
numpages = {7},
keywords = {neural networks, indexing, sparse representations, regularization},
location = {Madrid, Spain},
series = {SIGIR '22}
}
  • efficient SPLADE, SIGIR22 short paper
@inproceedings{10.1145/3477495.3531833,
author = {Lassance, Carlos and Clinchant, St\'{e}phane},
title = {An Efficiency Study for SPLADE Models},
year = {2022},
isbn = {9781450387323},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531833},
doi = {10.1145/3477495.3531833},
abstract = {Latency and efficiency issues are often overlooked when evaluating IR models based on Pretrained Language Models (PLMs) in reason of multiple hardware and software testing scenarios. Nevertheless, efficiency is an important part of such systems and should not be overlooked. In this paper, we focus on improving the efficiency of the SPLADE model since it has achieved state-of-the-art zero-shot performance and competitive results on TREC collections. SPLADE efficiency can be controlled via a regularization factor, but solely controlling this regularization has been shown to not be efficient enough. In order to reduce the latency gap between SPLADE and traditional retrieval systems, we propose several techniques including L1 regularization for queries, a separation of document/query encoders, a FLOPS-regularized middle-training, and the use of faster query encoders. Our benchmark demonstrates that we can drastically improve the efficiency of these models while increasing the performance metrics on in-domain data. To our knowledge, we propose the first neural models that, under the same computing constraints, achieve similar latency (less than 4ms difference) as traditional BM25, while having similar performance (less than 10% MRR@10 reduction) as the state-of-the-art single-stage neural rankers on in-domain data.},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2220–2226},
numpages = {7},
keywords = {splade, sparse representations, latency, information retrieval},
location = {Madrid, Spain},
series = {SIGIR '22}
}

Contact 📭

Feel free to contact us via Twitter or by mail @ [email protected]!

License

SPLADE Copyright (c) 2021-present NAVER Corp.

SPLADE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. (see license)

You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/ .
