From Zero to Learning to Rank in Apache Solr

This tutorial describes how to implement a modern learning to rank (LTR) system in Apache Solr. The intended audience is people who have zero Solr experience, but who are comfortable with machine learning and information retrieval concepts. I was one of those people only a couple of months ago, and I found it extremely challenging to get up and running with the Solr materials available on the internet. This is my attempt at writing the tutorial I wish I had discovered.

Setting Up Solr

Firing up a vanilla Solr instance on Fedora is actually pretty straightforward. First, download the Solr source tarball (so, one containing "src") from here and extract it to a reasonable location. Next, cd into the Solr directory:

cd /path/to/solr-<version>/solr

Building Solr requires Apache Ant and Apache Ivy, so we'll have to install those:

sudo dnf install ant ivy

And now we'll build Solr.

ant ivy-bootstrap
ant server

You can confirm Solr is working by running:

bin/solr start

and making sure you see the Solr Admin interface at http://localhost:8983/solr/. You can stop Solr (but don't stop it now) with:

bin/solr stop

Solr Basics

Solr is a search platform, so we only really need to know how to do two things to function: (1) index data and (2) define a ranking model. Solr has a REST-like API, which means we'll be making changes with the curl command. To get going, let's first create a core named test:

bin/solr create -c test

This seemingly simple command actually did a lot of stuff behind the scenes. Specifically, it defined a schema, which tells Solr how documents should be processed (think tokenization, stemming, etc.) and searched (e.g., using the tf-idf vector space model), and it set up a configuration file, which specifies what libraries and handlers Solr will use. A core can be deleted with:

bin/solr delete -c test

OK, let's add some documents. First download this XML file of tweets provided on the Solr in Action GitHub. Take a look inside the XML file. Notice how it's using an <add> tag to tell Solr to add several documents (denoted with <doc> tags) to the index. To actually index the tweets, we run:

bin/post -c test /path/to/tweets.xml

Now, if we go to http://localhost:8983/solr/ (you might have to refresh) and click on the "Core Selector" dropdown on the left hand side, we can select the test core. If we then click on the "Query" tab, the query interface will appear. If we click on the blue "Execute Query" button at the bottom, a JSON document containing information regarding the tweets we just indexed will be displayed. Congratulations, you just ran your first successful query! Specifically, you used the /select RequestHandler to execute the query *:*. The *:* is a special syntax that tells Solr to return everything. The Solr query syntax is not very intuitive, in my opinion, so it's something you'll just have to get used to.
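The same query can also be issued outside the Admin UI. The snippet below is a minimal sketch that only builds the request URL for the /select handler; the `select_url` helper and its `rows` default are my own illustrative choices, and the host and core name are assumed to match the ones used in this tutorial.

```python
from urllib.parse import urlencode

# Assumed to match the local Solr instance and core created above.
SOLR_URL = "http://localhost:8983/solr"
CORE = "test"


def select_url(query="*:*", rows=10):
    """Build a URL for the /select handler (hypothetical helper)."""
    # urlencode percent-escapes characters such as "*" and ":".
    params = {"q": query, "rows": rows}
    return "{0}/{1}/select?{2}".format(SOLR_URL, CORE, urlencode(params))


print(select_url())
```

Pasting the printed URL into a browser (or passing it to curl) should return the same kind of JSON document as the "Execute Query" button.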

Defining Features

OK, now that we have a basic Solr instance up and running, let's define some features for our LTR system. As with all machine learning problems, effective feature engineering is critical to success. Standard features in modern LTR models include using multiple similarity measures (e.g., cosine similarity of tf-idf vectors or BM25) to compare multiple text fields (e.g., body, title), in addition to other text characteristics (e.g., length) and document characteristics (e.g., age, PageRank). A good starting point is this list of features put together by Microsoft Research for an academic data set. A list of some other commonly used features can be found on slide 32 of these lecture notes.

To start off, we're going to modify /path/to/solr-<version>/solr/server/solr/test/conf/managed-schema so that it includes the text fields that we'll need for our model. First, we'll change the text field so that it is of the text_general type (which is already defined inside managed-schema). The text_general type will allow us to calculate BM25 similarities. Because the text field already exists (it was automatically created when we indexed the tweets), we need to use the replace-field command like so:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field" : {
     "name":"text",
     "type":"text_general",
     "indexed":"true",
     "stored":"true",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema

I encourage you to take a look inside managed-schema following each change so that you can get a sense for what's happening. Next, we're going to specify a text_tfidf type, which will allow us to calculate tf-idf cosine similarities:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
     "name":"text_tfidf",
     "class":"solr.TextField",
     "positionIncrementGap":"100",
     "indexAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filters":[
           {"class":"solr.StopFilterFactory",
            "ignoreCase":"true",
            "words":"stopwords.txt"},
           {"class":"solr.LowerCaseFilterFactory"}]},
     "queryAnalyzer":{
        "tokenizer":{
           "class":"solr.StandardTokenizerFactory"},
        "filters":[
           {"class":"solr.StopFilterFactory",
            "ignoreCase":"true",
            "words":"stopwords.txt"},
           {"class":"solr.SynonymGraphFilterFactory",
            "ignoreCase":"true",
            "synonyms":"synonyms.txt"},
           {"class":"solr.LowerCaseFilterFactory"}]},
     "similarity":{
           "class":"solr.ClassicSimilarityFactory"}}
}' http://localhost:8983/solr/test/schema

Let's now add a text_tfidf field that will be of the text_tfidf type we just defined:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
     "name":"text_tfidf",
     "type":"text_tfidf",
     "indexed":"true",
     "stored":"false",
     "multiValued":"true"}
}' http://localhost:8983/solr/test/schema

Because the contents of the text field and the text_tfidf field are the same (we're just handling them differently), we will tell Solr to copy the contents from text to text_tfidf:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-copy-field" : {
     "source":"text",
     "dest":"text_tfidf"}
}' http://localhost:8983/solr/test/schema

We're now ready to re-index our data:

bin/post -c test /path/to/tweets.xml

Learning to Rank

Now that our documents are properly indexed, let's build an LTR model. If you're new to LTR, I recommend checking out this (long) paper by Tie-Yan Liu and this textbook also by Liu. If you're familiar with machine learning, the ideas shouldn't be too difficult to grasp. I also recommend checking out the Solr documentation on LTR, which I'll be linking to throughout this section. Enabling LTR in Solr first requires making some changes to /path/to/solr-<version>/solr/server/solr/test/conf/solrconfig.xml. Copy and paste the below text anywhere between the <config> and </config> tags (at the top and bottom of the file, respectively).

<lib dir="${solr.install.dir:../../../..}/contrib/ltr/lib/" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-ltr-\d.*\.jar" />

<queryParser name="ltr" class="org.apache.solr.ltr.search.LTRQParserPlugin"/>

<cache name="QUERY_DOC_FV"
       class="solr.search.LRUCache"
       size="4096"
       initialSize="2048"
       autowarmCount="4096"
       regenerator="solr.search.NoOpRegenerator" />

<transformer name="features" class="org.apache.solr.ltr.response.transform.LTRFeatureLoggerTransformerFactory">
  <str name="fvCacheName">QUERY_DOC_FV</str>
</transformer>

We're now ready to run Solr with LTR enabled. First, stop Solr:

bin/solr stop

and then restart it with the LTR plugin enabled:

bin/solr start -Dsolr.ltr.enabled=true

Next, we need to push the model features and the model specification to Solr. In Solr, LTR features are defined using a JSON formatted file. For our model, we'll save the following features in my_efi_features.json:

[
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "tfidf_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text_tfidf}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_a",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_a}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "bm25_sim_b",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf=text}${text_b}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "max_sim",
    "class" : "org.apache.solr.ltr.feature.SolrFeature",
    "params" : { "q" : "{!dismax qf='text text_tfidf'}${text}" }
  },
  {
    "store" : "my_efi_feature_store",
    "name" : "original_score",
    "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params" : {}
  }
]

store tells Solr where to store the feature, name is the name of the feature, class specifies which Java class will handle the feature, and params provides additional information about the feature required by its Java class. In the case of a SolrFeature, you need to provide the query. {!dismax qf=text_tfidf}${text_a} tells Solr to search the text_tfidf field with the contents of text_a using the DisMaxQParser. The reason we're using the DisMax parser instead of the seemingly more obvious FieldQParser (e.g., {!field f=text_tfidf}${text_a}) is because the FieldQParser automatically converts multi-term queries to "phrases" (i.e., it converts something like "the cat in the hat" into, effectively, "the_cat_in_the_hat", rather than "the", "cat", "in", "the", "hat"). This FieldQParser behavior (which seems like a rather strange default to me) ended up giving me quite a headache, but I eventually found a solution with DisMaxQParser.

{!dismax qf='text text_tfidf'}${text} tells Solr to search both the text and text_tfidf fields with the contents of text and then take the max of those two scores. While this feature doesn't really make sense in this context because we're already using similarities from both fields as features, it demonstrates how such a feature could be implemented. For example, imagine that the documents in your corpus are linked to, at most, five other sources of text data. It might make sense to incorporate that information during a search, and taking the max over multiple similarity scores is one way of doing that.

Finally, OriginalScoreFeature "returns the original score that the document had before performing the reranking". This feature is necessary for returning the results in their original ranking when extracting features (note: OriginalScoreFeature is broken on Solr versions prior to 7.1).
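Because the SolrFeature entries differ only in their name, field list, and efi variable, the feature file can also be generated rather than written by hand. The following is a rough sketch under that assumption; the `solr_feature` helper is hypothetical and not part of Solr.

```python
import json

STORE = "my_efi_feature_store"
SOLR_FEATURE = "org.apache.solr.ltr.feature.SolrFeature"
ORIGINAL_SCORE = "org.apache.solr.ltr.feature.OriginalScoreFeature"


def solr_feature(name, fields, efi_name):
    """Return one SolrFeature entry (hypothetical helper, not a Solr API)."""
    qf = " ".join(fields)
    # Multiple fields have to be quoted inside the local params.
    if len(fields) > 1:
        qf = "'{0}'".format(qf)
    query = "{{!dismax qf={0}}}${{{1}}}".format(qf, efi_name)
    return {"store": STORE, "name": name, "class": SOLR_FEATURE,
            "params": {"q": query}}


features = [
    solr_feature("tfidf_sim_a", ["text_tfidf"], "text_a"),
    solr_feature("tfidf_sim_b", ["text_tfidf"], "text_b"),
    solr_feature("bm25_sim_a", ["text"], "text_a"),
    solr_feature("bm25_sim_b", ["text"], "text_b"),
    solr_feature("max_sim", ["text", "text_tfidf"], "text"),
    {"store": STORE, "name": "original_score", "class": ORIGINAL_SCORE,
     "params": {}},
]

# Print the JSON; redirect it to my_efi_features.json if it looks right.
print(json.dumps(features, indent=2))
```

The printed JSON should match the contents of my_efi_features.json shown above.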

To push the features to Solr, we run the following command:

curl -XPUT 'http://localhost:8983/solr/test/schema/feature-store' --data-binary "@/path/to/my_efi_features.json" -H 'Content-type:application/json'

If you ever want to upload new features, you have to first delete the old features with:

curl -XDELETE 'http://localhost:8983/solr/test/schema/feature-store/my_efi_feature_store'

Next, we'll save the following model specification in my_efi_model.json:

{
  "store" : "my_efi_feature_store",
  "name" : "my_efi_model",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [
    { "name" : "tfidf_sim_a" },
    { "name" : "tfidf_sim_b" },
    { "name" : "bm25_sim_a" },
    { "name" : "bm25_sim_b" },
    { "name" : "max_sim" },
    { "name" : "original_score" }
  ],
  "params" : {
    "weights" : {
      "tfidf_sim_a" : 0.0,
      "tfidf_sim_b" : 0.0,
      "bm25_sim_a" : 0.0,
      "bm25_sim_b" : 0.0,
      "max_sim" : 0.0,
      "original_score" : 1.0
    }
  }
}

store specifies where the features the model is using are stored, name is the name of the model, class specifies which Java class will implement the model, features is a list of the model features, and params provides additional information required by the model's Java class. To start off with, we'll use the LinearModel, which simply takes a weighted sum of the feature values to generate a score. Here, we assign a weight of 0.0 to each feature except original_score, which is assigned a weight of 1.0. This weighting scheme will ensure the results are returned in their original order. To find better weights, we'll need to extract training data from Solr. I'll go over this topic in more depth in the RankNet section.
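To make the weighted sum concrete, here is a small sketch that applies the weights above to one document's feature values. The feature values are copied from the example response shown later in this section, and `linear_score` is just an illustrative helper, not Solr code.

```python
# Weights from my_efi_model.json.
weights = {
    "tfidf_sim_a": 0.0,
    "tfidf_sim_b": 0.0,
    "bm25_sim_a": 0.0,
    "bm25_sim_b": 0.0,
    "max_sim": 0.0,
    "original_score": 1.0,
}

# Feature values for document 1 in the example LTR response.
features = {
    "tfidf_sim_a": 0.35304558,
    "tfidf_sim_b": 0.0,
    "bm25_sim_a": 0.93088603,
    "bm25_sim_b": 0.93088603,
    "max_sim": 1.8617721,
    "original_score": 1.8617721,
}


def linear_score(features, weights):
    """Weighted sum of feature values (illustrative helper)."""
    return sum(weights[name] * value for name, value in features.items())


score = linear_score(features, weights)
print(score)
```

With these weights the score is simply the original score, which is why the results keep their original order.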

We can push the model to Solr with:

curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_efi_model.json" -H 'Content-type:application/json'

And now we're ready to run our first LTR query:

http://localhost:8983/solr/test/query?q=historic north&df=text&rq={!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]

You should see something like:

{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"historic north",
      "df":"text",
      "fl":"id,score,[features]",
      "rq":"{!ltr model=my_efi_model efi.text_a=historic efi.text_b=north efi.text='historic north'}"}},
  "response":{"numFound":1,"start":0,"maxScore":1.8617721,"docs":[
      {
        "id":"1",
        "score":1.8617721,
        "[features]":"tfidf_sim_a=0.35304558,tfidf_sim_b=0.0,bm25_sim_a=0.93088603,bm25_sim_b=0.93088603,max_sim=1.8617721,original_score=1.8617721"}]
  }}

Referring back to the request, q=historic north is the query used to fetch the initial results (using BM25 in this case), which are then re-ranked with the LTR model. df=text specifies the default field for Solr to search. rq is where all of the LTR parameters are provided. efi stands for "external feature information", which allows you to specify additional inputs at query time. In this case, we're populating the text_a argument with the term historic, the text_b argument with the term north, and the text argument with the multi-term query 'historic north' (note, this is not being treated as a "phrase"). fl=id,score,[features] tells Solr to include the id, score, and model features in the results. You can verify that the feature values are correct by performing the associated search in the "Query" interface of the Solr Admin UI. For example, typing text_tfidf:historic in the q text box and typing score in the fl text box and then clicking the "Execute Query" button should return a value of 0.35304558.
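The request and the [features] string are both easy to handle from Python. The sketch below builds a URL-encoded version of the request and parses a [features] value into a dictionary; `ltr_query_url` and `parse_features` are hypothetical helper names, and the base URL is assumed to match the local instance used here.

```python
from urllib.parse import urlencode


def ltr_query_url(model, text_a, text_b,
                  base="http://localhost:8983/solr/test/query"):
    """Build an LTR re-ranking request URL (hypothetical helper)."""
    text = "{0} {1}".format(text_a, text_b)
    # efi.* values are passed to the features as external feature information.
    rq = ("{{!ltr model={0} efi.text_a='{1}' efi.text_b='{2}' efi.text='{3}'}}"
          .format(model, text_a, text_b, text))
    params = {"q": text, "df": "text", "rq": rq, "fl": "id,score,[features]"}
    return base + "?" + urlencode(params)


def parse_features(feature_string):
    """Turn 'name=value,name=value' into a dict of floats."""
    pairs = (pair.split("=") for pair in feature_string.split(","))
    return {name: float(value) for name, value in pairs}


url = ltr_query_url("my_efi_model", "historic", "north")
print(url)

# The [features] value from the example response above.
sample = ("tfidf_sim_a=0.35304558,tfidf_sim_b=0.0,bm25_sim_a=0.93088603,"
          "bm25_sim_b=0.93088603,max_sim=1.8617721,original_score=1.8617721")
print(parse_features(sample))
```

Encoding the parameters avoids problems with spaces and braces when the request is sent from a script instead of a browser.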

RankNet

For LTR systems, linear models are generally trained using what's called a "pointwise" approach, which is where documents are considered individually (i.e., the model asks, "Is this document relevant to the query or not?"); however, pointwise approaches are generally not well-suited for LTR problems. RankNet is a neural network that uses a "pairwise" approach, which is where documents with a known relative preference are considered in pairs (i.e., the model asks, "Is document A more relevant than document B for the query or not?"). RankNet is available in Solr as of version 7.3 (you can verify your version of Solr includes RankNet by inspecting /path/to/solr-<version>/solr/dist/solr-ltr-{version}-SNAPSHOT.jar and looking for NeuralNetworkModel.class under /org/apache/solr/ltr/model/). I've also implemented RankNet in Keras for model training. It's worth noting that LambdaMART might be more appropriate for your particular search application. However, RankNet can be trained quickly on a GPU using Keras, which makes it a good solution for search problems where only one document is relevant to any given query. For a nice (technical) overview of RankNet, LambdaRank, and LambdaMART, see this paper by Chris Burges from (at the time) Microsoft Research.
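The pairwise idea can be sketched in a few lines of NumPy: the model assigns each document a score, the probability that the relevant document should rank higher is the sigmoid of the score difference, and the loss is the negative log of that probability. The function names below are my own, and this is a simplified illustration rather than the Keras training code used later.

```python
import numpy as np


def pairwise_prob(rel_scores, irr_scores):
    """P(relevant document ranks above irrelevant one) for each pair."""
    diff = np.asarray(rel_scores, dtype="float64") - np.asarray(irr_scores, dtype="float64")
    return 1.0 / (1.0 + np.exp(-diff))


def ranknet_loss(rel_scores, irr_scores):
    """Mean of -log(sigmoid(s_rel - s_irr)) over all pairs."""
    diff = np.asarray(rel_scores, dtype="float64") - np.asarray(irr_scores, dtype="float64")
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp keeps it stable.
    return float(np.logaddexp(0.0, -diff).mean())


# Made-up scores for three (relevant, irrelevant) pairs.
rel = [2.0, 0.5, -1.0]
irr = [0.0, 0.5, 1.0]
print(pairwise_prob(rel, irr))
print(ranknet_loss(rel, irr))
```

The loss shrinks as relevant documents are scored further above irrelevant ones, which is what pushes them up the ranking.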

Unfortunately, the suggested method of feature extraction in Solr is painfully slow (other Solr users seem to agree it could be faster). Even when making the requests in parallel, it took me almost three days to extract features for ~200,000 queries. I think a better approach might be to do something like this, where you index the queries and then calculate the similarities between the "documents" (which consist of the true documents and queries), but this is really something that should be baked into Solr. Anyway, here is some Python template code for extracting features from Solr using queries (note: this code cannot be run as is):

import numpy as np
import requests
import simplejson

# Number of documents to be re-ranked.
RERANK = 50
with open("RERANK.int", "w") as f:
    f.write(str(RERANK))

# Build query URL.
q_id = row["id"]
text_a = row["text_a"].strip().lower()
text_b = row["text_b"].strip().lower()
text = " ".join([text_a, text_b])

url = "http://localhost:8983/solr/test/query"
url += "?q={0}&df=text&rq={{!ltr model=my_efi_model ".format(text)
url += "efi.text_a='{0}' efi.text_b='{1}' efi.text='{2}'}}".format(text_a, text_b, text)
url += "&fl=id,score,[features]&rows={0}".format(RERANK)

# Get response and check for errors.
response = requests.request("GET", url)
try:
    json = simplejson.loads(response.text)
except simplejson.JSONDecodeError:
    print(q_id)

if "error" in json:
    print(q_id)

# Extract the features.
results_features = []
results_targets = []
results_ranks = []
add_data = False

for (rank, document) in enumerate(json["response"]["docs"]):

    features = document["[features]"].split(",")
    feature_array = []
    for feature in features:
        feature_array.append(feature.split("=")[1])

    feature_array = np.array(feature_array, dtype = "float32")
    results_features.append(feature_array)

    doc_id = document["id"]
    # Check if document is relevant to query.
    if q_id in relevant.get(doc_id, {}):
        results_ranks.append(rank + 1)
        results_targets.append(1)
        add_data = True
    else:
        results_targets.append(0)

if add_data:
    np.save("{0}_X.npy".format(q_id), np.array(results_features))
    np.save("{0}_y.npy".format(q_id), np.array(results_targets))
    np.save("{0}_rank.npy".format(q_id), np.array(results_ranks))

We're now ready to train some models. To start off with, we'll pull in the data and evaluate the BM25 rankings on the entire data set.

import glob
import numpy as np

rank_files = glob.glob("*_rank.npy")
suffix_len = len("_rank.npy")

RERANK = int(open("RERANK.int").read())

ranks = []
casenumbers = []
Xs = []
ys = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    casenumbers.append(rank_file[:-suffix_len])
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue

    rank = np.load(rank_file)[0]
    ranks.append(rank)
    y = np.load(rank_file[:-suffix_len] + "_y.npy")
    Xs.append(X)
    ys.append(y)

ranks = np.array(ranks)
total_queries = len(ranks)
print("Total Queries: {0}".format(total_queries))
print("Top 1: {0}".format((ranks == 1).sum() / total_queries))
print("Top 3: {0}".format((ranks <= 3).sum() / total_queries))
print("Top 5: {0}".format((ranks <= 5).sum() / total_queries))
print("Top 10: {0}".format((ranks <= 10).sum() / total_queries))
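The same top-k report is printed several times in this section, so it may be convenient to wrap it in a function. This is a small sketch; `top_k_accuracy` is a hypothetical helper and the ranks in the example are made up.

```python
import numpy as np


def top_k_accuracy(ranks, ks=(1, 3, 5, 10)):
    """Fraction of queries whose relevant document is ranked within each k."""
    ranks = np.asarray(ranks)
    return {k: float((ranks <= k).mean()) for k in ks}


# Made-up ranks of the relevant document for five queries.
example_ranks = [1, 2, 4, 7, 12]
report = top_k_accuracy(example_ranks)
for k, accuracy in report.items():
    print("Top {0}: {1:.2f}".format(k, accuracy))
```

Passing the `ranks` array from the block above would reproduce the BM25 numbers printed there.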

Next, we'll build and evaluate a (pointwise) linear support vector machine.

from scipy.stats import rankdata
from sklearn.svm import LinearSVC

X = np.concatenate(Xs, 0)
y = np.concatenate(ys)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * RERANK
train_X = X[:train_cutoff]
train_y = y[:train_cutoff]
test_X = X[train_cutoff:]
test_y = y[train_cutoff:]

model = LinearSVC()
model.fit(train_X, train_y)
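# decision_function is the public API; higher values mean "more relevant",
# so the ranking is the same as with the private _predict_proba_lr method.
all_scores = model.decision_function(test_X)

n_test = int(len(test_y) / RERANK)
new_ranks = []
for i in range(n_test):
    start = i * RERANK
    end = start + RERANK
    scores = all_scores[start:end]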
    score_ranks = rankdata(-scores)
    old_rank = np.argmax(test_y[start:end])
    new_rank = score_ranks[old_rank]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

Now we can try out RankNet. First we'll assemble the training data so that each row consists of a relevant document vector concatenated with an irrelevant document vector (for a given query). Because we returned 50 rows in the feature extraction phase, each query will have 49 document pairs in the data set.

Xs = []
for rank_file in rank_files:
    X = np.load(rank_file[:-suffix_len] + "_X.npy")
    if X.shape[0] != RERANK:
        print(rank_file[:-suffix_len])
        continue

    rank = np.load(rank_file)[0]
    pos_example = X[rank - 1]
    for (i, neg_example) in enumerate(X):
        if i == rank - 1:
            continue
        Xs.append(np.concatenate((pos_example, neg_example)))

X = np.stack(Xs)
dim = int(X.shape[1] / 2)

train_per = 0.8
train_cutoff = int(train_per * len(ranks)) * (RERANK - 1)

train_X = X[:train_cutoff]
test_X = X[train_cutoff:]

Here, we build the model in Keras.

from keras import backend
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Add, Dense, Input, Lambda
from keras.models import Model

y = np.ones((train_X.shape[0], 1))

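# Number of features per document. This must equal dim from the previous block;
# the value 5 assumes original_score has been dropped from the extracted features.
INPUT_DIM = 5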
h_1_dim = 64
h_2_dim = h_1_dim // 2
h_3_dim = h_2_dim // 2

# Model.
h_1 = Dense(h_1_dim, activation = "relu")
h_2 = Dense(h_2_dim, activation = "relu")
h_3 = Dense(h_3_dim, activation = "relu")
s = Dense(1)

# Relevant document score.
rel_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_rel = h_1(rel_doc)
h_2_rel = h_2(h_1_rel)
h_3_rel = h_3(h_2_rel)
rel_score = s(h_3_rel)

# Irrelevant document score.
irr_doc = Input(shape = (INPUT_DIM, ), dtype = "float32")
h_1_irr = h_1(irr_doc)
h_2_irr = h_2(h_1_irr)
h_3_irr = h_3(h_2_irr)
irr_score = s(h_3_irr)

# Subtract scores.
negated_irr_score = Lambda(lambda x: -1 * x, output_shape = (1, ))(irr_score)
diff = Add()([rel_score, negated_irr_score])

# Pass difference through sigmoid function.
prob = Activation("sigmoid")(diff)

# Build model.
model = Model(inputs = [rel_doc, irr_doc], outputs = prob)
model.compile(optimizer = "adagrad", loss = "binary_crossentropy")

And now to train and test the model.

NUM_EPOCHS = 30
BATCH_SIZE = 32
checkpointer = ModelCheckpoint(filepath = "valid_params.h5", verbose = 1, save_best_only = True)
history = model.fit([train_X[:, :dim], train_X[:, dim:]], y,
                     epochs = NUM_EPOCHS, batch_size = BATCH_SIZE, validation_split = 0.05,
                     callbacks = [checkpointer], verbose = 2)

model.load_weights("valid_params.h5")
get_score = backend.function([rel_doc], [rel_score])
n_test = int(test_X.shape[0] / (RERANK - 1))
new_ranks = []
for i in range(n_test):
    start = i * (RERANK - 1)
    end = start + (RERANK - 1)
    pos_score = get_score([test_X[start, :dim].reshape(1, dim)])[0]
    neg_scores = get_score([test_X[start:end, dim:]])[0]

    scores = np.concatenate((pos_score, neg_scores))
    score_ranks = rankdata(-scores)
    new_rank = score_ranks[0]
    new_ranks.append(new_rank)

new_ranks = np.array(new_ranks)
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((new_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((new_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((new_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((new_ranks <= 10).sum() / n_test))

# Compare to BM25.
old_ranks = ranks[-n_test:]
print("Total Queries: {0}".format(n_test))
print("Top 1: {0}".format((old_ranks == 1).sum() / n_test))
print("Top 3: {0}".format((old_ranks <= 3).sum() / n_test))
print("Top 5: {0}".format((old_ranks <= 5).sum() / n_test))
print("Top 10: {0}".format((old_ranks <= 10).sum() / n_test))

If the model's results are satisfactory, we can save the parameters to a JSON file to be pushed to Solr:

import json

weights = model.get_weights()
solr_model = {"store" : "my_efi_feature_store",
              "name" : "my_ranknet_model",
              "class" : "org.apache.solr.ltr.model.NeuralNetworkModel",
              "features" : [
                { "name" : "tfidf_sim_a" },
                { "name" : "tfidf_sim_b" },
                { "name" : "bm25_sim_a" },
                { "name" : "bm25_sim_b" },
                { "name" : "max_sim" }
              ],
              "params": {}}
layers = []
layers.append({"matrix": weights[0].T.tolist(),
               "bias": weights[1].tolist(),
               "activation": "relu"})
layers.append({"matrix": weights[2].T.tolist(),
               "bias": weights[3].tolist(),
               "activation": "relu"})
layers.append({"matrix": weights[4].T.tolist(),
              "bias": weights[5].tolist(),
              "activation": "relu"})
layers.append({"matrix": weights[6].T.tolist(),
              "bias": weights[7].tolist(),
              "activation": "identity"})
solr_model["params"]["layers"] = layers

with open("my_ranknet_model.json", "w") as out:
    json.dump(solr_model, out, indent = 4)

and it's pushed the same as before:

curl -XPUT 'http://localhost:8983/solr/test/schema/model-store' --data-binary "@/path/to/my_ranknet_model.json" -H 'Content-type:application/json'
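As a sanity check on the exported layers, it can help to confirm that the transposed matrices give the same score as the Keras-style forward pass. The sketch below does this with NumPy, assuming each Solr layer computes activation(matrix · input + bias). Random weights stand in for model.get_weights(), `solr_nn_score` is a hypothetical helper, and only the relu and identity activations used above are handled.

```python
import numpy as np


def solr_nn_score(layers, features):
    """Forward pass over layers in the exported format (illustrative)."""
    x = np.asarray(features, dtype="float64")
    for layer in layers:
        # Each matrix has one row per output node, hence the transposes above.
        x = np.asarray(layer["matrix"]) @ x + np.asarray(layer["bias"])
        if layer["activation"] == "relu":
            x = np.maximum(x, 0.0)
        elif layer["activation"] != "identity":
            raise ValueError("unsupported activation in this sketch")
    return float(x[0])


# Random stand-ins for model.get_weights(): kernels are (n_in, n_out).
rng = np.random.default_rng(0)
dims = [5, 64, 32, 16, 1]
weights = []
for n_in, n_out in zip(dims[:-1], dims[1:]):
    weights += [rng.normal(size=(n_in, n_out)), rng.normal(size=n_out)]

layers = [{"matrix": weights[2 * i].T.tolist(),
           "bias": weights[2 * i + 1].tolist(),
           "activation": "relu" if i < 3 else "identity"}
          for i in range(4)]

# Keras-style forward pass: input @ kernel + bias.
x = rng.normal(size=5)
h = x
for i in range(4):
    h = h @ weights[2 * i] + weights[2 * i + 1]
    if i < 3:
        h = np.maximum(h, 0.0)
keras_style_score = float(h[0])

print(solr_nn_score(layers, x), keras_style_score)
```

If the two printed numbers differ by more than floating-point noise, the layer order or a transpose in the export is probably wrong.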

We can also perform an LTR query like before, except this time we'll use model=my_ranknet_model.

http://localhost:8983/solr/test/query?q=historic north&df=text&rq={!ltr model=my_ranknet_model efi.text_a=historic efi.text_b=north efi.text='historic north'}&fl=id,score,[features]

And there you have it — a modern learning to rank setup in Apache Solr.
