Repository Details

Behavioral "black-box" testing for recommender systems

RecList

Overview

RecList is an open source library providing behavioral, "black-box" testing for recommender systems. Inspired by the pioneering work of Ribeiro et al. 2020 in NLP, we introduce a general plug-and-play procedure to scale up behavioral testing, with an easy-to-extend interface for custom use cases.

While quantitative metrics over held-out data points are important, recommenders need many more tests to function properly in the wild and to keep our confidence in them. For example, a model may boast an accuracy improvement over the entire dataset, yet be significantly worse than another model on rare items or new users. Similarly, a model that correctly recommends HDMI cables as an add-on for shoppers buying a TV may also wrongly recommend TVs to shoppers who are just buying a cable.

RecList's goal is to operationalize these important intuitions into a practical package for testing research and production models in a more nuanced way, without requiring unnecessary custom code and ad hoc procedures.

If you are not familiar with the library, we suggest first taking our small tour to get acquainted with the main abstractions through ready-made models and tests.

Colab Tutorials

  • Tutorial 101 - Introduction to RecList (Open in Colab)
  • Tutorial - RecList at EvalRS2023 (KDD) (Open in Colab)
  • Tutorial - FashionCLIP Evaluation with RecList (Open in Colab)

Status

  • RecList is free software released under the MIT license, and it has been adopted by popular open-source data challenges.
  • After a major API refactoring, RecList is now in beta.

Quick Start

You can take a quick tour online using our Colab notebook. To use RecList locally, clone the repository, create and activate a virtual environment, and install the package from pip (you can also install it from the repository root).

git clone https://github.com/jacopotagliabue/reclist
cd reclist
python3 -m venv venv
source venv/bin/activate
pip install reclist
cd examples
python dummy.py

The sample script will run a suite of tests on a dummy dataset and model, showcasing a typical workflow with the library. Note the commented arguments in the script, which you can use to customize the behavior of the library once you familiarize yourself with the basic patterns (e.g. using S3 to store the plots, leveraging a third-party tool to track experiments).

Once your development setup is working as expected, you can run

python evalrs_2023.py

to explore tests on a real-world dataset (make sure the files are available in the examples folder before you run the script). Finally, once you have run the sample scripts successfully, take the guided tour below to learn more about the abstractions and the full capabilities of RecList.

A Guided Tour

An instance of RecList represents a suite of tests for recommender systems.

As evalrs_2023.py shows, we leave users a wide range of options: we provide standard metrics out of the box in case your dataset is DataFrame-shaped (or you can and wish to turn it into such a shape), but we don't force any pattern on you if you just want to use RecList for the scaffolding it provides.

For example, the following code only assumes you have a dataset with golden labels, predictions, and metadata (e.g. item features) in the shape of a DataFrame:

cdf = DFSessionRecList(
    dataset=df_events,
    model_name="myDataFrameRandomModel",
    predictions=df_predictions,
    y_test=df_dataset,
    logger=LOGGER.LOCAL,
    metadata_store=METADATA_STORE.LOCAL,
    similarity_model=my_sim_model,
)

cdf(verbose=True)

Our library pre-packages standard recSys metrics and important behavioral tests, but it is built with extensibility in mind: you can re-use tests in new suites, or you can write new domain-specific suites and tests. Any suite must inherit from the main interface, and then declare its tests as functions decorated with @rec_test.

In the example below, a suite is defined with one slice-based test: the decorator and the return type are used to generate a chart automatically.

class MyRecList(RecList):

    @rec_test(test_type="AccuracyByCountry", display_type=CHART_TYPE.BARS)
    def accuracy_by_country(self):
        """
        Compute the accuracy by country

        NOTE: the accuracy here is just a random number.
        """
        from random import randint
        return {"US": randint(0, 100), "CA": randint(0, 100), "FR": randint(0, 100) }
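The test above returns random numbers on purpose. As a minimal sketch of what a real slice-based metric could compute, the function below calculates a hit rate per slice in plain Python. The function name, the toy data, and the choice of hit rate are assumptions made for illustration and are not part of RecList; a dictionary like the one it returns has the same shape as the one returned by accuracy_by_country.

```python
from collections import defaultdict


def hit_rate_by_slice(predictions, targets, slices):
    """Return {slice_name: hit rate}; a hit means the target is in the predicted list."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for predicted, target, slice_name in zip(predictions, targets, slices):
        totals[slice_name] += 1
        if target in predicted:
            hits[slice_name] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Toy data, one entry per test case (all values are made up for illustration).
predictions = [["a", "b"], ["c", "d"], ["a", "e"], ["f", "g"]]
targets = ["a", "x", "e", "g"]
countries = ["US", "US", "CA", "FR"]

print(hit_rate_by_slice(predictions, targets, countries))
# {'US': 0.5, 'CA': 1.0, 'FR': 1.0}
```
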

Inheritance is powerful, as we can build new suites by re-using existing ones. Here, we inherit the tests from an existing "parent" list and just add one more to create a new suite:

class ChildRecList(MyParentRecList):

    @rec_test(test_type='custom_test', display_type=CHART_TYPE.SCALAR)
    def my_test(self):
        """
        Custom test, returning my lucky number as an example
        """
        from random import randint

        return { "luck_number": randint(0, 100) }
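To make the pattern concrete, here is a toy re-implementation of "tests as decorated methods" that shows why a child suite automatically picks up the tests of its parent. This is a sketch of the general technique only: toy_test, ToySuite and collect_tests are hypothetical names, and nothing here is claimed to mirror how RecList implements @rec_test internally.

```python
def toy_test(test_type):
    """Mark a method as a test by attaching metadata to the function object."""
    def decorator(func):
        func.is_test = True
        func.test_type = test_type
        return func
    return decorator


class ToySuite:
    def collect_tests(self):
        """Find every marked method, including those inherited from parent suites."""
        tests = {}
        for name in dir(self):
            member = getattr(type(self), name, None)
            if callable(member) and getattr(member, "is_test", False):
                tests[member.test_type] = getattr(self, name)
        return tests

    def __call__(self):
        # Run each collected test and gather the results by test type.
        return {test_type: test() for test_type, test in self.collect_tests().items()}


class ParentSuite(ToySuite):
    @toy_test(test_type="constant")
    def constant(self):
        return 1


class ChildSuite(ParentSuite):
    @toy_test(test_type="custom_test")
    def my_test(self):
        return 42


print(ChildSuite()())
# {'constant': 1, 'custom_test': 42}
```

Because the marker lives on the function object, ordinary Python inheritance is enough to carry tests from a parent suite to a child suite.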

Any model can be tested, as no assumption is made about the model's structure: only predictions and ground truth need to be available. Once again, while our example leverages a DataFrame-shaped dataset for these entities, you are free to build your own RecList instance with any shape you prefer, provided you implement the metrics accordingly (see dummy.py for an example with different input types).

Once you run a suite of tests, results are dumped automatically and versioned in a folder (local or on S3), structured as follows (name of the suite, name of the model, run timestamp):

.reclist/
  myList/
    myModel/
      1637357392/
      1637357404/
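Given the layout above, a small helper can locate the most recent run for a suite and model. This is a sketch that assumes the run folders are named with numeric timestamps, as in the listing; latest_run is a hypothetical helper, not a RecList function, and the demo builds a throwaway folder tree instead of reading real results.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_run(root: Path, suite: str, model: str) -> Optional[Path]:
    """Return the run folder with the highest numeric name, or None if there are no runs."""
    model_dir = root / suite / model
    if not model_dir.is_dir():
        return None
    runs = [p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    return max(runs, key=lambda p: int(p.name), default=None)


# Demo on a temporary copy of the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".reclist"
    for timestamp in ("1637357392", "1637357404"):
        (root / "myList" / "myModel" / timestamp).mkdir(parents=True)
    newest_name = latest_run(root, "myList", "myModel").name

print(newest_name)
# 1637357404
```
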

If you use RecList as part of your standard testing, whether for research or production purposes, you can use the JSON report for machine-to-machine communication with downstream systems (e.g. you may want to automatically fail the pipeline if the tests do not pass).
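As a sketch of such a pipeline gate, the snippet below compares a report against minimum thresholds and lists the tests that fall short. The report schema (a "data" list of entries with "name" and "result"), the file name report.json, the metric names, and the thresholds are all assumptions made for illustration; check the files produced by your own run for the actual structure before adapting it.

```python
import json
import tempfile
from pathlib import Path


def failed_tests(report: dict, thresholds: dict) -> list:
    """Return names of tests that are missing from the report or below their minimum."""
    results = {entry["name"]: entry["result"] for entry in report["data"]}
    return [
        name
        for name, minimum in thresholds.items()
        if name not in results or results[name] < minimum
    ]


# Hypothetical report content, written to disk and read back as a pipeline would.
report = {"data": [{"name": "HIT_RATE", "result": 0.31}, {"name": "MRR", "result": 0.12}]}
with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "report.json"
    report_path.write_text(json.dumps(report))
    loaded = json.loads(report_path.read_text())

failures = failed_tests(loaded, {"HIT_RATE": 0.25, "MRR": 0.20, "COVERAGE": 0.50})
print(failures)
# ['MRR', 'COVERAGE']

# In a CI step you could then stop the pipeline, for example:
# if failures:
#     raise SystemExit(1)
```
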

Capabilities

RecList provides a dataset- and model-agnostic framework to scale up behavioral tests. We provide some suggested abstractions based on DataFrames to make existing tests and metrics fully re-usable, but we don't force any pattern on the user. Out of the box, the package provides:

  • tests and metrics to be used on your own datasets and models;
  • automated storage of results, with versioning, either in a local folder or on S3;
  • a flexible Python interface to declare tests-as-functions, and annotate them with display_type for automated charts;
  • pre-built connectors with popular experiment trackers (e.g. Neptune, Comet), and an extensible interface to add your own (see below);
  • reference implementations based on popular data challenges that used RecList: for an example of the "less wrong" latent space metric you can check the song2vec implementation here.

Using Third-Party Tracking Tools

RecList supports streaming the results of your tests directly to your cloud platform of choice, both as metrics and charts.

If you have the Python client installed, you can use the Neptune logger by simply specifying it at init time, and either passing NEPTUNE_KEY and NEPTUNE_PROJECT_NAME as kwargs, or setting them as environment variables.

cdf = DFSessionRecList(
    dataset=df_events,
    model_name="myDataFrameRandomModel",
    predictions=df_predictions,
    y_test=df_dataset,
    logger=LOGGER.NEPTUNE,
    metadata_store=METADATA_STORE.LOCAL,
    similarity_model=my_sim_model
)

cdf(verbose=True)

If you have the Python client installed, you can use the Comet logger by simply specifying it at init time, and either passing COMET_KEY, COMET_PROJECT_NAME, and COMET_WORKSPACE as kwargs, or setting them as environment variables.

cdf = DFSessionRecList(
    dataset=df_events,
    model_name="myDataFrameRandomModel",
    predictions=df_predictions,
    y_test=df_dataset,
    logger=LOGGER.COMET,
    metadata_store=METADATA_STORE.LOCAL,
    similarity_model=my_sim_model
)

cdf(verbose=True)
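If you go the environment-variable route, it can help to check that the variables are set before launching a long run. The helper below is a sketch: the variable names come from the text above, while REQUIRED_ENV and missing_env are hypothetical names that are not part of RecList.

```python
import os

# Variable names as listed in this section for each tracker.
REQUIRED_ENV = {
    "NEPTUNE": ("NEPTUNE_KEY", "NEPTUNE_PROJECT_NAME"),
    "COMET": ("COMET_KEY", "COMET_PROJECT_NAME", "COMET_WORKSPACE"),
}


def missing_env(tracker: str, environ=None) -> list:
    """Return the required variables that are unset or empty for the given tracker."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[tracker] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env("COMET", {"COMET_KEY": "my-key"}))
# ['COMET_PROJECT_NAME', 'COMET_WORKSPACE']
```

Calling missing_env("NEPTUNE") with no second argument checks the real process environment.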

If you wish to add a new platform, you can do so by simply implementing a new class inheriting from RecLogger.
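The general shape of such an extension is sketched below. To keep the example runnable on its own, it defines a stand-in abstract base class rather than importing RecLogger: the method names write and save_plot and their signatures are assumptions, so check the RecLogger source for the methods you actually need to implement.

```python
from abc import ABC, abstractmethod


class RecLoggerStandIn(ABC):
    """Hypothetical stand-in for RecLogger; the real interface may differ."""

    @abstractmethod
    def write(self, label, value):
        """Record a single metric value."""

    @abstractmethod
    def save_plot(self, name, fig):
        """Store a chart produced by a test."""


class InMemoryLogger(RecLoggerStandIn):
    """Toy logger that keeps everything in memory instead of calling a platform."""

    def __init__(self):
        self.metrics = {}
        self.plots = []

    def write(self, label, value):
        self.metrics[label] = value

    def save_plot(self, name, fig):
        # A real logger would upload the figure; here we only remember its name.
        self.plots.append(name)


logger = InMemoryLogger()
logger.write("HIT_RATE", 0.31)
logger.save_plot("hit_rate_by_country", fig=None)
print(logger.metrics, logger.plots)
```
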

Acknowledgments

The original authors are:

RecList is a community project made possible by the generous support of awesome folks. Between June and December 2022, the development of our beta was supported by Comet, Neptune, and Gantry. Our beta has been developed with the help of:

If you have questions or feedback, please reach out to: jacopo dot tagliabue at nyu dot edu.

License and Citation

All the code is released under the MIT license. If you found RecList useful, please cite our WWW paper:

@inproceedings{10.1145/3487553.3524215,
    author = {Chia, Patrick John and Tagliabue, Jacopo and Bianchi, Federico and He, Chloe and Ko, Brian},
    title = {Beyond NDCG: Behavioral Testing of Recommender Systems with RecList},
    year = {2022},
    isbn = {9781450391306},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3487553.3524215},
    doi = {10.1145/3487553.3524215},
    pages = {99–104},
    numpages = {6},
    keywords = {recommender systems, open source, behavioral testing},
    location = {Virtual Event, Lyon, France},
    series = {WWW '22 Companion}
}

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
