  • Language
    Python
  • License
    Apache License 2.0
  • Created about 1 year ago
  • Updated 3 months ago

Repository Details

Automated Evaluation of RAG Systems

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Paper: ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

To implement ARES for scoring your RAG system and comparing it to other RAG configurations, you need three components:

  • A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples, but several hundred examples are ideal.
  • A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system.
  • A much larger set of unlabeled query-document-answer triples produced by your RAG system for scoring.
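
As a quick sanity check on the first component, the sketch below counts how many rows of a validation TSV carry a label and compares that count with the 50-example minimum stated above. The helper names and the Query/Document/Answer column names are hypothetical; only Context_Relevance_Label appears in the commands in this README.

```python
import csv
import io

MIN_EXAMPLES = 50  # lower bound stated in the README


def count_labeled_rows(tsv_text, label_column):
    """Count rows whose label column is present and non-empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"column {label_column!r} not found in header")
    return sum(1 for row in reader if (row[label_column] or "").strip())


def has_enough_examples(tsv_text, label_column, minimum=MIN_EXAMPLES):
    """Return True when the validation set meets the minimum size."""
    return count_labeled_rows(tsv_text, label_column) >= minimum


# Tiny in-memory example; the first three column names are assumptions.
sample = (
    "Query\tDocument\tAnswer\tContext_Relevance_Label\n"
    "q1\td1\ta1\t1\n"
    "q2\td2\ta2\t\n"
)
print(count_labeled_rows(sample, "Context_Relevance_Label"))
```

For a real file, read its contents with open(path, encoding="utf-8").read() and pass the text to the same functions.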

The ARES training pipeline has three steps:

  1. Generate synthetic queries and answers from in-domain passages
  2. Prepare LLM judges for scoring RAG system by fine-tuning on synthetically-generated training data
  3. Deploy the prepared LLM judges to evaluate your RAG system across key performance metrics

Note: We also allow users to skip Steps #1 and #2 by deploying a zero/few-shot LLM-as-a-Judge.

Installation

To install the necessary dependencies, run the following commands:

conda create -n llm_judge python=3.10 --yes
conda activate llm_judge
pip install -r requirements.txt

Additionally, you will need to initialize an OpenAI API key with the following command:

export OPENAI_API_KEY=<your key here>
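
Before launching the scripts, it can help to confirm that the key is actually visible to Python. This is a minimal sketch; the function name is hypothetical and it only checks that the variable is set and non-empty, not that the key is valid.

```python
import os


def openai_key_is_set(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; run the export command above first.")
```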

Step #1: Synthetic Data Generation

To generate synthetic training data, use LLM-as-a-Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py. Replace items in the following command with your dataset and configuration:

python LLM-as-a-Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py \
       --document_filepath <document_filepath> \
       --few_shot_prompt_filename <few_shot_prompt_filename> \
       --synthetic_queries_filename <synthetic_queries_filename> \
       --documents_sampled 10000

Example:

python LLM-as-a-Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py \
       --document_filepath example_files/document_filepath.tsv \
       --few_shot_prompt_filename example_files/few_shot_prompt_filename.tsv \
       --synthetic_queries_filename output/synthetic_queries_1.tsv \
       --documents_sampled 10000

This script will output a filepath to the generated synthetic queries for the next step.

Note: For example files for document_filepath and few_shot_prompt_filename, please see example_files.
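
If you prefer to drive Step #1 from Python, for instance to loop over several document sets, the command can be assembled as an argument list. The sketch below uses only the flags shown above; the function name is hypothetical, and it assumes you run it from the repository root.

```python
import shlex

SCRIPT = "LLM-as-a-Judge_Adaptation/Generate_Synthetic_Queries_and_Answers.py"


def build_generation_command(document_filepath, few_shot_prompt_filename,
                             synthetic_queries_filename, documents_sampled=10000):
    """Build the Step #1 command as a list suitable for subprocess.run."""
    return [
        "python", SCRIPT,
        "--document_filepath", document_filepath,
        "--few_shot_prompt_filename", few_shot_prompt_filename,
        "--synthetic_queries_filename", synthetic_queries_filename,
        "--documents_sampled", str(documents_sampled),
    ]


cmd = build_generation_command(
    "example_files/document_filepath.tsv",
    "example_files/few_shot_prompt_filename.tsv",
    "output/synthetic_queries_1.tsv",
)
print(shlex.join(cmd))  # shows the equivalent shell command
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file paths contain spaces.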

Step #2: Fine-tune LLM-as-a-Judge

With the generated file under synthetic_queries_filename from the previous step, use LLM-as-a-Judge_Adaptation/General_Binary_Classifier.py to train your LLM-as-a-Judge with the following command:

python General_Binary_Classifier.py \
       --classification_dataset <synthetic queries file> \
       --test_set_selection <test_set_selection> \
       --label_column Context_Relevance_Label \
       --num_epochs 10 \
       --patience_value 3 \
       --learning_rate 5e-6

For classification_dataset, put the filepath of the synthetic queries generated in the previous step. For test_set_selection, put the filepath of the human-annotated examples of your dataset; it should be formatted like the file example_files/evaluation_datasets.tsv.

This script will output a model checkpoint path for the next step.
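
The --num_epochs and --patience_value flags suggest patience-based early stopping. The sketch below shows how that technique commonly works: training stops once the validation loss has failed to improve for a given number of consecutive epochs. Whether General_Binary_Classifier.py monitors validation loss in exactly this way is an assumption; the function name and the loss values are made up for illustration.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before patience-based early stopping.

    val_losses holds one validation loss per epoch, in order. Training
    stops after `patience` consecutive epochs without a new best loss.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Illustrative losses for a 10-epoch budget with patience 3.
losses = [0.90, 0.80, 0.85, 0.82, 0.81, 0.70, 0.60, 0.50, 0.40, 0.30]
print(epochs_run(losses, patience=3))
```

With these illustrative values, the loss stops improving after epoch 2, so a patience of 3 ends training at epoch 5 even though later epochs would have improved.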

Step #3: Score RAG System with ARES

With the model checkpoint produced in Step #2, you can now score your RAG system's configurations using ARES with the following command in the folder RAG_Automatic_Evaluation/:

python LLMJudge_RAG_Compared_Scoring.py \
       --alpha 0.05 \
       --num_trials 1000 \
       --evaluation_datasets <evaluation_datasets as list> \
       --few_shot_examples_filepath <few_shot_examples_filepath> \
       --checkpoints <checkpoints as list> \
       --labels <label columns as list> \
       --GPT_scoring <True or False> \
       --gold_label_path <gold_label_path> \
       --swap_human_labels_for_gpt_labels False

For evaluation_datasets, we expect a list of filepaths to query-passage-answer TSVs for each RAG configuration you wish to score.

If you want to use few-shot GPT scoring, switch GPT_scoring to True. You can leave the checkpoints list blank and specify the GPT model with the flag --gpt_model <model selected>.

Note: For example files of evaluation_datasets and gold_label_path, please see example_files/evaluation_datasets.tsv for formatting.
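
The --alpha and --gold_label_path flags indicate that scores are reported with a confidence level and calibrated against human labels. The ARES paper describes using prediction-powered inference (PPI) for this. The sketch below is a generic PPI mean estimate with a normal-approximation interval, not the code in LLMJudge_RAG_Compared_Scoring.py; the function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def ppi_mean_interval(judge_unlabeled, judge_labeled, human_labels, alpha=0.05):
    """Generic PPI confidence interval for a mean (e.g. a success rate).

    judge_unlabeled: judge predictions on the large unlabeled set
    judge_labeled:   judge predictions on the human-annotated set
    human_labels:    human labels for the same annotated examples
    """
    if len(judge_labeled) != len(human_labels):
        raise ValueError("judge_labeled and human_labels must align")
    # Rectifier: how far the judge is from the human labels on average.
    rectifier = [y - f for y, f in zip(human_labels, judge_labeled)]
    estimate = fmean(judge_unlabeled) + fmean(rectifier)
    std_err = sqrt(
        variance(judge_unlabeled) / len(judge_unlabeled)
        + variance(rectifier) / len(rectifier)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err


# Toy data only; real runs use hundreds of labels and thousands of triples.
low, high = ppi_mean_interval(
    judge_unlabeled=[1.0, 0.0, 1.0, 1.0],
    judge_labeled=[1.0, 0.0, 1.0, 0.0],
    human_labels=[1.0, 0.0, 1.0, 1.0],
)
print(low, high)
```

A smaller alpha widens the interval; more human labels shrink the rectifier term, which is why several hundred annotated examples are preferable to the minimum of 50.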

Results Replication

We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to [email protected] if you have any further questions.

Citation

To cite our work, please use the following BibTeX:

@misc{saadfalcon2023ares,
      title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems}, 
      author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
      year={2023},
      eprint={2311.09476},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Appendix

Machine requirements and setup when not using OpenAI API

Machine requirements

  • Over ~100 GB of available disk space
  • GPU
    • Should work: A100 (e.g. Standard_NC24ads_A100_v4 on Azure)
    • Does not work:
      • Tested on 2023-12-17 with both Standard_NC6s_v3 and Standard_NC12s_v3, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)

Machine setup

For example, on an Azure VM running Linux (Ubuntu 20.04), you will need to do the following:

  • Install conda
    • First set of commands (can copy-paste multiple lines)
      • wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      • chmod +x Miniconda3-latest-Linux-x86_64.sh
      • ./Miniconda3-latest-Linux-x86_64.sh -b
    • Second set of commands (can copy-paste multiple lines)
      • export PATH="$HOME/miniconda3/bin:$PATH"
      • conda init
  • Install gcc
    • sudo apt-get -y update
    • sudo apt-get -y upgrade
    • sudo apt-get -y install build-essential
    • sudo apt-get -y install libpcre3-dev
  • Install NVIDIA drivers
    • sudo apt install ubuntu-drivers-common -y
    • sudo ubuntu-drivers autoinstall
    • sudo reboot
    • SSH in again and confirm the installation was successful by running nvidia-smi
  • cd to ARES folder and follow the rest of the README
