  • Stars: 298
  • Rank: 136,532 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created: about 1 year ago
  • Updated: 4 months ago

Repository Details

This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.

HaluEval: A Hallucination Evaluation Benchmark for LLMs

This is the repo for our paper "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models". The repo contains the released hallucination data, the scripts for generating hallucinated samples, and the code for evaluation and analysis.

Overview

HaluEval includes 5,000 general user queries with ChatGPT responses and 30,000 task-specific examples from three tasks, i.e., question answering, knowledge-grounded dialogue, and text summarization.

For general user queries, we adopt the 52K instruction tuning dataset from Alpaca. To screen for user queries on which LLMs are most likely to hallucinate, we use ChatGPT to sample three responses per query and retain only the queries whose responses have low mutual similarity for human labeling.
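
As a rough sketch of this screening step, the snippet below keeps a query only when its three sampled responses disagree; the encoder, similarity measure, and threshold are illustrative assumptions, not the exact setup used in the paper.

import itertools

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed encoder for measuring response similarity; the paper's actual metric may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def max_pairwise_similarity(responses):
    """Highest cosine similarity among all pairs of sampled responses."""
    embeddings = encoder.encode(responses, normalize_embeddings=True)
    return max(float(np.dot(a, b)) for a, b in itertools.combinations(embeddings, 2))

def screen_queries(queries_with_responses, threshold=0.8):
    """Keep queries whose three ChatGPT responses have low mutual similarity."""
    return [(query, responses)
            for query, responses in queries_with_responses
            if max_pairwise_similarity(responses) < threshold]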

Furthermore, for the task-specific examples in HaluEval, we design an automatic approach to generate hallucinated samples. First, using existing task datasets (e.g., HotpotQA) as seed data, we design task-specific instructions for ChatGPT to generate hallucinated samples with two methods, i.e., one-pass and conversational. Second, to select the most plausible and difficult hallucinated sample for LLM evaluation, we design a filtering instruction enhanced with ground-truth examples and leverage ChatGPT for sample selection.

Data Release

The directory data contains 35K generated and human-annotated hallucinated samples we used in our experiments. There are four JSON files as follows:

  • qa_data.json: 10K hallucinated samples for QA with HotpotQA as seed data. In each sample dictionary, the fields knowledge, question, and right_answer are the Wikipedia knowledge, the question text, and the ground-truth answer collected from HotpotQA; the field hallucinated_answer is the corresponding generated hallucinated answer.
  • dialogue_data.json: 10K hallucinated samples for dialogue with OpenDialKG as seed data. In each sample dictionary, the fields knowledge, dialogue_history, and right_response are the Wikipedia knowledge, the dialogue history, and the ground-truth response collected from OpenDialKG; the field hallucinated_response is the corresponding generated hallucinated response.
  • summarization_data.json: 10K hallucinated samples for summarization with CNN/Daily Mail as seed data. In each sample dictionary, the fields document and right_summary are the document and the ground-truth summary collected from CNN/Daily Mail; the field hallucinated_summary is the corresponding generated hallucinated summary.
  • general_data.json: 5K human-annotated samples of ChatGPT responses to general user queries from Alpaca. In each sample dictionary, the fields user_query, chatgpt_response, and hallucination_label are the posed user query, the ChatGPT response, and the hallucination label (Yes/No) annotated by humans.

Based on these data, you can evaluate the ability of LLMs to recognize hallucinations and analyze what types of content or topics LLMs tend to hallucinate about (or fail to recognize as hallucinated).
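
For instance, here is a minimal sketch for loading and inspecting one of the released files; whether each file is stored as a single JSON array or as JSON lines is an assumption, so both cases are handled.

import json

def load_samples(path):
    """Load HaluEval samples from a JSON array or JSON-lines file
    (the exact on-disk layout is assumed, so both are handled)."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().strip()
    try:
        return json.loads(text)  # whole-file JSON array
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line]  # JSON lines

qa_samples = load_samples("data/qa_data.json")
print(len(qa_samples))  # expected: 10000
sample = qa_samples[0]
print(sample["question"])
print(sample["right_answer"], "|", sample["hallucinated_answer"])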

Data Generation Process

We execute the data generation pipeline with ChatGPT in the following steps:

  • First, we download the training sets of HotpotQA, OpenDialKG, and CNN/Daily Mail.
cd generation
wget http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json
wget https://raw.githubusercontent.com/facebookresearch/opendialkg/main/data/opendialkg.csv
wget https://huggingface.co/datasets/ccdv/cnn_dailymail/resolve/main/cnn_stories.tgz
  • Second, we sample 10K instances from the seed data and generate their hallucinated counterparts by specifying the task and the sampling strategy.
    • seed_data: the downloaded training sets of HotpotQA, OpenDialKG, and CNN/Daily Mail.
    • task: sampled tasks, i.e., qa, dialogue, or summarization.
    • strategy: sampling strategy, i.e., one-turn or multi-turn (called one-pass and conversational in our paper).
python generate.py --seed_data hotpot_train_v1.1.json --task qa --strategy one-turn
  • Finally, we select the most plausible and difficult hallucinated sample from these two sampling methods. The final selected samples will be stored in the data directory.
    • task: filtered task, i.e., qa, dialogue, or summarization.
python filtering.py --task qa

Users can apply our provided instructions and code to their own datasets to generate hallucinated samples; a rough sketch of the one-pass generation call is given below.
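
The sketch assumes the OpenAI chat API; the instruction text, prompt format, and function name are placeholders rather than the exact prompts shipped in generation/.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder instruction; the real task-specific instructions live in generation/.
QA_ONE_PASS_INSTRUCTION = (
    "You are given some knowledge, a question, and its correct answer. "
    "Write a plausible but hallucinated answer that conflicts with the knowledge."
)

def generate_hallucinated_answer(knowledge, question, right_answer):
    """One-pass strategy: a single chat call produces the hallucinated answer."""
    user_prompt = (
        f"#Knowledge#: {knowledge}\n"
        f"#Question#: {question}\n"
        f"#Right Answer#: {right_answer}\n"
        f"#Hallucinated Answer#:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": QA_ONE_PASS_INSTRUCTION},
            {"role": "user", "content": user_prompt},
        ],
        temperature=1.0,
    )
    return response.choices[0].message.content.strip()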

Evaluation

In evaluation, we randomly sample either the ground-truth or the hallucinated output for each instance. For example, if the sampled text is a hallucinated answer, the LLM should recognize the hallucination and output "Yes", meaning the text contains hallucinations; if it is a ground-truth answer, the LLM should output "No", indicating that there is no hallucination. A minimal sketch of this procedure follows the command below.

  • task: evaluated task, i.e., qa, dialogue, or summarization.
  • model: evaluated model, e.g., ChatGPT (gpt-3.5-turbo), GPT-3 (davinci).
cd evaluation
python evaluate.py --task qa --model gpt-3.5-turbo
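
The sketch below illustrates the sampling-and-scoring loop for the QA task; the judge callable stands in for the actual ChatGPT prompt used by evaluate.py and is an assumption here.

import random

def build_eval_instance(sample):
    """Randomly show either the ground-truth or the hallucinated answer.
    The expected verdict is "Yes" when the shown text is hallucinated, else "No"."""
    if random.random() < 0.5:
        return sample["hallucinated_answer"], "Yes"
    return sample["right_answer"], "No"

def recognition_accuracy(samples, judge):
    """judge(knowledge, question, answer) -> "Yes" or "No", e.g. a ChatGPT call."""
    correct = 0
    for sample in samples:
        answer, expected = build_eval_instance(sample)
        verdict = judge(sample["knowledge"], sample["question"], answer)
        correct += int(verdict.strip().startswith(expected))
    return correct / len(samples)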

Analysis

Based on the samples whose hallucinations LLMs succeed or fail to recognize, we can analyze the topics of these samples using LDA (see the sketch after the command below).

  • task: analyzed task, i.e., qa, dialogue, or summarization.
  • result: the file of recognition results at the evaluation stage.
  • category: all (all task samples) or failed (task samples whose hallucinations LLMs fail to recognize).
cd analysis
python analyze.py --task qa --result ../evaluation/qa/qa_gpt-3.5-turbo_result.json --category all
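
A minimal sketch of such a topic analysis with scikit-learn's LDA; the vectorizer settings and the number of topics are illustrative and may differ from what analyze.py uses.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_topic_words(texts, n_topics=10, n_top_words=8):
    """Fit LDA on the sample texts and return the top words of each topic."""
    vectorizer = CountVectorizer(stop_words="english", max_features=5000)
    counts = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[::-1][:n_top_words]]
            for topic in lda.components_]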

Reference

Please cite our paper if you use the data or code in this repo.

@misc{HaluEval,
  author = {Junyi Li and Xiaoxue Cheng and Wayne Xin Zhao and Jian-Yun Nie and Ji-Rong Wen},
  title = {HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models},
  journal = {arXiv preprint arXiv:2305.11747},
  year = {2023},
  url = {https://arxiv.org/abs/2305.11747}
}

More Repositories

1. LLMSurvey: The official GitHub page for the survey paper "A Survey of Large Language Models". (Python, 9,332 stars)
2. RecBole: A unified, comprehensive and efficient recommendation library. (Python, 3,243 stars)
3. TextBox: TextBox 2.0 is a text generation library with pre-trained language models. (Python, 1,065 stars)
4. Awesome-RSPapers: Recommender System Papers. (902 stars)
5. RecSysDatasets: A repository of public data sources for Recommender Systems (RS). (Python, 739 stars)
6. CRSLab: An open-source toolkit for building Conversational Recommender Systems (CRS). (Python, 474 stars)
7. LLMBox: A comprehensive library for implementing LLMs, including a unified training pipeline and comprehensive model evaluation. (Python, 410 stars)
8. Top-conference-paper-list: A collection of classified and organized top-conference paper lists. (362 stars)
9. LLMRank: [ECIR'24] Implementation of "Large Language Models are Zero-Shot Rankers for Recommender Systems". (Python, 205 stars)
10. Negative-Sampling-Paper: A collection of 100 papers related to negative sampling methods. (173 stars)
11. DenseRetrieval (170 stars)
12. RecBole2.0: An up-to-date, comprehensive and flexible recommendation library. (167 stars)
13. RecBole-GNN: Efficient and extensible GNN-enhanced recommender library based on RecBole. (Python, 159 stars)
14. UniSRec: [KDD'22] Official PyTorch implementation for "Towards Universal Sequence Representation Learning for Recommender Systems". (Python, 158 stars)
15. NCL: [WWW'22] Official PyTorch implementation for "Improving Graph Collaborative Filtering with Neighborhood-enriched Contrastive Learning". (Python, 113 stars)
16. RSPapers: Must-read papers on Recommender Systems (a curated collection of 40 papers, continuously updated). (89 stars)
17. RecBole-CDR: A library built upon RecBole for cross-domain recommendation algorithms. (Python, 78 stars)
18. MVP: The official implementation of our paper "MVP: Multi-task Supervised Pre-training for Natural Language Generation". (67 stars)
19. VQ-Rec: [WWW'23] PyTorch implementation for "Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders". (Python, 51 stars)
20. RecBole-PJF (Python, 46 stars)
21. ChatCoT: The official repository of "ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models". (Python, 41 stars)
22. CORE: [SIGIR'22] Official PyTorch implementation for "CORE: Simple and Effective Session-based Recommendation within Consistent Representation Space". (Python, 38 stars)
23. Multi-View-Co-Teaching: Code for our CIKM 2020 paper "Learning to Match Jobs with Resumes from Sparse Interaction Data using Multi-View Co-Teaching Network". (Python, 29 stars)
24. JiuZhang: Our code will be public soon. (Python, 25 stars)
25. ELMER: The official implementation of our EMNLP 2022 paper "ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation". (Python, 24 stars)
26. BAMBOO (Python, 23 stars)
27. Language-Specific-Neurons (Python, 17 stars)
28. RecBole-DA (Python, 17 stars)
29. CARP (Python, 16 stars)
30. SAFE: The PyTorch implementation of the SAFE model presented in NAACL Findings 2022. (Python, 16 stars)
31. RecBole-TRM (Python, 13 stars)
32. Erya (12 stars)
33. MML (Python, 12 stars)
34. Context-Tuning: The repository for the COLING 2022 paper "Context-Tuning: Learning Contextualized Prompts for Natural Language Generation". (11 stars)
35. UniWeb: The official repository for our ACL 2023 Findings paper "The Web Can Be Your Oyster for Improving Language Models". (9 stars)
36. PPGM: [ICDM'22] PyTorch implementation for "Privacy-Preserved Neural Graph Similarity Learning". (Python, 6 stars)
37. LIVE: The official repository for our ACL 2023 paper "Learning to Imagine: Visually-Augmented Natural Language Generation". (Python, 5 stars)
38. Social-Datasets: A collection of social datasets for RecBole-GNN. (5 stars)
39. M3SRec (4 stars)
40. FIGA (Python, 3 stars)
41. Contrastive-Curriculum-Learning (Python, 3 stars)
42. Data-CUBE (3 stars)
43. Div-Ref: The official repository of "Not All Metrics Are Guilty: Improving NLG Evaluation Diversifying References". (Python, 2 stars)
44. GenRec (Python, 1 star)
45. ETRec (Python, 1 star)