  • Stars: 478
  • Rank 89,311 (Top 2 %)
  • Language: Python
  • License: Apache License 2.0
  • Created about 1 year ago
  • Updated 3 months ago

Repository Details

This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.

🐫 🍮 📚 InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Paper | Model | Leaderboard

📣 Red-Eval, a benchmark for the safety evaluation of LLMs, has been added: Red-Eval

📣 Introducing Red-Eval, which evaluates the safety of LLMs using several jailbreaking prompts. With Red-Eval, GPT-4 could be jailbroken (red-teamed) with a 65.1% attack success rate and ChatGPT 73% of the time, as measured on the DangerousQA and HarmfulQA benchmarks. More details are here: Code and Paper.

📣 We developed Flacuna by fine-tuning Vicuna-13B on the Flan collection. Flacuna is better than Vicuna at problem-solving. Access the model at https://huggingface.co/declare-lab/flacuna-13b-v1.0.

📣 The InstructEval benchmark and leaderboard have been released.

📣 The paper evaluating instruction-tuned LLMs on the InstructEval benchmark suite has been released on arXiv. Read it here: https://arxiv.org/pdf/2306.04757.pdf

📣 We are releasing IMPACT, a dataset for evaluating the writing capability of LLMs in four aspects: Informative, Professional, Argumentative, and Creative. Download it from Huggingface: https://huggingface.co/datasets/declare-lab/InstructEvalImpact.

📣 FLAN-T5 is also useful in text-to-audio generation. Find our work at https://github.com/declare-lab/tango if you are interested.

This repository contains code to evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks. We aim to facilitate simple and convenient benchmarking across multiple tasks and models.

Why?

Instruction-tuned models such as Flan-T5 and Alpaca represent an exciting direction for approximating the performance of large language models (LLMs) like ChatGPT at lower cost. However, it is challenging to compare different models qualitatively. To evaluate how well models generalize across a wide range of unseen and challenging tasks, we can use academic benchmarks such as MMLU and BBH. Compared to existing libraries such as evaluation-harness and HELM, this repo enables simple and convenient evaluation of multiple models. Notably, we support most models from HuggingFace Transformers 🤗 (check here for a list of models we support).

Results

For detailed results, please see our leaderboard.

| Model Name | Model Path | Paper | Size | MMLU | BBH | DROP | HumanEval |
|------------|------------|-------|------|------|-----|------|-----------|
|            | GPT-4 | Link | ? | 86.4 |      | 80.9 | 67.0 |
|            | ChatGPT | Link | ? | 70.0 |      | 64.1 | 48.1 |
| seq_to_seq | google/flan-t5-xxl | Link | 11B | 54.5 | 43.9 |      |      |
| seq_to_seq | google/flan-t5-xl | Link | 3B | 49.2 | 40.2 | 56.3 |      |
| llama | eachadea/vicuna-13b | Link | 13B | 49.7 | 37.1 | 32.9 | 15.2 |
| llama | decapoda-research/llama-13b-hf | Link | 13B | 46.2 | 37.1 | 35.3 | 13.4 |
| seq_to_seq | declare-lab/flan-alpaca-gpt4-xl | Link | 3B | 45.6 | 34.8 |      |      |
| llama | TheBloke/koala-13B-HF | Link | 13B | 44.6 | 34.6 | 28.3 | 11.0 |
| llama | chavinlo/alpaca-native | Link | 7B | 41.6 | 33.3 | 26.3 | 10.3 |
| llama | TheBloke/wizardLM-7B-HF | Link | 7B | 36.4 | 32.9 |      | 15.2 |
| chatglm | THUDM/chatglm-6b | Link | 6B | 36.1 | 31.3 | 44.2 | 3.1 |
| llama | decapoda-research/llama-7b-hf | Link | 7B | 35.2 | 30.9 | 27.6 | 10.3 |
| llama | wombat-7b-gpt4-delta | Link | 7B | 33.0 | 32.4 |      | 7.9 |
| seq_to_seq | bigscience/mt0-xl | Link | 3B | 30.4 |      |      |      |
| causal | facebook/opt-iml-max-1.3b | Link | 1B | 27.5 |      |      | 1.8 |
| causal | OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | Link | 12B | 27.0 | 30.0 |      | 9.1 |
| causal | stabilityai/stablelm-base-alpha-7b | Link | 7B | 26.2 |      |      | 1.8 |
| causal | databricks/dolly-v2-12b | Link | 12B | 25.7 |      |      | 7.9 |
| causal | Salesforce/codegen-6B-mono | Link | 6B |      |      |      | 27.4 |
| causal | togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1 | Link | 7B | 38.1 | 31.3 | 24.7 | 5.5 |

Example Usage

Evaluate on Massive Multitask Language Understanding (MMLU), which includes exam questions from 57 tasks such as mathematics, history, law, and medicine. We use 5-shot direct prompting and measure the exact-match score.

python main.py mmlu --model_name llama --model_path chavinlo/alpaca-native
# 0.4163936761145136

python main.py mmlu --model_name seq_to_seq --model_path google/flan-t5-xl 
# 0.49252243270189433
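
As a rough illustration of what "5-shot direct prompting" means here, the sketch below prepends five solved multiple-choice examples to an unsolved question and leaves the final answer blank for the model to fill in. The class name, function names, and the exact prompt layout are assumptions made for illustration; the repository's own prompt template may differ.

```python
from dataclasses import dataclass
from typing import List

CHOICE_LABELS = ["A", "B", "C", "D"]


@dataclass
class MultipleChoiceExample:
    # Hypothetical container for one MMLU-style item (not the repository's class).
    question: str
    choices: List[str]  # four answer options
    answer: str  # gold label, one of "A".."D"


def format_example(example: MultipleChoiceExample, include_answer: bool) -> str:
    """Render one question; the answer is left blank for the item to be solved."""
    lines = [example.question]
    for label, choice in zip(CHOICE_LABELS, example.choices):
        lines.append(f"{label}. {choice}")
    lines.append(f"Answer: {example.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(
    shots: List[MultipleChoiceExample],
    target: MultipleChoiceExample,
    k: int = 5,
) -> str:
    """Prepend the first k solved examples to the unsolved target question."""
    blocks = [format_example(shot, include_answer=True) for shot in shots[:k]]
    blocks.append(format_example(target, include_answer=False))
    return "\n\n".join(blocks)
```

"Direct" prompting in this sense asks the model for the answer label itself, without an intermediate chain of reasoning.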

Evaluate on Big Bench Hard (BBH), which includes 23 challenging tasks on which PaLM (540B) performs below an average human rater. We use 3-shot direct prompting and measure the exact-match score.

python main.py bbh --model_name llama --model_path TheBloke/koala-13B-HF --load_8bit
# 0.3468942926723247

Evaluate on DROP, a question answering benchmark that requires discrete, math-style reasoning over paragraphs. We use 3-shot direct prompting and measure the exact-match score.

python main.py drop --model_name seq_to_seq --model_path google/flan-t5-xl 
# 0.5632458233890215
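
The MMLU, BBH, and DROP commands above all report an exact-match score. A minimal sketch of that metric follows, assuming that the only normalization is trimming surrounding whitespace; the function names are hypothetical and the repository may normalize predictions differently.

```python
from typing import Sequence


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer agree after trimming whitespace.

    Trimming is an assumed normalization, chosen only for illustration.
    """
    return prediction.strip() == gold.strip()


def exact_match_score(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of examples whose prediction exactly matches the gold answer."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

The result is a fraction between 0 and 1, in the same form as the values printed after each command.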

Evaluate on HumanEval, which includes 164 coding questions in Python. We use 0-shot direct prompting and measure the pass@1 score.

python main.py humaneval --model_name llama --model_path eachadea/vicuna-13b --n_sample 1 --load_8bit
# {'pass@1': 0.1524390243902439}
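
For readers unfamiliar with pass@k, the sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass its unit tests. The function names are hypothetical, and whether the repository computes the score this way internally is an assumption. With one sample per problem (`--n_sample 1`), pass@1 reduces to the fraction of problems solved.

```python
import math
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# One sample per problem: 25 solved out of 164 gives 25 / 164, about 0.1524.
results = [(1, 1)] * 25 + [(1, 0)] * 139
score = mean_pass_at_k(results, k=1)
```

The reported value of 0.1524 is consistent with 25 of the 164 problems being solved, since 25 / 164 is approximately 0.1524.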

Setup

Install dependencies and download data.

conda create -n instruct-eval python=3.8 -y
conda activate instruct-eval
pip install -r requirements.txt
mkdir -p data
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar -O data/mmlu.tar
tar -xf data/mmlu.tar -C data && mv data/data data/mmlu
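
After these steps, the MMLU files should sit under `data/mmlu`. The sketch below reads one subject, assuming the archive's usual layout of `dev`, `val`, and `test` folders containing header-less CSV files named `{subject}_{split}.csv` with six columns (question, four choices, answer label). The function name is hypothetical and is not part of this repository.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union


def load_mmlu_subject(
    data_dir: Union[str, Path], subject: str, split: str = "test"
) -> List[Dict[str, object]]:
    """Read one subject's questions from <data_dir>/<split>/<subject>_<split>.csv.

    Assumes header-less rows of the form: question, A, B, C, D, answer.
    """
    path = Path(data_dir) / split / f"{subject}_{split}.csv"
    rows = []
    with path.open(newline="", encoding="utf-8") as handle:
        for question, a, b, c, d, answer in csv.reader(handle):
            rows.append(
                {"question": question, "choices": [a, b, c, d], "answer": answer}
            )
    return rows
```

For example, `load_mmlu_subject("data/mmlu", "abstract_algebra")` would return a list of question dictionaries if the layout matches the assumption above.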
