  • Stars: 143
  • Rank: 255,485 (Top 6%)
  • Language: Python
  • License: MIT License
  • Created: over 1 year ago
  • Updated: about 1 year ago

TheoremQA

The dataset and code for the paper "TheoremQA: A Theorem-driven Question Answering dataset" (https://arxiv.org/abs/2305.12524).

Introduction

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems across Math, EE&CS, Physics, and Finance. The dataset was collected by human experts and is of very high quality. We provide it as a new benchmark to test the limits of large language models in applying theorems to solve challenging university-level questions. Below, we provide a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha.

The dataset covers a wide range of topics across these fields.

Examples

Huggingface

Our dataset is now available on Hugging Face: https://huggingface.co/datasets/wenhu/TheoremQA

from datasets import load_dataset
dataset = load_dataset("wenhu/TheoremQA")

Files

  • theoremqa_test.json: all the annotated question-answer pairs.
  • theoremqa_visual_subset_test.json: the subset of visual questions, for testing on those specifically.
  • all_theorems.json: textual descriptions of all the theorems covered.
  • error_analysis/*: the error analysis results on the 180-question subset.
  • solutions/*: solutions for roughly 180 questions, corresponding to the problems used in error_analysis/.
  • outputs/*.json.corrected: all the model outputs.
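
As a rough illustration of inspecting a file such as theoremqa_test.json, the sketch below counts entries per answer type. The field names (`Question`, `Answer`, `Answer_type`), the helper `count_by_field`, and the sample records are assumptions made for illustration; check the actual file for its real schema.

```python
import json
from collections import Counter


def count_by_field(records, field):
    """Count how many records share each value of `field`."""
    return Counter(str(r.get(field, "<missing>")) for r in records)


# Hypothetical records standing in for the contents of theoremqa_test.json.
# The keys used here are assumptions, not the confirmed schema.
sample_json = """
[
  {"Question": "What is 2 + 2?", "Answer": 4, "Answer_type": "integer"},
  {"Question": "Is 7 prime?", "Answer": true, "Answer_type": "bool"},
  {"Question": "What is 10 / 4?", "Answer": 2.5, "Answer_type": "float"},
  {"Question": "What is 3 * 3?", "Answer": 9, "Answer_type": "integer"}
]
"""

records = json.loads(sample_json)
counts = count_by_field(records, "Answer_type")
for answer_type, n in counts.most_common():
    print(f"{answer_type}: {n}")
```

To run this on the real data, the sample string could be replaced by the result of `json.load` on the opened file.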

Visualize the GPT-4 output at https://github.com/wenhuchen/TheoremQA/blob/main/visualize.ipynb.

Running Instructions

Dependencies

  • openai == 0.27.6
  • wolframalpha == 5.0.0
  • pytorch == py3.8_cuda11.8_cudnn8.7.0_0
  • sympy == 1.11.1
  • transformers == 4.29.1
  • accelerate == 0.19.0
  • anthropic == 0.2.9
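
One way to check an environment against these pins is sketched below, using the standard-library `importlib.metadata`. The helper `check_pins` is hypothetical and not part of the repository. The PyTorch entry is left out because it appears to be a conda build string rather than a plain version number.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list. The PyTorch entry is omitted because
# it is given as a build string rather than a plain version.
PINNED = {
    "openai": "0.27.6",
    "wolframalpha": "5.0.0",
    "sympy": "1.11.1",
    "transformers": "4.29.1",
    "accelerate": "0.19.0",
    "anthropic": "0.2.9",
}


def check_pins(pins):
    """Return {package: status}; status is 'ok', 'missing', or 'found X, want Y'."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"found {found}, want {wanted}"
    return report


for name, status in check_pins(PINNED).items():
    print(f"{name}: {status}")
```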

Chain-of-Thought Prompting

python run_gpt4.py

This will write output to outputs/GPT4_s0...

Program-of-Thoughts Prompting

python run_gpt4_pot.py

This will write output to outputs/GPT4_PoT_s0...
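
Program-of-Thoughts prompting asks the model to write a program whose execution yields the answer, instead of reasoning entirely in text. The sketch below illustrates only that execution step. The entry-point name `solve`, the helper `run_generated_program`, and the hard-coded program are all hypothetical; this is not the logic of run_gpt4_pot.py.

```python
def run_generated_program(program: str, entry_point: str = "solve"):
    """Execute generated source code and return the result of its entry function.

    Only run code you trust: exec() gives the program full interpreter access.
    """
    namespace = {}
    exec(program, namespace)
    return namespace[entry_point]()


# Hard-coded stand-in for a model response (hypothetical, not real GPT-4 output).
generated = '''
def solve():
    # Sum of the first n positive integers: n * (n + 1) / 2
    n = 100
    return n * (n + 1) // 2
'''

answer = run_generated_program(generated)
print(answer)
```

In practice, generated code would normally be run in a sandbox with a timeout rather than with a bare `exec`.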

Evaluate model output

You need to register a Wolfram|Alpha account to use their free API; see https://products.wolframalpha.com/api to register. Once registered, you should receive an API key.

export OPENAI_KEY=[YOUR_KEY]
export WOLFRAM_KEY=[YOUR_KEY]
python predict_accuracy.py outputs/[YOUR_FILE]

This will write the evaluation output to outputs/[YOUR_FILE].corrected
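
Before running the evaluation, it can help to confirm that both keys are exported. The following is a minimal sketch; the helper `missing_keys` is hypothetical, and only the variable names `OPENAI_KEY` and `WOLFRAM_KEY` come from the commands above.

```python
import os

REQUIRED_KEYS = ("OPENAI_KEY", "WOLFRAM_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```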

Analyze the model output

python analyze_results.py outputs/[YOUR_FILE].corrected

Leaderboard

Model | Method | Accuracy (%)
--- | --- | ---
GPT-4 | PoT | 52.4
GPT-4 | CoT | 43.8
ChatGPT | PoT | 35.6
PaLM-2 (unicorn) | CoT | 31.8
ChatGPT | CoT | 30.2
GPT-3.5 (text-davinci-003) | PoT | 27.8
Claude-v1 | PoT | 25.9
Claude-v1 | CoT | 24.9
Claude-v2 | CoT | 24.6
Claude-instant | CoT | 23.6
Codex (code-davinci-002) | PoT | 23.9
GPT-3.5 (text-davinci-003) | CoT | 22.8
PaLM-2 (bison) | CoT | 21.0
GPT-3 (text-davinci-002) | PoT | 20.6
GPT-3 (text-davinci-002) | CoT | 16.6
Alpaca | CoT | 13.5
Vicuna | CoT | 12.9
MOSS | CoT | 12.2
StarChat | PoT | 12.2
InstructCodeT5+ | PoT | 11.6
OpenAssistant | CoT | 10.7
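
For the models that report both methods, the difference between PoT and CoT accuracy can be computed directly from the leaderboard numbers. The sketch below does that arithmetic; the helper `pot_gain` is hypothetical, and the accuracies are copied from the leaderboard.

```python
# (model, method) -> accuracy, copied from the leaderboard for models
# that report both CoT and PoT results.
ACCURACY = {
    ("GPT-4", "PoT"): 52.4,
    ("GPT-4", "CoT"): 43.8,
    ("ChatGPT", "PoT"): 35.6,
    ("ChatGPT", "CoT"): 30.2,
    ("GPT-3.5 (text-davinci-003)", "PoT"): 27.8,
    ("GPT-3.5 (text-davinci-003)", "CoT"): 22.8,
    ("Claude-v1", "PoT"): 25.9,
    ("Claude-v1", "CoT"): 24.9,
    ("GPT-3 (text-davinci-002)", "PoT"): 20.6,
    ("GPT-3 (text-davinci-002)", "CoT"): 16.6,
}


def pot_gain(scores):
    """Return {model: PoT accuracy minus CoT accuracy}, rounded to one decimal."""
    models = {model for model, _ in scores}
    return {
        m: round(scores[(m, "PoT")] - scores[(m, "CoT")], 1)
        for m in models
        if (m, "PoT") in scores and (m, "CoT") in scores
    }


gains = pot_gain(ACCURACY)
for model, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{model}: +{gain} points")
```

In every pair listed, PoT scores higher than CoT, with the largest gap for GPT-4 and the smallest for Claude-v1.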

Cite our Work

@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}
