Data and Code for Program of Thoughts (TMLR 2023)

Program of Thoughts

This is the code repository for the paper Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.

We propose to disentangle computation from reasoning in the problem-solving process. The large language model only needs to express its thoughts as a Python program; the computation and solving are carried out by an external Python interpreter.
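As a rough illustration of this split, the sketch below executes a program string and reads back the answer. The helper name `run_program`, the answer variable `ans`, and the hand-written `generated` string are assumptions made for this example; they stand in for real model output and are not taken from the repository's scripts.

```python
def run_program(program: str, answer_var: str = "ans"):
    """Execute a program string and return the value bound to answer_var.

    Hypothetical helper: the model supplies the reasoning as code, and the
    Python interpreter does the arithmetic. No sandboxing is applied here.
    """
    namespace = {}
    try:
        exec(program, namespace)
    except Exception:
        # A program that fails to run yields no answer.
        return None
    return namespace.get(answer_var)


# Hand-written stand-in for a program a model might emit for:
# "There were 23 apples. 20 were used and 6 more were bought. How many now?"
generated = """
apples_initial = 23
apples_used = 20
apples_bought = 6
ans = apples_initial - apples_used + apples_bought
"""

result = run_program(generated)
print(result)  # 9
```

Because `exec` runs arbitrary code, anything like this should only be used on trusted output or inside an isolated environment.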

  1. We outperform few-shot CoT by an average of 12% on all the datasets evaluated.
  2. We outperform zero-shot CoT also by an average of 12% on all the datasets evaluated.
  3. We achieve SoTA performance with self-consistency decoding on all the evaluated math word problem datasets (GSM8K, AQuA, SVAMP, TabMWP, MultiArith).
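Self-consistency decoding samples several programs for the same question and keeps the most frequent answer. The sketch below shows only the voting step; the function name `majority_vote` and the rounding precision are assumptions for illustration, not the repository's implementation.

```python
from collections import Counter


def majority_vote(answers, ndigits=6):
    """Return the most common numeric answer, ignoring failed runs (None).

    Answers are converted to float and rounded so that values such as
    9 and 9.0 count as the same vote.
    """
    valid = [round(float(a), ndigits) for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Answers from five hypothetical sampled programs; one failed to execute.
samples = [9, 9, None, 8, 9.0]
print(majority_vote(samples))  # 9.0
```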

Comparison with Few-shot CoT:

Comparison with Few-shot CoT with self-consistency:

Comparison with Zero-shot CoT:

News

  1. Added CoT evaluation of GPT-4 on AQuA QA; the accuracy is 72.7%.
  2. Adding Benchmark

Running the code

First, specify your OpenAI key (note that the shell does not allow spaces around =):

export OPENAI_KEY=[YOUR_KEY]
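If a script should fail early when the key is missing, a check along these lines can help. This is a minimal sketch: `get_openai_key` is a hypothetical helper, and only the variable name `OPENAI_KEY` comes from the command above.

```python
import os


def get_openai_key(env=os.environ):
    """Return the key stored in OPENAI_KEY, or raise if it is unset or empty."""
    key = env.get("OPENAI_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_KEY is not set; export it before running the scripts")
    return key


# Usage (raises RuntimeError when the variable is missing):
# api_key = get_openai_key()
```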
  • Few-shot + Greedy
python run_gsm8k.py --greedy
python run_aqua.py --greedy
...
  • Few-shot + Self-Consistency
python run_gsm8k.py
python run_aqua.py
...
  • Zero-shot
python run_gsm8k_zs.py
python run_aqua_zs.py
...

Prediction files are written to the outputs/ folder, for example gsm8K_s0_e-1_11_17_10_20.jsonl, gsm8K_sc_s0_e-1_11_08_21_14.jsonl, or gsm8K_zs_s0_e-1_11_19_09_55.jsonl.

  • Evaluation
cd outputs
python compute_score.py --inputs gsm8K_s0_e-1_11_17_10_20.jsonl
python compute_score.py --inputs aqua_s0_e-1_11_06_18_38.jsonl
python compute_score.py --inputs svamp_s0_e-1_11_06_21_11.jsonl
....
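For intuition about the EM (exact match) score reported below, the following sketch computes the fraction of predictions that equal the gold answer, with an optional relative tolerance. It is not a reproduction of compute_score.py: the function name `em_score`, the `(prediction, gold)` pair format, and the `rel_tol` parameter are all assumptions for illustration.

```python
import math


def em_score(pairs, rel_tol=0.0):
    """Fraction of (prediction, gold) pairs that match.

    With rel_tol=0.0 the comparison is exact; a positive rel_tol accepts
    small relative differences. A prediction of None counts as wrong.
    """
    if not pairs:
        return 0.0
    correct = 0
    for pred, gold in pairs:
        if pred is None:
            continue
        if math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=0.0):
            correct += 1
    return correct / len(pairs)


# Made-up pairs, only to exercise the function.
pairs = [(9, 9), (3.14, 3.14159), (None, 5), (10, 12)]
print(em_score(pairs))                # 0.25
print(em_score(pairs, rel_tol=1e-2))  # 0.5
```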

Few-shot Results

  1. GSM8K
  • Number of Test Examples: 1318

  • Output: outputs/gsm8K_s0_e-1_11_17_10_20.jsonl

  • EM Score: 0.716

  • Output: outputs/gsm8K_sc_s0_e-1_11_08_21_14.jsonl

  • EM Score: 0.799

  2. AQuA
  • Number of Test Examples: 253

  • Output: outputs/aqua_s0_e-1_11_06_18_38.jsonl

  • EM Score: 0.541

  • Output: outputs/aqua_sc_s0_e-1_11_07_20_49.jsonl

  • EM Score: 0.582

  3. SVAMP
  • Number of Test Examples: 1000

  • Output: outputs/svamp_s0_e-1_11_24_14_38.jsonl

  • EM Score: 0.852

  • Output: outputs/svamp_sc_s0_e-1_11_24_15_54.jsonl

  • EM Score: 0.891

  4. TabMWP
  • Number of Test Examples: 7861

  • Output: outputs/tabmwp_s0_e-1_11_06_22_55.jsonl

  • EM Score: 0.732

  • Output: outputs/tabmwp_sc_s0_e-1_11_08_18_21.jsonl

  • EM Score: 0.818

  5. FinQA
  • Number of Test Examples: 1147

  • Output: outputs/finqa_s0_e-1_11_16_13_29.jsonl

  • EM Score: 0.647

  • Output: outputs/finqa_sc_s0_e-1_11_09_13_00.jsonl

  • EM Score: 0.682

  6. ConvFinQA
  • Number of Test Examples: 421

  • Output: outputs/convfinqa_s0_e-1_11_12_01_38.jsonl

  • EM Score: 0.665

  • Output: outputs/convfinqa_sc_s0_e-1_11_12_02_27.jsonl

  • EM Score: 0.714

  7. TATQA
  • Number of Test Examples: 1668

  • Output: outputs/tatqa_8shot_11_06_19_53.json

  • EM Score: 0.689

  • Output: outputs/tatqa_8shot_11_06_19_53.json

  • EM Score: 0.702

Zero-shot Results

  1. GSM8K
  • Number of Test Examples: 1318
  • Output: outputs/gsm8K_zs_s0_e-1_11_19_09_55.jsonl
  • EM Score: 0.569
  2. AQuA
  • Number of Test Examples: 253
  • Output: outputs/aqua_zs_s0_e-1_11_19_11_56.jsonl
  • EM Score: 0.438
python compute_score.py --inputs aqua_zs_s0_e-1_11_19_11_56.jsonl --relaxed
  3. SVAMP
  • Number of Test Examples: 1000
  • Output: outputs/svamp_zs_s0_e-1_11_18_20_12.jsonl
  • EM Score: 0.708
  4. MultiArith
  • Number of Test Examples: 600
  • Output: outputs/multiarith_zs_s0_e-1_11_19_20_12.jsonl
  • EM Score: 0.922
  5. TabMWP
  • Output: outputs/tabmwp_zs_s0_e-1_11_19_20_01.jsonl
  • EM Score: 0.646

Cite our Work

@article{chen2022program,
  title = {Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks},
  author = {Wenhu Chen and Xueguang Ma and Xinyi Wang and William W. Cohen},
  journal={arXiv preprint arXiv:2211.12588},
  year = {2022},
}
