NaturalProofs: Mathematical Theorem Proving in Natural Language
Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, Kyunghyun Cho
This repo contains:
- The NaturalProofs Dataset
- Tokenized task data for mathematical reference retrieval and generation.
- Preprocessing NaturalProofs and the task data.
- Training and evaluation for mathematical reference retrieval and generation.
- Pretrained models for mathematical reference retrieval and generation.
Please cite our work if you found the resources in this repository useful:
@inproceedings{welleck2021naturalproofs,
title={NaturalProofs: Mathematical Theorem Proving in Natural Language},
author={Sean Welleck and Jiacheng Liu and Ronan Le Bras and Hannaneh Hajishirzi and Yejin Choi and Kyunghyun Cho},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021},
url={https://openreview.net/forum?id=Jvxa8adr3iY}
}
To download and unpack NaturalProofs, use:
pip install gdown
python download.py --naturalproofs --savedir /path/to/savedir
To download and unpack all files that we describe below, use:
python download.py --naturalproofs --tokenized --checkpoint --other --savedir /path/to/savedir
This creates the following file structure:
{savedir}/data # contains NaturalProofs base data (.json files) and tokenized task data (.pkl files)
{savedir}/ckpt # contains pretrained model checkpoints
{savedir}/other # contains precomputed files for evaluation (ref encodings, etc.)
We provide the NaturalProofs Dataset (JSON per domain):
NaturalProofs Dataset [zenodo] | Domain |
---|---|
naturalproofs_proofwiki.json | ProofWiki |
naturalproofs_stacks.json | Stacks |
naturalproofs_trench.json | Real Analysis textbook |
naturalproofs_stein.json (script) | Number Theory textbook |
To download NaturalProofs, use:
python download.py --naturalproofs --savedir /path/to/savedir
The download includes an extra combined ProofWiki+Stacks file made with notebooks/merge.ipynb.
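After downloading, a quick way to sanity-check a file is to load it and inspect the top-level structure. A minimal sketch (the path is illustrative; adjust it to your --savedir):

```python
import json

# Illustrative path; point this at your --savedir.
path = "/path/to/savedir/data/naturalproofs_proofwiki.json"

with open(path) as f:
    data = json.load(f)

# Print the top-level structure to get a feel for the schema.
if isinstance(data, dict):
    print(list(data.keys()))
else:
    print(type(data), len(data))
```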
To see the steps used to create each domain of NaturalProofs from raw data, see the following notebooks.
This preprocessing is not needed if you are using a preprocessed dataset provided above.
Domain | Notebook |
---|---|
ProofWiki | notebook |
Stacks | notebook |
Real Analysis textbook | notebook |
Number Theory textbook | notebook |
To use NaturalProofs for the reference retrieval and generation tasks described in the paper, the first step is tokenization.
We tokenize the raw NaturalProofs Dataset into two different formats:
- Pairwise: (x, r), where x is a theorem (sequence of tokens) and r is a reference (sequence of tokens). This version is used to train and evaluate the pairwise model.
- Sequence: (x, [rid_1, ..., rid_Tx]), where x is a theorem (sequence of tokens) and rid_i is a reference id. This version is used to train and evaluate the autoregressive and joint models (see the sketch below).
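To make the two formats concrete, here is a toy sketch (not the repo's preprocessing code; the theorem text and reference ids are made up) that builds one example of each using the HuggingFace bert-base-cased tokenizer:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

theorem = "If n is even, then n^2 is even."     # toy theorem statement
reference = "An integer n is even iff n = 2k."  # toy reference statement

# Pairwise format: (x, r), both sides are token sequences.
x = tokenizer.encode(theorem)
r = tokenizer.encode(reference)
pairwise_example = (x, r)

# Sequence format: (x, [rid_1, ..., rid_Tx]), where each rid_i is a
# reference id (an index into the reference corpus, not a token id).
rids = [17, 42]  # made-up ids of the references cited in the proof
sequence_example = (x, rids)

print(pairwise_example)
print(sequence_example)
```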
We provide the following versions used in the paper (bert-base-cased tokenizer):
Type | Domain | Splits |
---|---|---|
Pairwise, bert-base-cased | ProofWiki | train,valid,test |
Pairwise, bert-base-cased | Stacks | train,valid,test |
Pairwise, bert-base-cased | Real Analysis (textbook) | test |
Pairwise, bert-base-cased | Number Theory (textbook) | test |
Sequence, bert-base-cased | ProofWiki | train,valid,test |
Sequence, bert-base-cased | Stacks | train,valid,test |
Sequence, bert-base-cased | Real Analysis (textbook) | test |
Sequence, bert-base-cased | Number Theory (textbook) | test |
To download and unpack them, use:
python download.py --tokenized --savedir /path/to/savedir
Or use the Google Drive link.
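To peek inside a downloaded tokenized file, you can unpickle it; a minimal sketch, assuming only that the file is a standard pickle (the exact structure of the loaded object may differ):

```python
import pickle

# Illustrative filename; follows the <DOMAIN>_tokenized__bert-base-cased.pkl pattern.
path = "/path/to/savedir/data/proofwiki_tokenized__bert-base-cased.pkl"

with open(path, "rb") as f:
    ds = pickle.load(f)

print(type(ds))
if isinstance(ds, dict):
    print(list(ds.keys()))  # e.g. per-split entries
```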
We provide the following models used in the paper:
Type | Tokenizer | Domain |
---|---|---|
Pairwise | bert-base-cased | ProofWiki |
Pairwise | bert-base-cased | Stacks |
Pairwise | bert-base-cased | ProofWiki+Stacks |
Joint | bert-base-cased | ProofWiki |
Joint | bert-base-cased | Stacks |
Joint | bert-base-cased | ProofWiki+Stacks |
Autoregressive | bert-base-cased | ProofWiki |
Autoregressive | bert-base-cased | Stacks |
To download and unpack them, use:
python download.py --checkpoint --savedir /path/to/savedir
Or use the Google Drive link.
This step is not needed if you are using a tokenized dataset provided above.
First, set up the code:
python setup.py develop
To create your own tokenized versions:
- Pairwise:
python naturalproofs/tokenize_pairwise.py
- Sequence:
python naturalproofs/encoder_decoder/utils.py
Below we show how to run evaluation using the pretrained model checkpoints and associated files.
We assume the file structure created by the download script:
python download.py --naturalproofs --tokenized --checkpoint --other --savedir <SAVE-DIR>
We provide a script that assembles an evaluation command for a given (model type, domain, task) combination.
Example commands are shown below.
python run_analysis.py \
--train-ds-names {proofwiki stacks}+ \ # one or more training domains to choose a model
--eval-ds-names {proofwiki stacks stein trench}+ \ # one or more evaluation domains
--model {pairwise, joint, autoregressive} \
--generation \ # for generation task (autoregressive or joint models only)
--split {valid, test} \
--gpu <integer GPU id> \
--codedir /path/to/naturalproofs_code \
--datadir <SAVE-DIR>/data \
--ckptdir <SAVE-DIR>/ckpt \
--outdir <SAVE-DIR>/output
To make sure your filepaths line up, please look inside run_analysis.py to see how the --{}dir arguments are used.
python run_analysis.py --train-ds-names proofwiki \
--eval-ds-names proofwiki stein trench \
--model pairwise \
--gpu 1 \
--split test
python run_analysis.py --train-ds-names proofwiki \
--eval-ds-names proofwiki \
--model joint \
--gpu 1 \
--split test
For OOD evaluation on the stein and trench textbooks, provide reference embeddings from the pairwise model. These are the __encs.pt files from running the pairwise retrieval evaluation (we provide an example in other/).
python run_analysis.py --model joint \
--train-ds-names proofwiki \
--eval-ds-names stein trench \
--stein-rencs <SAVE-DIR>/other/pairwise__train_proofwiki__eval_stein__test__encs.pt \
--trench-rencs <SAVE-DIR>/other/pairwise__train_proofwiki__eval_trench__test__encs.pt \
--gpu 1 \
--split test
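If you want to sanity-check an encodings file before passing it in, it can be loaded with torch. A sketch, assuming the file holds a torch-serialized tensor of reference encodings (the exact contents depend on the pairwise evaluation run):

```python
import torch

path = "/path/to/savedir/other/pairwise__train_proofwiki__eval_stein__test__encs.pt"
encs = torch.load(path, map_location="cpu")

print(type(encs))
# If it is a tensor of reference encodings, expect shape (num_refs, hidden_dim).
if torch.is_tensor(encs):
    print(encs.shape)
```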
To align the model's combined output space with the individual dataset used for evaluation, give a tok2tok.pkl map (we provide an example in other/):
python run_analysis.py --model joint \
--train-ds-names both \
--eval-ds-names proofwiki stacks \
--modeltok2datatok <SAVE-DIR>/other/tok2tok.pkl \
--gpu 1 \
--split test
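Conceptually, the map translates reference ids from the combined model's output space into the ids of the individual dataset. A hypothetical sketch of how such a map could be applied, assuming it unpickles to a dict of {combined_id: dataset_id} (check run_analysis.py for the actual format):

```python
import pickle

with open("/path/to/savedir/other/tok2tok.pkl", "rb") as f:
    tok2tok = pickle.load(f)  # assumed: dict mapping combined ids -> dataset ids

predicted = [103, 57, 988]  # made-up reference ids in the combined space
remapped = [tok2tok[t] for t in predicted if t in tok2tok]
print(remapped)
```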
Note that OOD evaluation (stein or trench) is not implemented for the combined model.
Without the --generation flag, the script adjusts settings for retrieval evaluation:
python run_analysis.py --model autoregressive \
--train-ds-names proofwiki \
--eval-ds-names proofwiki \
--gpu 1 \
--split valid
Note that OOD evaluation (stein or trench) is not implemented for the autoregressive model.
python run_analysis.py --model autoregressive --generation \
--train-ds-names proofwiki \
--eval-ds-names proofwiki \
--gpu 1 \
--split valid
The provided code supports:
- Training a pairwise model
- Training an autoregressive or joint model, initialized with pairwise model components (parameters, reference embeddings)
python naturalproofs/model.py --expr-name pairwise \
--datapath /path/to/<DOMAIN>_tokenized__bert-base-cased.pkl \
--default-root-dir /path/to/output
The joint (and autoregressive) model uses a pairwise checkpoint and reference encodings for initialization.
- The pairwise checkpoint is saved during pairwise training.
- The reference encodings are saved in an encs.pt file during pairwise evaluation.
python naturalproofs/encoder_decoder/model.py \
--datapath /path/to/sequence_<DOMAIN>_tokenized__bert-base-cased.pkl \
--default-root-dir /path/to/output \
--pretrained-retrieval-checkpoint /path/to/pairwise__<DOMAIN>.ckpt \
--encs-file /path/to/train_<DOMAIN>__eval_<DOMAIN>__valid__encs.pt \ # obtained from running evaluation on trained pairwise model
--parallel 1 \
--set-mode 1 # discard duplicates
Our implementation uses the same encoder-decoder architecture for the autoregressive and joint models, considering the joint model as a one-step special case (with KL-div loss). See the Appendix for a discussion on this design decision and technical details.
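As a rough illustration of the one-step view (a sketch, not the repo's exact loss code): the joint model can score all references in a single step and match that distribution to the normalized multi-hot target with a KL-divergence loss.

```python
import torch
import torch.nn.functional as F

num_refs = 1000
logits = torch.randn(1, num_refs)  # one decoding step's scores over all reference ids

# Multi-hot target: the set of references cited in the theorem's proof.
target = torch.zeros(1, num_refs)
target[0, [17, 42, 311]] = 1.0  # made-up reference ids
target = target / target.sum(dim=-1, keepdim=True)  # normalize to a distribution

# F.kl_div expects log-probabilities as input and probabilities as target.
loss = F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")
print(loss.item())
```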
python naturalproofs/encoder_decoder/model.py \
--datapath /path/to/sequence_<DOMAIN>_tokenized__bert-base-cased.pkl \
--default-root-dir /path/to/output \
--pretrained-retrieval-checkpoint /path/to/pairwise__<DOMAIN>.ckpt \
--encs-file /path/to/train_<DOMAIN>__eval_<DOMAIN>__valid__encs.pt \ # obtained from running evaluation on trained pairwise model
--parallel 0 \
--set-mode 0 # keep duplicates
TF-IDF example:
python naturalproofs/baselines.py \
--method tfidf \
--datapath /path/to/<DOMAIN>_tokenized__bert-base-cased_200.pkl \
--datapath-base /path/to/naturalproofs_<DOMAIN>.json \
--savedir /path/to/out/
==> /path/to/out/tfidf__eval.pkl
Then use analyze.py to compute metrics:
python naturalproofs/analyze.py \
--method tfidf \
--eval-path /path/to/out/tfidf__eval.pkl \
--datapath-base /path/to/naturalproofs_<DOMAIN>.json
==> /path/to/out/tfidf__analysis.pkl
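For intuition, the idea behind the TF-IDF baseline can be reproduced in a few lines with scikit-learn (a toy sketch with made-up data, not baselines.py itself): fit a vectorizer on the reference statements and rank references by cosine similarity to the theorem statement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy reference corpus and query theorem.
refs = [
    "An integer n is even iff n = 2k for some integer k.",
    "The square of an odd integer is odd.",
    "Every prime greater than 2 is odd.",
]
theorem = "If n is even, then n^2 is even."

vectorizer = TfidfVectorizer()
ref_vecs = vectorizer.fit_transform(refs)    # (num_refs, vocab_size)
query_vec = vectorizer.transform([theorem])  # (1, vocab_size)

# Rank references by similarity to the theorem; higher is better.
scores = cosine_similarity(query_vec, ref_vecs)[0]
for i in scores.argsort()[::-1]:
    print(f"{scores[i]:.3f}  {refs[i]}")
```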