Introduction
This repository contains the code for RSTFinder -- a discourse segmenter & shift-reduce parser based on rhetorical structure theory. A detailed system description can be found in this paper.
Installation
RSTFinder currently works only on Linux and requires Python 3.7, 3.8, 3.9, or 3.10. Python 3.6 is not supported.
The only way to install RSTFinder is by using the conda package manager. If you have already installed conda, you can skip straight to the second step.
- To install conda, follow the instructions on this page.
- Create a new conda environment (say, rstenv) and install the RSTFinder conda package in it:
conda create -n rstenv -c conda-forge -c ets python=3.8 rstfinder
- Activate this conda environment by running:
conda activate rstenv
- Now install the python-zpar package via pip in this environment. This package allows us to use the ZPar constituency parser (more later):
pip install python-zpar
- From now on, you will need to activate this conda environment whenever you want to use RSTFinder. This will ensure that the packages required by RSTFinder will not affect other projects.
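The platform and Python version requirements stated at the top of this section can be checked before creating the environment. The following is a minimal sketch and not part of RSTFinder; the function name is_supported is hypothetical.

```python
import sys

# Supported (major, minor) versions, taken from the requirements stated above.
SUPPORTED_VERSIONS = {(3, 7), (3, 8), (3, 9), (3, 10)}


def is_supported(platform, version):
    """Return True if the platform is Linux and the Python version is supported.

    `platform` is a string such as sys.platform; `version` is a
    (major, minor) tuple such as sys.version_info[:2].
    """
    return platform.startswith("linux") and tuple(version) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    ok = is_supported(sys.platform, sys.version_info[:2])
    print("supported" if ok else "not supported")
```

Since the conda command above pins python=3.8 inside the environment, this check mostly matters for confirming that the host is running Linux.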
Usage
RSTFinder is trained using the RST Discourse Treebank and the Penn Treebank. These treebanks are not freely available and can only be accessed via a personal, academic, or institutional subscription to the Linguistic Data Consortium (LDC), so we cannot make the RSTFinder parser models publicly available. However, we provide detailed instructions so that users can train their own RSTFinder models once they have access to the treebanks.
Train models
- Activate the conda environment. Activate the previously created rstenv conda environment (see Installation):
conda activate rstenv
- Download NLTK tagger model. Due to a rare mismatch between the RST Discourse Treebank and the Penn Treebank documents, there are sometimes parts of a document for which we cannot locate the corresponding parse trees. To get around this issue, we first sentence-tokenize and part-of-speech tag such parts using the MaxEnt POS tagger model from NLTK and then create fake, shallow trees for them. Therefore, we need to download the tokenizer and tagger models:
export NLTK_DATA="$HOME/nltk_data"
python -m nltk.downloader maxent_treebank_pos_tagger punkt
- Pre-process and merge the treebanks. To create a merged dataset that contains the RST Discourse Treebank along with the corresponding Penn Treebank parse trees for the same documents, run the following command (with paths adjusted as appropriate):
convert_rst_discourse_tb ~/corpora/rst_discourse_treebank ~/corpora/treebank_3
where ~/corpora/rst_discourse_treebank is the directory that contains the RST Discourse Treebank files. If you obtained this treebank from the LDC, then this is the directory that contains the index.html file. Similarly, ~/corpora/treebank_3 is the directory that contains the Penn Treebank files. If you obtained this treebank from the LDC, then this is the directory that contains the parsed sub-directory.
- Create a development set. Split the documents in the RST Discourse Treebank training set into a new training and development set:
make_traindev_split
At the end of this command, you will have the following JSON files in your current directory:
  - rst_discourse_tb_edus_TRAINING.json: the original RST Discourse Treebank training set merged with the corresponding Penn Treebank trees in JSON format.
  - rst_discourse_tb_edus_TEST.json: the original RST Discourse Treebank test set merged with the corresponding Penn Treebank trees in JSON format.
  - rst_discourse_tb_edus_TRAINING_DEV.json: the development set split from rst_discourse_tb_edus_TRAINING.json. This file will be used to tune the segmenter and RST parser hyperparameters.
  - rst_discourse_tb_edus_TRAINING_TRAIN.json: the training set split from rst_discourse_tb_edus_TRAINING.json. This file will be used to train the segmenter and the parser.
- Extract the segmenter features. Create inputs (features and labels) to train a discourse segmentation model from the newly created training set:
extract_segmentation_features rst_discourse_tb_edus_TRAINING_TRAIN.json rst_discourse_tb_edus_features_TRAINING_TRAIN.tsv
and the development set:
extract_segmentation_features rst_discourse_tb_edus_TRAINING_DEV.json rst_discourse_tb_edus_features_TRAINING_DEV.tsv
The extracted features for the training and development sets are now in the rst_discourse_tb_edus_features_TRAINING_TRAIN.tsv and rst_discourse_tb_edus_features_TRAINING_DEV.tsv files, respectively.
- Train the CRF segmenter model and tune its hyper-parameters. Train (with the training set) and tune (with the development set) a CRF-based discourse segmentation model:
tune_segmentation_model rst_discourse_tb_edus_features_TRAINING_TRAIN.tsv rst_discourse_tb_edus_features_TRAINING_DEV.tsv segmentation_model
This command iterates over a pre-defined list of values for the C regularization parameter of the CRF. For each value, it trains a model using the features extracted from the training set and then evaluates that model on the development set. Its final output is the C value that yields the highest F1 score on the development set. After this command, you will have a number of files with the prefix segmentation_model in the current directory, e.g., segmentation_model.C0.25, segmentation_model.C1.0, et cetera. These are the CRF model files trained with those specific values of the C regularization parameter. Under the hood, the command uses the crf_learn and crf_test binaries from CRFPP via subprocess.
- Train the logistic regression RST parsing model and tune its hyper-parameters. Train (with the training set) and tune (with the development set) a discourse parsing model that uses logistic regression:
tune_rst_parser rst_discourse_tb_edus_TRAINING_TRAIN.json rst_discourse_tb_edus_TRAINING_DEV.json rst_parsing_model
This command iterates over a pre-defined list of values for the C regularization parameter of logistic regression. For each value, it trains a model using the features extracted from the training set and then evaluates that model on the development set. Its final output is the C value that yields the highest F1 score on the development set. After this command, you will have a number of directories with the prefix rst_parsing_model in the current directory, e.g., rst_parsing_model.C0.25, rst_parsing_model.C1.0, et cetera. Each of these directories contains the logistic regression model files (named rst_parsing_all_feats_LogisticRegression.model) trained with those specific values of the C regularization parameter. Under the hood, this command uses the SKLL machine learning library to train and evaluate the models.
- (Optional) Evaluate trained model. If you want to obtain detailed evaluation metrics for an RST parsing model on the development set, run:
rst_eval rst_discourse_tb_edus_TRAINING_DEV.json -p rst_parsing_model.C1.0 --use_gold_syntax
Of course, you could also use the test set here (rst_discourse_tb_edus_TEST.json) if you wished to do so.
This command will compute precision, recall, and F1 scores for 3 scenarios: spans labeled with nuclearity and relation types, spans labeled only with nuclearity, and unlabeled token spans. --use_gold_syntax means that the command will use gold standard EDUs and syntactic parses.
NOTE: While the evaluation script has basic functionality in place, at the moment it almost certainly does not appropriately handle important edge cases (e.g., same-unit relations, relations at the top of the tree).
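To make the three scoring scenarios concrete, here is a minimal sketch of span-based precision, recall, and F1. It is not RSTFinder's evaluation code: the (start, end, nuclearity, relation) tuple layout, the helper names prf and score_spans, and the sample spans are all assumptions for illustration, and the edge cases mentioned in the note above are ignored.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 between two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_spans(gold, predicted):
    """Score spans three ways: unlabeled, nuclearity only, and fully labeled.

    Each span is a hypothetical (start, end, nuclearity, relation) tuple.
    """
    views = {
        "span": lambda s: s[:2],        # token boundaries only
        "nuclearity": lambda s: s[:3],  # boundaries + nuclearity
        "relation": lambda s: s[:4],    # boundaries + nuclearity + relation
    }
    return {name: prf(map(view, gold), map(view, predicted))
            for name, view in views.items()}


# Made-up spans: the second prediction has the wrong relation label.
gold = [(0, 1, "nucleus", "span"), (2, 3, "satellite", "elaboration")]
pred = [(0, 1, "nucleus", "span"), (2, 3, "satellite", "attribution")]
scores = score_spans(gold, pred)
print(scores)
```

In this toy case the unlabeled and nuclearity-only scores are perfect, while the fully labeled score drops because one relation label differs.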
Use trained models
At this point, we are ready to use the segmentation and RST parsing models to process raw text documents. Before we do that, you will need to download some models for the ZPar parser. RSTFinder uses ZPar to generate constituency parses for new documents. These models can be downloaded from here. Uncompress the models into a directory of your choice, say $HOME/zpar-models.
Next, you need to set the following environment variables:
export NLTK_DATA="$HOME/nltk_data"
export ZPAR_MODEL_DIR="$HOME/zpar-models"
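Before running the parser, it may help to confirm that both variables are set and point to existing directories. The following is a minimal sketch; the helper name missing_dirs is hypothetical and not part of RSTFinder.

```python
import os

# The two environment variables named in the instructions above.
REQUIRED_VARS = ("NLTK_DATA", "ZPAR_MODEL_DIR")


def missing_dirs(environ=os.environ):
    """Return the required variables that are unset or do not name a directory."""
    return [name for name in REQUIRED_VARS
            if not os.path.isdir(environ.get(name, ""))]


if __name__ == "__main__":
    problems = missing_dirs()
    if problems:
        print("Unset or invalid:", ", ".join(problems))
    else:
        print("Environment looks ready.")
```

The check only verifies that the directories exist; it does not inspect whether the expected model files are inside them.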
Now we are good to go! To process a raw text document document.txt with the end-to-end parser (assuming C = 1.0 was the best hyper-parameter value for both the segmentation and RST parsing models), run:
rst_parse -g segmentation_model.C1.0 -p rst_parsing_model.C1.0 document.txt > output.json
output.json contains a dictionary with two keys: edu_tokens and scored_rst_trees. The value corresponding to edu_tokens is a list of lists; each constituent list contains the tokens in an Elementary Discourse Unit (EDU) as computed by the segmenter. The value corresponding to scored_rst_trees is a list of dictionaries; each dictionary has two keys, tree and score, containing the RST parse tree for the document and its score, respectively. By default, only a single tree is produced, but additional trees can be produced by specifying the -n option for rst_parse.
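The structure described above can be read back with Python's json module. The following is a minimal sketch that assumes only the keys named above; the summarize helper and the sample values are made up for illustration, and the tree is treated as an opaque value because its exact serialization is not described here.

```python
import json


def summarize(parsed):
    """Return (number of EDUs, EDU texts, tree entry with the highest score)."""
    # Join each EDU's tokens back into a readable string.
    edus = [" ".join(tokens) for tokens in parsed["edu_tokens"]]
    # Pick the entry whose "score" value is largest.
    top = max(parsed["scored_rst_trees"], key=lambda entry: entry["score"])
    return len(edus), edus, top


# Illustrative stand-in for the contents of output.json (values are made up).
sample = json.loads("""
{
  "edu_tokens": [["The", "cat", "sat"], ["because", "it", "was", "tired", "."]],
  "scored_rst_trees": [{"tree": "(placeholder tree)", "score": -1.5}]
}
""")

count, edus, top = summarize(sample)
print(count, edus[0], top["score"])
```

With a real run, the sample would be replaced by opening output.json and passing the file object to json.load.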
RSTFinder can also produce an HTML/Javascript visualization of the RST parse tree using D3.js. To produce such a visualization from the JSON output file, run:
visualize_rst_tree output.json tree.html --embed_d3js
This will produce a self-contained file called tree.html in the current directory that can be opened with any Javascript-enabled browser to see a visual representation of the RST parse tree.
License
This code is licensed under the MIT license (see LICENSE.txt).