Taming Pretrained Transformers for XMC problems
This is a README for the experimental code of the following paper
Taming Pretrained Transformers for eXtreme Multi-label Text Classification
Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, Inderjit Dhillon
KDD 2020
Updates (2021-04-27)
The latest implementation of X-Transformer (faster training with stronger performance) is available in PECOS; feel free to try it out!
Installation
Dependencies via Conda Environment
> conda env create -f environment.yml
> source activate pt1.2_xmlc_transformer
> (pt1.2_xmlc_transformer) pip install -e .
> (pt1.2_xmlc_transformer) python setup.py install --force
Notice: the following examples are executed inside the (pt1.2_xmlc_transformer) conda virtual environment.
Reproduce Evaluation Results in the Paper
We demonstrate how to reproduce the evaluation results in our paper by downloading the raw dataset and pretrained models.
Download Dataset (Eurlex-4K, Wiki10-31K, AmazonCat-13K, Wiki-500K)
Change directory into the ./datasets folder, then download and unzip each dataset:
cd ./datasets
bash download-data.sh Eurlex-4K
bash download-data.sh Wiki10-31K
bash download-data.sh AmazonCat-13K
bash download-data.sh Wiki-500K
cd ../
Each dataset contains the following files (a loading sketch follows the list):
- label_map.txt: each line is the raw text of a label
- train_raw_text.txt, test_raw_text.txt: each line is the raw text of an instance
- X.trn.npz, X.tst.npz: instance embedding matrices (either sparse TF-IDF or fine-tuned dense embeddings)
- Y.trn.npz, Y.tst.npz: instance-to-label assignment matrices
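A quick way to sanity-check the download is to load these files directly; a minimal sketch (paths assume Eurlex-4K):

import scipy.sparse as smat

# Sparse feature and label matrices (scipy .npz format)
X_trn = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")   # (N_trn, D) instance features
Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")   # (N_trn, L) instance-to-label assignments

# One label string per line
with open("./datasets/Eurlex-4K/label_map.txt", encoding="utf-8") as fin:
    label_map = [line.rstrip("\n") for line in fin]

print("train features:", X_trn.shape)
print("train labels:  ", Y_trn.shape)
print("number of label strings:", len(label_map))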
Download Pretrained Models (processed data, Indexing codes, fine-tuned Transformer models)
Change directory into the ./pretrained_models folder, then download and unzip the models for each dataset:
cd ./pretrained_models
bash download-models.sh Eurlex-4K
bash download-models.sh Wiki10-31K
bash download-models.sh AmazonCat-13K
bash download-models.sh Wiki-500K
cd ../
Each folder has the following structure:
- proc_data: a sub-folder containing X.{trn|tst}.{model}.128.pkl, C.{label-emb}.npz, and L.{label-emb}.npz
- pifa-tfidf-s0: a sub-folder containing the indexer and matcher
- pifa-neural-s0: a sub-folder containing the indexer and matcher
- text-emb-s0: a sub-folder containing the indexer and matcher
Evaluate Linear Models
Given the provided indexing codes (label-to-cluster assignments), train/predict linear models, and evaluate with Precision/Recall@k:
bash eval_linear.sh ${DATASET} ${VERSION}
- DATASET: the dataset name, one of Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K
- VERSION: v0 = sparse TF-IDF features; v1 = sparse TF-IDF features concatenated with dense fine-tuned XLNet embeddings
The evaluation results should be located at
./results_linear/${DATASET}.${VERSION}.txt
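The reported metrics are Precision@k and Recall@k over the top-k predicted labels. As a point of reference only, here is a minimal sketch of Precision@k, assuming a hypothetical sparse score matrix Y_pred and the ground truth from Y.tst.npz (the official numbers come from the evaluation inside eval_linear.sh):

import numpy as np
import scipy.sparse as smat

def precision_at_k(Y_pred, Y_true, k=5):
    # Y_pred: (N_tst, L) sparse score matrix; Y_true: (N_tst, L) binary ground truth.
    Y_pred, Y_true = smat.csr_matrix(Y_pred), smat.csr_matrix(Y_true)
    hits = 0.0
    for i in range(Y_pred.shape[0]):
        row = Y_pred.getrow(i)
        if row.nnz == 0:
            continue
        topk = row.indices[np.argsort(-row.data)[:k]]   # label ids with the k highest scores
        hits += Y_true[i, topk].sum()
    return hits / (k * Y_pred.shape[0])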
Evaluate Fine-tuned X-Transformer Models
Given the provided indexing codes (label-to-cluster assignments) and the fine-tuned Transformer models, train/predict ranker of the X-Transformer framework, and evaluate with Precision/Recall@k:
bash eval_transformer.sh ${DATASET}
- DATASET: the dataset name, one of Eurlex-4K, Wiki10-31K, AmazonCat-13K, or Wiki-500K
The evaluation results should be located at
./results_transformer/${DATASET}.final.txt
Running X-Transformer on customized datasets
The X-Transformer framework consists of 9 configurations (3 label embeddings x 3 model types).
For simplicity, we show 1 out of the 9 here, using LABEL_EMB=pifa-tfidf and MODEL_TYPE=bert.
We will use Eurlex-4K as an example. In the ./datasets/Eurlex-4K folder, we assume the following files are provided (a sketch for building them from your own corpus follows the list):
- X.trn.npz: the instance TF-IDF feature matrix for the train set; a scipy.sparse.csr_matrix of size (N_trn, D_tfidf), where N_trn is the number of train instances and D_tfidf is the number of features
- X.tst.npz: the instance TF-IDF feature matrix for the test set; a scipy.sparse.csr_matrix of size (N_tst, D_tfidf), where N_tst is the number of test instances and D_tfidf is the number of features
- Y.trn.npz: the instance-to-label matrix for the train set; a scipy.sparse.csr_matrix of size (N_trn, L), where N_trn is the number of train instances and L is the number of labels
- Y.tst.npz: the instance-to-label matrix for the test set; a scipy.sparse.csr_matrix of size (N_tst, L), where N_tst is the number of test instances and L is the number of labels
- train_raw_texts.txt: the raw text of the train set, one instance per line
- test_raw_texts.txt: the raw text of the test set, one instance per line
- label_map.txt: the labels' text descriptions, one label per line
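If you are starting from your own corpus, a minimal sketch for producing these files might look like the following (the directory name MyData, TfidfVectorizer, and MultiLabelBinarizer are illustrative choices, not the exact preprocessing used for the benchmark datasets):

import os
import scipy.sparse as smat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

data_dir = "./datasets/MyData"   # hypothetical dataset folder
os.makedirs(data_dir, exist_ok=True)

trn_texts = ["first training document ...", "second training document ..."]
tst_texts = ["first test document ..."]
trn_labels = [["labelA", "labelB"], ["labelB"]]   # label names per train instance
tst_labels = [["labelA"]]

# Sparse TF-IDF features: X.trn.npz / X.tst.npz
vect = TfidfVectorizer()
smat.save_npz(f"{data_dir}/X.trn.npz", vect.fit_transform(trn_texts))
smat.save_npz(f"{data_dir}/X.tst.npz", vect.transform(tst_texts))

# Binary instance-to-label matrices: Y.trn.npz / Y.tst.npz
mlb = MultiLabelBinarizer(sparse_output=True)
smat.save_npz(f"{data_dir}/Y.trn.npz", smat.csr_matrix(mlb.fit_transform(trn_labels)))
smat.save_npz(f"{data_dir}/Y.tst.npz", smat.csr_matrix(mlb.transform(tst_labels)))

# Raw text and label descriptions, one per line
with open(f"{data_dir}/train_raw_texts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(trn_texts) + "\n")
with open(f"{data_dir}/test_raw_texts.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tst_texts) + "\n")
with open(f"{data_dir}/label_map.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(mlb.classes_) + "\n")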
Given those input files, the pipeline can be divided into three stages: Indexer, Matcher, and Ranker.
Indexer
In stage 1, we will do the following
- (1) construct label embedding
- (2) perform hierarchical 2-means and output the instance-to-cluster assignment matrix
- (3) preprocess the input and output for training Transformer models.
TL;DR: steps (1), (2), and (3) are combined into two scripts, run_preprocess_label.sh and run_preprocess_feat.sh. A more detailed explanation follows.
(1) To construct the label embeddings,
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
mkdir -p ${PROC_DATA_DIR}
python -m xbert.preprocess \
--do_label_embedding \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB} \
-x ${LABEL_EMB_INST_PATH}
- DATA_DIR: ./datasets/Eurlex-4K
- PROC_DATA_DIR: ./save_models/Eurlex-4K/proc_data
- LABEL_EMB: pifa-tfidf (you can also try text-emb, or pifa-neural if you have fine-tuned instance embeddings)
- LABEL_EMB_INST_PATH: ./datasets/Eurlex-4K/X.trn.npz
This should yield L.${LABEL_EMB}.npz in the PROC_DATA_DIR.
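For intuition, pifa-tfidf builds each label's embedding by aggregating the TF-IDF features of its positive training instances (Positive Instance Feature Aggregation) and l2-normalizing the result. A rough sketch of the idea (xbert.preprocess is the authoritative implementation):

import scipy.sparse as smat
from sklearn.preprocessing import normalize

X_trn = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")   # (N_trn, D_tfidf)
Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")   # (N_trn, L)

# Sum the feature vectors of each label's positive instances, then l2-normalize each row.
L_pifa = normalize(Y_trn.T.dot(X_trn), norm="l2", axis=1)  # (L, D_tfidf) label embeddings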
(2) To perform hierarchical 2-means,
SEED_LIST=( 0 1 2 )
for SEED in "${SEED_LIST[@]}"; do
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.indexer \
-i ${PROC_DATA_DIR}/L.${LABEL_EMB}.npz \
-o ${INDEXER_DIR} --seed ${SEED}
done
This should yield code.npz in the INDEXER_DIR.
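Conceptually, the indexer applies 2-means recursively to the label embeddings until each cluster is small enough, and code.npz stores the resulting label-to-cluster assignment as a sparse (L x K) matrix. A rough, unbalanced sketch of the idea (xbert.indexer performs the balanced clustering used in the paper):

import numpy as np
import scipy.sparse as smat
from sklearn.cluster import KMeans

def hier_2means(emb, label_ids, max_leaf=64, seed=0):
    # Recursively split label_ids in two with 2-means until clusters are small enough.
    if len(label_ids) <= max_leaf:
        return [label_ids]
    km = KMeans(n_clusters=2, random_state=seed, n_init=10).fit(emb[label_ids])
    left = [l for l, c in zip(label_ids, km.labels_) if c == 0]
    right = [l for l, c in zip(label_ids, km.labels_) if c == 1]
    if not left or not right:            # degenerate split; stop here
        return [label_ids]
    return hier_2means(emb, left, max_leaf, seed) + hier_2means(emb, right, max_leaf, seed)

L_emb = smat.load_npz("./save_models/Eurlex-4K/proc_data/L.pifa-tfidf.npz")
clusters = hier_2means(L_emb, list(range(L_emb.shape[0])))

# Label-to-cluster assignment matrix, analogous to code.npz: rows = labels, cols = clusters.
rows = [l for cid, c in enumerate(clusters) for l in c]
cols = [cid for cid, c in enumerate(clusters) for _ in c]
code = smat.csr_matrix((np.ones(len(rows)), (rows, cols)),
                       shape=(L_emb.shape[0], len(clusters)))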
(3) To preprocess input and output for Transformer models,
SEED=0
LABEL_EMB_NAME=${LABEL_EMB}-s${SEED}
INDEXER_DIR=${OUTPUT_DIR}/${LABEL_EMB_NAME}/indexer
python -u -m xbert.preprocess \
--do_proc_label \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-l ${LABEL_EMB_NAME} \
-c ${INDEXER_DIR}/code.npz
This should yield the instance-to-cluster matrices C.trn.npz and C.tst.npz in the PROC_DATA_DIR.
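These matrices are essentially the instance-to-label matrix pushed through the label-to-cluster assignment; roughly (the actual files are produced by xbert.preprocess --do_proc_label):

import scipy.sparse as smat

Y_trn = smat.load_npz("./datasets/Eurlex-4K/Y.trn.npz")                          # (N_trn, L)
code = smat.load_npz("./save_models/Eurlex-4K/pifa-tfidf-s0/indexer/code.npz")   # (L, K)

# An instance is positive for a cluster if any of its positive labels fall in that cluster.
C_trn = Y_trn.dot(code)
C_trn.data[:] = 1.0   # binarize the per-cluster counts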
OUTPUT_DIR=save_models/${DATASET}
PROC_DATA_DIR=${OUTPUT_DIR}/proc_data
python -u -m xbert.preprocess \
--do_proc_feat \
-i ${DATA_DIR} \
-o ${PROC_DATA_DIR} \
-m ${MODEL_TYPE} \
-n ${MODEL_NAME} \
--max_xseq_len ${MAX_XSEQ_LEN} \
|& tee ${PROC_DATA_DIR}/log.${MODEL_TYPE}.${MAX_XSEQ_LEN}.txt
- MODEL_TYPE: bert (or roberta, xlnet)
- MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- MAX_XSEQ_LEN: the maximum number of tokens; we set it to 128
This should yield X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl and X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl in the PROC_DATA_DIR, matching the file names read by the matcher commands below.
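Under the hood, this step tokenizes the raw text with the Hugging Face tokenizer of the chosen model and truncates/pads every instance to MAX_XSEQ_LEN tokens. A minimal sketch of the same idea using a recent transformers API (the repo's pinned environment may use an older API; xbert.preprocess produces the exact pickled format consumed by xbert/transformer.py):

from transformers import AutoTokenizer

MODEL_NAME = "bert-large-cased-whole-word-masking"
MAX_XSEQ_LEN = 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
with open("./datasets/Eurlex-4K/train_raw_texts.txt", encoding="utf-8") as fin:
    trn_texts = [line.rstrip("\n") for line in fin]

# Fixed-length token-id sequences for the Transformer matcher.
enc = tokenizer(trn_texts, max_length=MAX_XSEQ_LEN, truncation=True,
                padding="max_length", return_tensors="pt")
print(enc["input_ids"].shape)   # (N_trn, MAX_XSEQ_LEN)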
Matcher
In stage 2, we will do the following
- (1) train deep Transformer models to map instances to the induced clusters
- (2) output the predicted cluster scores and the fine-tuned instance embeddings
TL;DR: run_transformer_train.sh. A more detailed explanation follows.
(1) Assume we have 8 Nvidia V100 GPUs. To train the models,
MODEL_DIR=${OUTPUT_DIR}/${INDEXER_NAME}/matcher/${MODEL_NAME}
mkdir -p ${MODEL_DIR}
python -m torch.distributed.launch \
--nproc_per_node 8 xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} --do_train \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-o ${MODEL_DIR} --overwrite_output_dir \
--per_device_train_batch_size ${PER_DEVICE_TRN_BSZ} \
--gradient_accumulation_steps ${GRAD_ACCU_STEPS} \
--max_steps ${MAX_STEPS} \
--warmup_steps ${WARMUP_STEPS} \
--learning_rate ${LEARNING_RATE} \
--logging_steps ${LOGGING_STEPS} \
|& tee ${MODEL_DIR}/log.txt
- MODEL_TYPE: bert (or roberta, xlnet)
- MODEL_NAME: bert-large-cased-whole-word-masking (or roberta-large, xlnet-large-cased)
- PER_DEVICE_TRN_BSZ: 16 if using Nvidia V100 (or 8 if using Nvidia 2080Ti)
- GRAD_ACCU_STEPS: 2 if using Nvidia V100 (or 4 if using Nvidia 2080Ti; see the note after this list)
- MAX_STEPS: set to 1,000 for Eurlex-4K; adjust depending on your dataset
- WARMUP_STEPS: set to 100 for Eurlex-4K; adjust depending on your dataset
- LEARNING_RATE: set to 5e-5 for Eurlex-4K; adjust depending on your dataset
- LOGGING_STEPS: set to 100
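Note that the per-device batch size and the gradient-accumulation steps trade off against each other: the effective (global) batch size is per-device batch size x number of GPUs x gradient-accumulation steps, so both hardware settings above train with the same effective batch size.

# Effective batch size under torch.distributed.launch with gradient accumulation.
n_gpus = 8
print(16 * n_gpus * 2)   # V100 settings above   -> 256
print(8 * n_gpus * 4)    # 2080Ti settings above -> 256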
(2) To generate predictions and instance embeddings,
GPID=0,1,2,3,4,5,6,7
PER_DEVICE_VAL_BSZ=32
CUDA_VISIBLE_DEVICES=${GPID} python -u xbert/transformer.py \
-m ${MODEL_TYPE} -n ${MODEL_NAME} \
--do_eval -o ${MODEL_DIR} \
-x_trn ${PROC_DATA_DIR}/X.trn.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_trn ${PROC_DATA_DIR}/C.trn.${INDEXER_NAME}.npz \
-x_tst ${PROC_DATA_DIR}/X.tst.${MODEL_TYPE}.${MAX_XSEQ_LEN}.pkl \
-c_tst ${PROC_DATA_DIR}/C.tst.${INDEXER_NAME}.npz \
--per_device_eval_batch_size ${PER_DEVICE_VAL_BSZ}
This should yield the following output in the MODEL_DIR (a quick shape check follows the list):
- C_trn_pred.npz and C_tst_pred.npz: model-predicted cluster scores
- trn_embeddings.npy and tst_embeddings.npy: fine-tuned instance embeddings
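A quick way to check these outputs (paths follow the MODEL_DIR layout above, assuming INDEXER_NAME=pifa-tfidf-s0; K is the number of clusters and the embedding width depends on the chosen Transformer):

import numpy as np
import scipy.sparse as smat

model_dir = ("./save_models/Eurlex-4K/pifa-tfidf-s0/matcher/"
             "bert-large-cased-whole-word-masking")
C_tst_pred = smat.load_npz(f"{model_dir}/C_tst_pred.npz")   # (N_tst, K) predicted cluster scores
tst_emb = np.load(f"{model_dir}/tst_embeddings.npy")        # (N_tst, hidden_dim) dense embeddings
print(C_tst_pred.shape, tst_emb.shape)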
Ranker
In stage 3, we will do the following
- (1) train linear rankers to map instances and predicted cluster scores to label scores
- (2) output top-k predicted labels
TL;DR: run_transformer_predict.sh. A more detailed explanation follows.
(1) To train linear rankers,
LABEL_NAME=pifa-tfidf-s0
MODEL_NAME=bert-large-cased-whole-word-masking
OUTPUT_DIR=save_models/${DATASET}/${LABEL_NAME}
INDEXER_DIR=${OUTPUT_DIR}/indexer
MATCHER_DIR=${OUTPUT_DIR}/matcher/${MODEL_NAME}
RANKER_DIR=${OUTPUT_DIR}/ranker/${MODEL_NAME}
mkdir -p ${RANKER_DIR}
python -m xbert.ranker train \
-x1 ${DATA_DIR}/X.trn.npz \
-x2 ${MATCHER_DIR}/trn_embeddings.npy \
-y ${DATA_DIR}/Y.trn.npz \
-z ${MATCHER_DIR}/C_trn_pred.npz \
-c ${INDEXER_DIR}/code.npz \
-o ${RANKER_DIR} -t 0.01 \
-f 0 --mode ranker
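The ranker consumes two feature views per instance: the sparse TF-IDF matrix (-x1) and the dense matcher embeddings (-x2). Conceptually they are combined into one feature matrix, for example like this (a sketch of the idea, not the exact xbert.ranker code):

import numpy as np
import scipy.sparse as smat
from sklearn.preprocessing import normalize

X1 = smat.load_npz("./datasets/Eurlex-4K/X.trn.npz")                      # sparse TF-IDF features
X2 = np.load("./save_models/Eurlex-4K/pifa-tfidf-s0/matcher/"
             "bert-large-cased-whole-word-masking/trn_embeddings.npy")    # dense fine-tuned embeddings

# Row-normalize each view and concatenate along the feature dimension.
X_cat = smat.hstack([normalize(X1), normalize(smat.csr_matrix(X2))]).tocsr()
print(X_cat.shape)   # (N_trn, D_tfidf + hidden_dim)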
(2) To predict the final top-k labels,
PRED_NPZ_PATH=${RANKER_DIR}/tst.pred.npz
python -m xbert.ranker predict \
-m ${RANKER_DIR} -o ${PRED_NPZ_PATH} \
-x1 ${DATA_DIR}/X.tst.npz \
-x2 ${MATCHER_DIR}/tst_embeddings.npy \
-y ${DATA_DIR}/Y.tst.npz \
-z ${MATCHER_DIR}/C_tst_pred.npz \
-f 0 -t noop
This should yield the predicted top-k labels in tst.pred.npz, at the location specified by PRED_NPZ_PATH.
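To inspect the result, tst.pred.npz can be loaded as a sparse score matrix and mapped back to the label texts, e.g. (paths assume the RANKER_DIR above):

import numpy as np
import scipy.sparse as smat

pred = smat.load_npz("./save_models/Eurlex-4K/pifa-tfidf-s0/ranker/"
                     "bert-large-cased-whole-word-masking/tst.pred.npz")   # (N_tst, L) label scores
with open("./datasets/Eurlex-4K/label_map.txt", encoding="utf-8") as fin:
    label_map = [line.rstrip("\n") for line in fin]

row = pred.getrow(0)                              # first test instance
top5 = row.indices[np.argsort(-row.data)[:5]]     # label ids of the 5 highest scores
print([label_map[l] for l in top5])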
Acknowledgements
Some portions of this repo are borrowed from the following repos: