Linguistic Style-Transfer
Neural network model to disentangle and transfer linguistic style in text
Prerequisites
Notes
- Ignore CUDA_DEVICE_ORDER="PCI_BUS_ID", CUDA_VISIBLE_DEVICES="0" unless you're training with a GPU.
- Input data file format: ${TEXT_FILE_PATH} should have 1 sentence per line.
- Similarly, ${LABEL_FILE_PATH} should have 1 label per line.
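Because the text and label files are paired line by line, a quick sanity check before training can catch misaligned data. The following is a minimal sketch and not part of the project's scripts; the helper name check_parallel_files is hypothetical.

```python
from pathlib import Path


def check_parallel_files(text_file_path, label_file_path):
    """Return the number of (sentence, label) pairs, or raise if the files are misaligned."""
    # Hypothetical helper: it only verifies the one-item-per-line convention described above.
    sentences = Path(text_file_path).read_text(encoding="utf-8").splitlines()
    labels = Path(label_file_path).read_text(encoding="utf-8").splitlines()

    if len(sentences) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(sentences)} sentences vs {len(labels)} labels"
        )
    for line_number, (sentence, label) in enumerate(zip(sentences, labels), start=1):
        if not sentence.strip() or not label.strip():
            raise ValueError(f"empty sentence or label on line {line_number}")
    return len(sentences)
```

Calling check_parallel_files with the values of ${TEXT_FILE_PATH} and ${LABEL_FILE_PATH} returns the pair count when the files line up and raises ValueError otherwise.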
- Assuming that you already have g++ and bash installed, run the following commands to set up the kenlm library properly:
wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
mkdir kenlm/build
cd kenlm/build
sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev
(to install basic dependencies)
- Install Boost:
- Download boost_1_67_0.tar.bz2 from here
tar --bzip2 -xf /path/to/boost_1_67_0.tar.bz2
- Install Eigen:
export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9
cd $HOME; wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 |tar xj
- Go back to the kenlm/build folder and run:
rm CMakeCache.txt
cmake ..
make -j2
Data Sources
Customer Review Datasets
Word Embeddings
References to ${VALIDATION_WORD_EMBEDDINGS_PATH} in the instructions below should be replaced by the path to the file glove.6B.100d.txt, which can be downloaded from here.
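GloVe text files store one word per line followed by its space-separated vector components (100 values per word in glove.6B.100d.txt). As a rough illustration of that layout, the sketch below parses such lines into a dictionary; the function name load_glove_embeddings is hypothetical and this is not how the project itself necessarily loads the file.

```python
def load_glove_embeddings(lines):
    """Parse GloVe text-format lines into a {word: [float, ...]} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Usage with the real file (the path is a placeholder):
# with open("/path/to/glove.6B.100d.txt", encoding="utf-8") as glove_file:
#     vectors = load_glove_embeddings(glove_file)

# Tiny in-memory sample with 3-dimensional vectors, for illustration only.
sample = ["the 0.1 -0.2 0.3", "cat 0.4 0.5 -0.6"]
vectors = load_glove_embeddings(sample)
print(len(vectors["the"]))  # 3
```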
Opinion Lexicon
The file "data/opinion-lexicon/sentiment-words.txt", referenced in global_config.py, can be downloaded from the page below.
Pretraining
Run a corpus cleaner/adapter
./scripts/run_corpus_adapter.sh \
linguistic_style_transfer_model/corpus_adapters/${CORPUS_ADAPTER_SCRIPT}
Train word embedding model
./scripts/run_word_vector_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--model-file-path ${WORD_EMBEDDINGS_PATH}
Train validation classifier
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_classifier_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-epochs ${NUM_EPOCHS} --vocab-size ${VOCAB_SIZE}
This will produce a folder like saved-models-classifier/xxxxxxxxxx.
Train Kneser-Ney Language Model
Use the command below to train an n-gram language model (run it from the kenlm/build folder):
./bin/lmplz -o ${n} --text ${TRAINING_TEXT_FILE_PATH} > ${LANGUAGE_MODEL_PATH}
Extract label-correlated words
./scripts/run_word_retriever.sh \
--text-file-path ${TEXT_FILE_PATH} \
--label-file-path ${LABEL_FILE_PATH} \
--logging-level ${LOGGING_LEVEL}
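This README does not describe how the word retriever scores words, so the sketch below is only a stand-in to illustrate the idea of label-correlated words. It ranks each word by a smoothed ratio of its count under one label to its count under all other labels; the function name, the scoring rule, and the top_k and smoothing parameters are all assumptions.

```python
from collections import Counter, defaultdict


def label_correlated_words(sentences, labels, top_k=3, smoothing=1.0):
    """Rank words per label by a smoothed in-label / out-of-label count ratio."""
    # Assumed scoring rule for illustration; the project's script may work differently.
    counts = defaultdict(Counter)
    for sentence, label in zip(sentences, labels):
        counts[label].update(sentence.lower().split())

    total = Counter()
    for label_counts in counts.values():
        total.update(label_counts)

    ranked = {}
    for label, label_counts in counts.items():
        scores = {
            word: (label_counts[word] + smoothing)
            / (total[word] - label_counts[word] + smoothing)
            for word in label_counts
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked[label] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return ranked


sentences = ["the food was great", "great service", "the food was awful", "awful service"]
labels = ["pos", "pos", "neg", "neg"]
print(label_correlated_words(sentences, labels, top_k=1))
```

Words that appear equally often under every label (here "the", "food", "was", "service") score 1.0, while words concentrated in one label score higher.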
Style Transfer Model Training
Train style transfer model
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--train-model \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-embeddings-file-path ${TRAINING_WORD_EMBEDDINGS_PATH} \
--validation-text-file-path ${VALIDATION_TEXT_FILE_PATH} \
--validation-label-file-path ${VALIDATION_LABEL_FILE_PATH} \
--validation-embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--dump-embeddings \
--training-epochs ${NUM_EPOCHS} \
--vocab-size ${VOCAB_SIZE} \
--logging-level="DEBUG"
This will produce a folder like saved-models/xxxxxxxxxx.
It will also produce output/xxxxxxxxxx-training if validation is turned on.
Infer style transferred sentences
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--transform-text \
--evaluation-text-file-path ${TEST_TEXT_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--logging-level="DEBUG"
This will produce a folder like output/xxxxxxxxxx-inference.
Generate new sentences
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--generate-novel-text \
--saved-model-path ${SAVED_MODEL_PATH} \
--num-sentences-to-generate ${NUM_SENTENCES} \
--logging-level="DEBUG"
This will produce a folder like output/xxxxxxxxxx-generation.
Visualizations
Plot validation accuracy metrics
./scripts/run_validation_scores_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}
This will produce a few files like ${SAVED_MODEL_PATH}/validation_xxxxxxxxxx.svg
Plot T-SNE embedding spaces
./scripts/run_tsne_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}
This will produce a few files like ${SAVED_MODEL_PATH}/tsne_plots/tsne_embeddings_plot_xx.svg
Run evaluation metrics
Style Transfer
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-index ${GENERATED_TEXT_LABEL}
Alternatively, if you have a file with the labels, use the command below instead:
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-file-path ${GENERATED_LABELS_FILE_PATH}
Content Preservation
./scripts/run_content_preservation_evaluator.sh \
--embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--source-file-path ${TEST_TEXT_FILE_PATH} \
--target-file-path ${GENERATED_TEXT_FILE_PATH}
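The evaluator takes word embeddings plus source and target sentence files. One common way to measure content preservation with those inputs is the cosine similarity between pooled sentence vectors. The sketch below uses mean pooling and is an assumption about the general technique, not a description of what the project's script computes; all function names are hypothetical.

```python
import math


def sentence_vector(sentence, embeddings):
    """Average the vectors of in-vocabulary words; return None if no word is known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def content_preservation(source_sentences, target_sentences, embeddings):
    """Mean cosine similarity over sentence pairs where both sides have a vector."""
    scores = []
    for source, target in zip(source_sentences, target_sentences):
        source_vec = sentence_vector(source, embeddings)
        target_vec = sentence_vector(target, embeddings)
        if source_vec is not None and target_vec is not None:
            scores.append(cosine_similarity(source_vec, target_vec))
    return sum(scores) / len(scores) if scores else 0.0
```

Pairs in which either sentence has no in-vocabulary word are skipped rather than scored as zero; that choice is also an assumption and changes the average on noisy data.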
Latent Space Predicted Label Accuracy
./scripts/run_label_accuracy_prediction.sh \
--gold-labels-file-path ${TEST_LABEL_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--predictions-file-path ${PREDICTIONS_LABEL_FILE_PATH}
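Given a gold label file and a predictions file with one label per line, accuracy is the fraction of lines on which the two agree. A minimal sketch of that comparison, with a hypothetical function name and no claim about the script's internals:

```python
def label_accuracy(gold_labels, predicted_labels):
    """Return the fraction of positions where the predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have the same length")
    if not gold_labels:
        return 0.0
    matches = sum(
        gold.strip() == predicted.strip()
        for gold, predicted in zip(gold_labels, predicted_labels)
    )
    return matches / len(gold_labels)


print(label_accuracy(["1", "0", "1", "1"], ["1", "1", "1", "0"]))  # 0.5
```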
Language Fluency
./scripts/run_language_fluency_evaluator.sh \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--generated-text-file-path ${GENERATED_TEXT_FILE_PATH}
Log-likelihood values are base 10.
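Because the scores are base-10 logarithms, comparing them with natural-log values or turning them into perplexity requires a change of base. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and which tokens to count (for example, whether to include an end-of-sentence marker) is a convention left to the caller.

```python
import math


def log10_to_natural(log10_value):
    """Convert a base-10 log-likelihood to a natural-log likelihood."""
    return log10_value * math.log(10.0)


def perplexity_from_log10(total_log10_likelihood, token_count):
    """Perplexity = 10 ** (-total log10 likelihood / number of scored tokens)."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 10.0 ** (-total_log10_likelihood / token_count)


# A total log10 likelihood of -6.0 over 3 tokens is -2.0 per token, i.e. perplexity 100.
print(perplexity_from_log10(-6.0, 3))
```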
All Evaluation Metrics (works only for the output of this project)
CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_all_evaluators.sh \
--embeddings-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--classifier-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--training-path ${SAVED_MODEL_PATH} \
--inference-path ${GENERATED_SENTENCES_SAVE_PATH}