Linguistic Style-Transfer

Neural network model to disentangle and transfer linguistic style in text

Prerequistites

Notes

Ignore CUDA_DEVICE_ORDER="PCI_BUS_ID", CUDA_VISIBLE_DEVICES="0" unless you're training with a GPU
Input data file format:
- ${TEXT_FILE_PATH} should have 1 sentence per line.
- Similarly, ${LABEL_FILE_PATH} should have 1 label per line.
Assuming that you already have g++ and bash installed, run the following commands to setup the kenlm library properly:
- wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
- mkdir kenlm/build
- cd kenlm/build
- sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev (to install basic dependencies)
- Install Boost:
  - Download boost_1_67_0.tar.bz2 from here
  - tar --bzip2 -xf /path/to/boost_1_67_0.tar.bz2
- Install Eigen:
  - export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9
  - cd $HOME; wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 |tar xj
  - Go back to the kenlm/build folder and run rm CMakeCache.txt
- cmake ..
- make -j2

Data Sources

Customer Review Datasets

Yelp Service Reviews - Link
Amazon Product Reviews - Link

Word Embeddings

References to ${VALIDATION_WORD_EMBEDDINGS_PATH} in the instructions below should be replaced by the path to the file glove.6B.100d.txt, which can be downloaded from here.

Opinion Lexicon

The file "data/opinion-lexicon/sentiment-words.txt", referenced in global_config.py can be downloaded from below page.

Pretraining

Run a corpus cleaner/adapter

./scripts/run_corpus_adapter.sh \
linguistic_style_transfer_model/corpus_adapters/${CORPUS_ADAPTER_SCRIPT}

Train word embedding model

./scripts/run_word_vector_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--model-file-path ${WORD_EMBEDDINGS_PATH}

Train validation classifier

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_classifier_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-epochs ${NUM_EPOCHS} --vocab-size ${VOCAB_SIZE}

This will produce a folder like saved-models-classifier/xxxxxxxxxx.

Train Kneser-Ney Language Model

Use the below command to train a n-gram language model (run from the kenlm/build folder)

./bin/lmplz -o ${n} --text ${TRAINING_TEXT_FILE_PATH} > ${LANGUAGE_MODEL_PATH}

Extract label-correlated words

./scripts/run_word_retriever.sh \
--text-file-path ${TEXT_FILE_PATH} \
--label-file-path ${LABEL_FILE_PATH} \
--logging-level ${LOGGING_LEVEL}

Style Transfer Model Training

Train style transfer model

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--train-model \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-embeddings-file-path ${TRAINING_WORD_EMBEDDINGS_PATH} \
--validation-text-file-path ${VALIDATION_TEXT_FILE_PATH} \
--validation-label-file-path ${VALIDATION_LABEL_FILE_PATH} \
--validation-embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--dump-embeddings \
--training-epochs ${NUM_EPOCHS} \
--vocab-size ${VOCAB_SIZE} \
--logging-level="DEBUG"

This will produce a folder like saved-models/xxxxxxxxxx. It will also produce output/xxxxxxxxxx-training if validation is turned on.

Infer style transferred sentences

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--transform-text \
--evaluation-text-file-path ${TEST_TEXT_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--logging-level="DEBUG"

This will produce a folder like output/xxxxxxxxxx-inference.

Generate new sentences

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--generate-novel-text \
--saved-model-path ${SAVED_MODEL_PATH} \
--num-sentences-to-generate ${NUM_SENTENCES}
--logging-level="DEBUG"

This will produce a folder like output/xxxxxxxxxx-generation.

Visualizations

Plot validation accuracy metrics

./scripts/run_validation_scores_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}

This will produce a few files like ${SAVED_MODEL_PATH}/validation_xxxxxxxxxx.svg

Plot T-SNE embedding spaces

./scripts/run_tsne_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}

This will produce a few files like ${SAVED_MODEL_PATH}/tsne_plots/tsne_embeddings_plot_xx.svg

Run evaluation metrics

Style Transfer

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-index ${GENERATED_TEXT_LABEL}

Alternatively, if you have a file with the labels, use the below command instead

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-file-path ${GENERATED_LABELS_FILE_PATH}

Content Preservation

./scripts/run_content_preservation_evaluator.sh \
--embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--source-file-path ${TEST_TEXT_FILE_PATH} \
--target-file-path ${GENERATED_TEXT_FILE_PATH}

Latent Space Predicted Label Accuracy

./scripts/run_label_accuracy_prediction.sh \
--gold-labels-file-path ${TEST_LABEL_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--predictions-file-path ${PREDICTIONS_LABEL_FILE_PATH}

Language Fluency

./scripts/run_language_fluency_evaluator.sh \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--generated-text-file-path ${GENERATED_TEXT_FILE_PATH}

Log-likelihood values are base 10.

All Evaluation Metrics (works only for the output of this project)

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_all_evaluators.sh \
--embeddings-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--classifier-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--training-path ${SAVED_MODEL_PATH} \
--inference-path ${GENERATED_SENTENCES_SAVE_PATH}

vineetjohn/linguistic-style-transfer

vineetjohn

Reviews

Repository Details