• This repository has been archived on 08/Nov/2020
  • Stars: 136
  • Rank 267,670 (Top 6 %)
  • Language: Python
  • License: Apache License 2.0
  • Created almost 7 years ago
  • Updated about 5 years ago

Repository Details

Neural network parametrized objective to disentangle and transfer style and content in text

Linguistic Style-Transfer

Neural network model to disentangle and transfer linguistic style in text


Prerequisites


Notes

  • Ignore CUDA_DEVICE_ORDER="PCI_BUS_ID", CUDA_VISIBLE_DEVICES="0" unless you're training with a GPU
  • Input data file format:
    • ${TEXT_FILE_PATH} should have 1 sentence per line.
    • Similarly, ${LABEL_FILE_PATH} should have 1 label per line.
  • Assuming that you already have g++ and bash installed, run the following commands to set up the kenlm library properly:
    • wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz
    • mkdir kenlm/build
    • cd kenlm/build
    • sudo apt-get install build-essential libboost-all-dev cmake zlib1g-dev libbz2-dev liblzma-dev (to install basic dependencies)
    • Install Boost:
      • Download boost_1_67_0.tar.bz2 from here
      • tar --bzip2 -xf /path/to/boost_1_67_0.tar.bz2
    • Install Eigen:
      • export EIGEN3_ROOT=$HOME/eigen-eigen-07105f7124f9
      • cd $HOME; wget -O - https://bitbucket.org/eigen/eigen/get/3.2.8.tar.bz2 |tar xj
      • Go back to the kenlm/build folder and run rm CMakeCache.txt
    • cmake ..
    • make -j2
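
The input file format in the notes above (one sentence per line in ${TEXT_FILE_PATH}, one label per line in ${LABEL_FILE_PATH}) implies that the two files must have the same number of lines. The following is a minimal sketch of a sanity check to run before training. The helper names count_lines and check_parallel_files are hypothetical and are not part of this repository.

```python
def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel_files(text_file_path, label_file_path):
    """Return the shared line count, or raise if the two files disagree.

    Hypothetical helper: assumes line i of the label file holds the
    label for line i of the text file.
    """
    num_sentences = count_lines(text_file_path)
    num_labels = count_lines(label_file_path)
    if num_sentences != num_labels:
        raise ValueError(
            f"{text_file_path} has {num_sentences} lines but "
            f"{label_file_path} has {num_labels}"
        )
    return num_sentences
```

Calling check_parallel_files on the training and validation pairs before launching a long training run catches a misaligned corpus early.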

Data Sources

Customer Review Datasets

  • Yelp Service Reviews - Link
  • Amazon Product Reviews - Link

Word Embeddings

References to ${VALIDATION_WORD_EMBEDDINGS_PATH} in the instructions below should be replaced by the path to the file glove.6B.100d.txt, which can be downloaded from here.
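
For orientation, glove.6B.100d.txt is a plain-text file in which each line holds a word followed by 100 space-separated floats. A minimal sketch of reading such a file is shown below. The function name load_glove_embeddings is hypothetical, and this is not necessarily how the project's own scripts load the file.

```python
def load_glove_embeddings(path, expected_dim=100):
    """Read a GloVe-style text file into a dict of word -> list of floats.

    Hypothetical helper: lines whose vector length differs from
    expected_dim are skipped rather than treated as errors.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != expected_dim:
                continue
            embeddings[word] = [float(value) for value in values]
    return embeddings
```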

Opinion Lexicon

The file "data/opinion-lexicon/sentiment-words.txt", referenced in global_config.py, can be downloaded from the page below.


Pretraining

Run a corpus cleaner/adapter

./scripts/run_corpus_adapter.sh \
linguistic_style_transfer_model/corpus_adapters/${CORPUS_ADAPTER_SCRIPT}

Train word embedding model

./scripts/run_word_vector_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--model-file-path ${WORD_EMBEDDINGS_PATH}

Train validation classifier

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_classifier_training.sh \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-epochs ${NUM_EPOCHS} --vocab-size ${VOCAB_SIZE}

This will produce a folder like saved-models-classifier/xxxxxxxxxx.

Train Kneser-Ney Language Model

Use the command below to train an n-gram language model (run from the kenlm/build folder)

./bin/lmplz -o ${n} --text ${TRAINING_TEXT_FILE_PATH} > ${LANGUAGE_MODEL_PATH}

Extract label-correlated words

./scripts/run_word_retriever.sh \
--text-file-path ${TEXT_FILE_PATH} \
--label-file-path ${LABEL_FILE_PATH} \
--logging-level ${LOGGING_LEVEL}

Style Transfer Model Training

Train style transfer model

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--train-model \
--text-file-path ${TRAINING_TEXT_FILE_PATH} \
--label-file-path ${TRAINING_LABEL_FILE_PATH} \
--training-embeddings-file-path ${TRAINING_WORD_EMBEDDINGS_PATH} \
--validation-text-file-path ${VALIDATION_TEXT_FILE_PATH} \
--validation-label-file-path ${VALIDATION_LABEL_FILE_PATH} \
--validation-embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--dump-embeddings \
--training-epochs ${NUM_EPOCHS} \
--vocab-size ${VOCAB_SIZE} \
--logging-level="DEBUG"

This will produce a folder like saved-models/xxxxxxxxxx. It will also produce output/xxxxxxxxxx-training if validation is turned on.

Infer style transferred sentences

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--transform-text \
--evaluation-text-file-path ${TEST_TEXT_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--logging-level="DEBUG"

This will produce a folder like output/xxxxxxxxxx-inference.

Generate new sentences

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_linguistic_style_transfer_model.sh \
--generate-novel-text \
--saved-model-path ${SAVED_MODEL_PATH} \
--num-sentences-to-generate ${NUM_SENTENCES} \
--logging-level="DEBUG"

This will produce a folder like output/xxxxxxxxxx-generation.


Visualizations

Plot validation accuracy metrics

./scripts/run_validation_scores_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}

This will produce a few files like ${SAVED_MODEL_PATH}/validation_xxxxxxxxxx.svg

Plot T-SNE embedding spaces

./scripts/run_tsne_visualization_generator.sh \
--saved-model-path ${SAVED_MODEL_PATH}

This will produce a few files like ${SAVED_MODEL_PATH}/tsne_plots/tsne_embeddings_plot_xx.svg


Run evaluation metrics

Style Transfer

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-index ${GENERATED_TEXT_LABEL}

Alternatively, if you have a file with the labels, use the command below instead

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_style_transfer_evaluator.sh \
--classifier-saved-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--text-file-path ${GENERATED_TEXT_FILE_PATH} \
--label-file-path ${GENERATED_LABELS_FILE_PATH}

Content Preservation

./scripts/run_content_preservation_evaluator.sh \
--embeddings-file-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--source-file-path ${TEST_TEXT_FILE_PATH} \
--target-file-path ${GENERATED_TEXT_FILE_PATH}
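
To illustrate what an embedding-based content preservation score can look like, the sketch below averages the word vectors of each sentence and takes the cosine similarity between the source and generated sentence. This is an assumption about the general approach, not a description of what run_content_preservation_evaluator.sh computes. All function names are hypothetical, and embeddings is assumed to be a dict of word -> list of floats.

```python
import math


def mean_sentence_vector(sentence, embeddings):
    """Average the vectors of the in-vocabulary words of a sentence."""
    vectors = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def content_preservation(source_sentences, target_sentences, embeddings):
    """Mean cosine similarity over aligned source/generated sentence pairs.

    Pairs in which either sentence has no in-vocabulary word are skipped.
    """
    scores = []
    for source, target in zip(source_sentences, target_sentences):
        source_vec = mean_sentence_vector(source, embeddings)
        target_vec = mean_sentence_vector(target, embeddings)
        if source_vec is None or target_vec is None:
            continue
        scores.append(cosine_similarity(source_vec, target_vec))
    return sum(scores) / len(scores) if scores else 0.0
```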

Latent Space Predicted Label Accuracy

./scripts/run_label_accuracy_prediction.sh \
--gold-labels-file-path ${TEST_LABEL_FILE_PATH} \
--saved-model-path ${SAVED_MODEL_PATH} \
--predictions-file-path ${PREDICTIONS_LABEL_FILE_PATH}
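
Conceptually, a label accuracy of this kind is the fraction of predicted labels that match the gold labels. The sketch below shows that calculation under the assumption that both files contain one label per line in the same order. The helpers read_labels and label_accuracy are hypothetical and are not the script's actual implementation.

```python
def read_labels(path):
    """Read one label per line from a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]


def label_accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label counts differ")
    if not gold_labels:
        return 0.0
    matches = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return matches / len(gold_labels)
```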

Language Fluency

./scripts/run_language_fluency_evaluator.sh \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--generated-text-file-path ${GENERATED_TEXT_FILE_PATH}

Log-likelihood values are base 10.
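
Because the values are base 10, comparing them with natural-log scores or converting them to perplexity requires a change of base. A minimal sketch of both conversions follows. The function names are hypothetical, and whether num_tokens should include an end-of-sentence token is an assumption left to the caller.

```python
import math


def log10_to_natural(log10_value):
    """Convert a base-10 log-likelihood to a natural-log likelihood."""
    return log10_value * math.log(10.0)


def perplexity_from_log10(total_log10_likelihood, num_tokens):
    """Perplexity from a summed base-10 log-likelihood over num_tokens tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-total_log10_likelihood / num_tokens)
```

For example, a sentence with a total base-10 log-likelihood of -6.0 over 3 tokens has a perplexity of 100.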

All Evaluation Metrics (works only for the output of this project)

CUDA_DEVICE_ORDER="PCI_BUS_ID" \
CUDA_VISIBLE_DEVICES="0" \
TF_CPP_MIN_LOG_LEVEL=1 \
./scripts/run_all_evaluators.sh \
--embeddings-path ${VALIDATION_WORD_EMBEDDINGS_PATH} \
--language-model-path ${LANGUAGE_MODEL_PATH} \
--classifier-model-path ${CLASSIFIER_SAVED_MODEL_PATH} \
--training-path ${SAVED_MODEL_PATH} \
--inference-path ${GENERATED_SENTENCES_SAVE_PATH}
