A wrapper around tensor2tensor to flexibly train, interact, and generate data for neural chatbots.

Seq2seqChatbots

The wiki contains my notes and summaries of over 150 recent publications related to neural dialog modeling.

Features

💾 Run your own trainings or experiment with pre-trained models
✅ 4 different dialog datasets integrated with tensor2tensor
🔀 Seamlessly works with any model or hyperparameter set in tensor2tensor
🚀 Easily extendable base class for dialog problems

Setup

Run setup.py which installs required packages and steps you through downloading additional data:

python setup.py

You can download all trained models used in this paper from here. Each training contains two checkpoints, one for the validation loss minimum and another after 150 epochs. The data and the trainings folder structure match each other exactly.

Usage

python t2t_csaky/main.py --mode=train

The mode argument can be one of the following four: {generate_data, train, decode, experiment}. In the experiment mode you can specify what to do inside the experiment function of the run file. Each mode is explained in detail below.
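
As a minimal sketch of how a --mode flag restricted to these four values could be parsed, the snippet below uses argparse. Only the four mode names come from this README; the parse_mode function is hypothetical and is not the repo's actual main.py.

```python
import argparse

# The four mode names come from the README; everything else is illustrative.
MODES = ("generate_data", "train", "decode", "experiment")


def parse_mode(argv):
  """Parse a --mode flag and reject anything outside the four known modes."""
  parser = argparse.ArgumentParser(description="Hypothetical mode parser.")
  parser.add_argument("--mode", choices=MODES, required=True)
  return parser.parse_args(argv).mode
```

Calling parse_mode(["--mode=train"]) returns the string "train", while an unknown mode makes argparse exit with a usage message.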

Config

You can control the flags and parameters of each mode directly in the config file. For each run that you initiate, this file is copied to the appropriate directory, so you can quickly look up the parameters of any run. Some flags have to be set for every mode (in the FLAGS dictionary in the config file):

  • t2t_usr_dir: Path to the directory where my code resides. You don't have to change this, unless you rename the directory.
  • data_dir: The path to the directory where you want to generate the source and target pairs, and other data. The dataset will be downloaded one level higher from this directory into a raw_data folder.
  • problem: The name of a registered problem that tensor2tensor needs. The available problems are detailed in the Generate Data section below.

All paths should be relative to the root of the repo.
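
A minimal sketch of what these three entries might look like, assuming FLAGS is a plain Python dictionary. The key names come from the list above; all values and both helper functions are hypothetical and only illustrate the rules described (every flag must be set, and raw data lands in a raw_data folder one level above data_dir).

```python
import os

# Hypothetical values; only the three key names come from the README.
FLAGS = {
  "t2t_usr_dir": "t2t_csaky",
  "data_dir": "data_dir/example_run",
  "problem": "example_chatbot_problem",
}

REQUIRED_FLAGS = ("t2t_usr_dir", "data_dir", "problem")


def missing_flags(flags):
  """Return the required flag names that are absent or empty."""
  return [name for name in REQUIRED_FLAGS if not flags.get(name)]


def raw_data_dir(flags):
  """Raw downloads go into a raw_data folder one level above data_dir."""
  parent = os.path.dirname(os.path.normpath(flags["data_dir"]))
  return os.path.join(parent, "raw_data")
```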

Generate Data

This mode downloads and preprocesses the data and generates source and target pairs. Currently there are 6 registered problems that you can use in addition to the ones provided by tensor2tensor:

The PROBLEM_HPARAMS dictionary in the config file contains problem specific parameters that you can set before generating data:

  • num_train_shards/num_dev_shards: If you want the generated train or dev data to be sharded over several files.
  • vocabulary_size: Size of the vocabulary that we want to use for the problem. Words outside this vocabulary will be replaced with the <unk> token.
  • dataset_size: Number of utterance pairs to use, if we don't want the full dataset. A value of 0 means the full dataset.
  • dataset_split: Specify a train-val-test split for the problem.
  • dataset_version: This is only relevant to the opensubtitles dataset. Since there are several versions of this dataset, you can specify the year of the version that you want to download.
  • name_vocab_size: This is only relevant to the cornell problem with separate names. You can set the size of the vocabulary containing only the personas.
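
To make vocabulary_size and dataset_size concrete, here is a minimal sketch of the behaviour described above, not the repo's actual preprocessing code. The key names come from the list; the values, the helper functions, and the "<unk>" token name are assumptions for illustration.

```python
from collections import Counter

# Hypothetical values; only the key names come from the README.
PROBLEM_HPARAMS = {
  "num_train_shards": 1,
  "num_dev_shards": 1,
  "vocabulary_size": 2,
  "dataset_size": 0,  # 0 means: use the full dataset.
}

UNK = "<unk>"  # Assumed name of the out-of-vocabulary token.


def limit_pairs(pairs, dataset_size):
  """Keep the first dataset_size pairs, or all of them when it is 0."""
  return pairs if dataset_size == 0 else pairs[:dataset_size]


def build_vocab(pairs, vocabulary_size):
  """Return the vocabulary_size most frequent words across all utterances."""
  counts = Counter(
      word for source, target in pairs for word in (source + " " + target).split())
  return {word for word, _ in counts.most_common(vocabulary_size)}


def replace_oov(utterance, vocab):
  """Replace every word outside the vocabulary with the UNK token."""
  return " ".join(w if w in vocab else UNK for w in utterance.split())


pairs = [("hello there", "hello"), ("hello you", "there")]
pairs = limit_pairs(pairs, PROBLEM_HPARAMS["dataset_size"])
vocab = build_vocab(pairs, PROBLEM_HPARAMS["vocabulary_size"])
```

With this toy data the vocabulary keeps "hello" and "there", so replace_oov("hello you there", vocab) maps "you" to the unknown token.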

Train

This mode allows you to train a model with the specified problem and hyperparameters. The code just calls the tensor2tensor training script, so any model that is in tensor2tensor can be used. Besides these, there is also a subclassed model with small modifications:

  • gradient_checkpointed_seq2seq: A small modification of the LSTM-based seq2seq model, so that custom hparams can be used throughout. Before the softmax is calculated, the LSTM hidden units are projected to 2048 linear units as here. I also tried to add gradient checkpointing to this model, but it is currently taken out since it didn't give good results.

There are several additional flags that you can specify for a training run in the FLAGS dictionary in the config file, some of which are:

  • train_dir: Name of the directory where the training checkpoint files will be saved.
  • model: Name of the model: either one of the above or a tensor2tensor defined model.
  • hparams: Specify a registered hparams_set, or leave empty if you want to define hparams in the config file. In order to specify hparams for a seq2seq or transformer model, you can use the SEQ2SEQ_HPARAMS and TRANSFORMER_HPARAMS dictionaries in the config file (check it for more details).
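
Since training delegates to the tensor2tensor training script, the flags above roughly correspond to a t2t-trainer command line. The sketch below only builds such a command as a list of strings. The t2t-trainer flag names are tensor2tensor's own, but the build_trainer_command helper and the mapping of train_dir to --output_dir are assumptions about how a wrapper could do this, and the flag values are hypothetical.

```python
def build_trainer_command(flags):
  """Translate a FLAGS-style dictionary into a t2t-trainer argument list."""
  command = [
    "t2t-trainer",
    "--t2t_usr_dir=" + flags["t2t_usr_dir"],
    "--data_dir=" + flags["data_dir"],
    "--problem=" + flags["problem"],
    "--model=" + flags["model"],
    "--output_dir=" + flags["train_dir"],  # Assumed mapping of train_dir.
  ]
  # An empty hparams flag means the hparams are defined in the config file.
  if flags.get("hparams"):
    command.append("--hparams_set=" + flags["hparams"])
  return command


# Hypothetical flag values, for illustration only.
example_flags = {
  "t2t_usr_dir": "t2t_csaky",
  "data_dir": "data_dir/example_run",
  "problem": "example_chatbot_problem",
  "model": "transformer",
  "train_dir": "train_dir/example_run",
  "hparams": "transformer_base",
}
command = build_trainer_command(example_flags)
```

The resulting list could be passed to subprocess.run, or joined with spaces to inspect the command before launching a long training run.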

Decode

With this mode you can decode from the trained models. The following parameters affect the decoding (in the FLAGS dictionary in the config file):

  • decode_mode: Can be interactive, file, or dataset. In interactive mode you chat with the model on the command line. In file mode you specify a file with source utterances for which to generate responses. In dataset mode the provided validation data is randomly sampled and responses are output for the samples.
  • decode_dir: Directory where you place the file to decode from; the output responses are saved here as well.
  • input_file_name: Name of the file that you have to give in file mode (placed in the decode_dir).
  • output_file_name: Name of the file, inside decode_dir, where output responses will be saved.
  • beam_size: Size of the beam, when using beam search.
  • return_beams: If False, return only the top beam; otherwise return beam_size beams.
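
The interaction between beam_size and return_beams can be sketched as follows. This is an illustration of the described behaviour on already scored candidates, not the decoding code of the repo or of tensor2tensor; the select_beams helper and the example scores are hypothetical.

```python
def select_beams(scored_beams, beam_size, return_beams):
  """Pick responses from (response, score) tuples; higher scores are better."""
  ranked = sorted(scored_beams, key=lambda beam: beam[1], reverse=True)
  responses = [response for response, _ in ranked[:beam_size]]
  # return_beams=False keeps only the single best response.
  return responses if return_beams else responses[:1]


# Hypothetical log-probability scores for three candidate responses.
beams = [("i am fine", -1.2), ("fine", -0.4), ("ok", -2.0)]
top_only = select_beams(beams, beam_size=2, return_beams=False)
all_beams = select_beams(beams, beam_size=2, return_beams=True)
```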

Results & Examples

The following results are from these two papers.

Loss and Metrics of Transformer Trained on Cornell


TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. For an explanation of the metrics see the paper.

Responses from Transformer and Seq2seq Trained on Cornell and Opensubtitles

S2S is a simple seq2seq model with LSTMs trained on Cornell, others are Transformer models. Opensubtitles F is pre-trained on Opensubtitles and finetuned on Cornell.

Loss and Metrics of Transformer Trained on DailyDialog


TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. For an explanation of the metrics see the paper.

Responses from Transformer Trained on DailyDialog

Contributing

Check the issues for some additions where help is appreciated. Any contributions are welcome ❤️
Please try to follow the code syntax style used in the repo (flake8, 2 spaces indent, 80 char lines, commenting a lot, etc.)

New problems can be registered by subclassing WordChatbot or, even better, by subclassing CornellChatbotBasic or OpensubtitleChatbot, because they implement some additional functionality. Usually it's enough to override the preprocess and create_data functions. Check the documentation for more details and see daily_dialog_chatbot for an example.
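
The subclassing pattern can be sketched without the repo installed. The base class below is a local stand-in, not the real WordChatbot, and the method signatures are assumptions made for illustration; check the documentation for the actual arguments of preprocess and create_data.

```python
class WordChatbotStandIn:
  """Local stand-in for the repo's WordChatbot; the real API may differ."""

  def preprocess(self, raw_lines):
    raise NotImplementedError

  def create_data(self, utterances):
    raise NotImplementedError

  def generate_pairs(self, raw_lines):
    """Clean the raw lines, then build (source, target) pairs from them."""
    return self.create_data(self.preprocess(raw_lines))


class ToyDialogChatbot(WordChatbotStandIn):
  """Hypothetical problem that overrides only the two dataset-specific steps."""

  def preprocess(self, raw_lines):
    # Drop empty lines and normalize whitespace and case.
    return [line.strip().lower() for line in raw_lines if line.strip()]

  def create_data(self, utterances):
    # Each utterance is the source for the utterance that follows it.
    return list(zip(utterances[:-1], utterances[1:]))


toy_pairs = ToyDialogChatbot().generate_pairs(
    ["Hi!", "", " Hello. ", "How are you?"])
```

The base class owns the overall flow, so a new dataset only needs to supply its own cleaning and pairing logic.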

New models and hyperparameters can be added by following the tensor2tensor tutorial.

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use it in your work and consider citing the following paper:

@InProceedings{Csaky:2017,
  title = {Deep Learning Based Chatbot Models},
  author = {Csaky, Richard},
  year = {2019},
  publisher={National Scientific Students' Associations Conference},
  url ={https://tdk.bme.hu/VIK/DownloadPaper/asdad},
  note={https://tdk.bme.hu/VIK/DownloadPaper/asdad}
}