deepDiagnosis
A Torch package for learning diagnosis models from temporal patient data.
For more details, please see:
Narges Razavian, Jake Marcus, David Sontag, "Multi-task Prediction of Disease Onsets from Longitudinal Lab Tests", Machine Learning and Healthcare, 2016.
Narges Razavian, David Sontag, "Temporal Convolutional Neural Networks for Diagnosis from Lab Tests", ICLR 2016 Workshop track.
#Installation:
The package has the following dependencies:
Lua: Torch, cunn, nn, cutorch, gnuplot, optim, and rnn
#Usage:
Run the following steps in order. Dataset creation for the train, test, and valid tasks is independent, so those steps can be run in parallel if you prefer.
There are sample input files (./sample_python_data) that you can use to test the package first.
1) python create_torch_tensors.py --x sample_python_data/xtrain.pkl --y sample_python_data/ytrain.pkl --task 'train' --outdir ./sampledata/
2) python create_torch_tensors.py --x sample_python_data/xtest.pkl --y sample_python_data/ytest.pkl --task 'test' --outdir ./sampledata/
3) python create_torch_tensors.py --x sample_python_data/xvalid.pkl --y sample_python_data/yvalid.pkl --task 'valid' --outdir ./sampledata/
4) th create_batches.lua --task=train --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
5) th create_batches.lua --task=valid --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
6) th create_batches.lua --task=scoretrain --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
7) th create_batches.lua --task=test --input_dir=./sampledata --batch_output_dir=./sampleBatchDir
8) th train_and_validate.lua --task=train --input_batch_dir=./sampleBatchDir --save_models_dir=./sample_models/
Once the model is trained, run the following to get the final evaluation on the test set. Replace "lstm2016_05_29_10_11_01" with the model directory created in step 8; training directories are named with a timestamp.
9) th train_and_validate.lua --task=test --validation_dir=./sample_models/lstm2016_05_29_10_11_01/
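The nine steps above can also be assembled programmatically. The following is a minimal sketch, not part of the package: the function names `build_pipeline`, `latest_model_dir`, and `build_test_command` are hypothetical, and the paths are the sample paths used in the steps above. It only builds the command lines; picking the newest subdirectory by modification time is an assumption about how you might locate the timestamped model directory.

```python
import os
import shlex

SPLITS = ["train", "test", "valid"]                    # steps 1-3
BATCH_TASKS = ["train", "valid", "scoretrain", "test"]  # steps 4-7


def build_pipeline(py_data="sample_python_data", tensor_dir="./sampledata/",
                   batch_dir="./sampleBatchDir", models_dir="./sample_models/"):
    """Return the commands for steps 1-8 as argument lists, in order."""
    cmds = []
    # Steps 1-3: convert the pickled numpy arrays to torch tensors, one call per split.
    for split in SPLITS:
        cmds.append(["python", "create_torch_tensors.py",
                     "--x", f"{py_data}/x{split}.pkl",
                     "--y", f"{py_data}/y{split}.pkl",
                     "--task", split, "--outdir", tensor_dir])
    # Steps 4-7: create batches for each task.
    for task in BATCH_TASKS:
        cmds.append(["th", "create_batches.lua", f"--task={task}",
                     f"--input_dir={tensor_dir.rstrip('/')}",
                     f"--batch_output_dir={batch_dir}"])
    # Step 8: train and validate.
    cmds.append(["th", "train_and_validate.lua", "--task=train",
                 f"--input_batch_dir={batch_dir}",
                 f"--save_models_dir={models_dir}"])
    return cmds


def latest_model_dir(models_dir):
    """Hypothetical helper: return the most recently modified subdirectory."""
    subdirs = [os.path.join(models_dir, d) for d in os.listdir(models_dir)
               if os.path.isdir(os.path.join(models_dir, d))]
    if not subdirs:
        raise FileNotFoundError(f"no model directories found in {models_dir}")
    return max(subdirs, key=os.path.getmtime)


def build_test_command(model_dir):
    """Return the command for step 9, given the directory created in step 8."""
    return ["th", "train_and_validate.lua", "--task=test",
            f"--validation_dir={model_dir.rstrip('/')}/"]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in build_pipeline():
        print(shlex.join(cmd))
```

To execute a command rather than print it, each argument list can be passed to `subprocess.run(cmd, check=True)`.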
Read the following for details on how to define your cohort and task.
#Input:
Input should be in one of the formats described below:
Format 1) Python numpy arrays (cPickle is also supported) of size
xtrain, xvalid, xtest: |labs| x |people| x |cohort time| for creating the input batches
ytrain, yvalid, ytest: |diseases| x |people| x |cohort time| for creating the output batches and inclusion/exclusion for each batch member
Format 2) Python numpy arrays (cPickle is also supported) of size
xtrain, xvalid, xtest: |labs| x |people| x |cohort time| for the input
ytrain, yvalid, ytest: |diseases| x |people| for the output, where we do not have a concept of time.
(Note that in format 2 you can also provide exclusion-per-disease for input. If you need that version, let me know and I'll update that part immediately.)
Format 3) advanced shelve databases, for our internal use.
Please refer to https://github.com/clinicalml/ckd_progression for details.
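The shape requirements of Formats 1 and 2 above can be checked before running create_torch_tensors.py. The sketch below is illustrative only: `detect_format` is a hypothetical helper, not part of the package, and it assumes that x and y must agree on the |people| dimension (and, for Format 1, on |cohort time|), as the notation above suggests.

```python
def detect_format(x_shape, y_shape):
    """Return 1 or 2 for Format 1 / Format 2; raise ValueError otherwise.

    x_shape: (labs, people, cohort_time)
    y_shape: (diseases, people, cohort_time) for Format 1,
             (diseases, people) for Format 2.
    """
    if len(x_shape) != 3:
        raise ValueError(f"x must be 3-D (labs, people, time), got {x_shape}")
    _, n_people, n_time = x_shape
    # Format 1: y carries a time axis that matches x.
    if len(y_shape) == 3 and tuple(y_shape[1:]) == (n_people, n_time):
        return 1
    # Format 2: y has one label per disease and person, with no time axis.
    if len(y_shape) == 2 and y_shape[1] == n_people:
        return 2
    raise ValueError(f"y shape {y_shape} does not match x shape {x_shape}")
```

With numpy arrays loaded in memory, the call would be `detect_format(xtrain.shape, ytrain.shape)`.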
#Prediction Models:
The following models are currently supported. The architectures are described in the papers cited below.
- Logistic Regression (--model=max_logit)
- Feedforward network (--model=mlp)
- Temporal Convolutional neural network over a backward window (--model=convnet)
- Convolutional neural network over input and time dimension (--model=convnet_mix)
- Multi-resolution temporal convolutional neural network (--model=multiresconvnet)
- LSTM network over the backward window (--model=lstmlast) (note: a version --model=lstmall is also available but we found training with lstmlast gives better results)
- Ensemble of multiple models (to be added soon)
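As a small illustration of selecting one of the models listed above, the sketch below builds a training command with a `--model` value. It assumes that `--model` is passed to train_and_validate.lua alongside the flags from step 8, which this README does not state explicitly; `SUPPORTED_MODELS` and `train_command` are hypothetical names.

```python
# Model names taken from the list above; descriptions are paraphrased.
SUPPORTED_MODELS = {
    "max_logit": "Logistic regression",
    "mlp": "Feedforward network",
    "convnet": "Temporal convolutional network over a backward window",
    "convnet_mix": "Convolutional network over input and time dimensions",
    "multiresconvnet": "Multi-resolution temporal convolutional network",
    "lstmlast": "LSTM over the backward window",
    "lstmall": "LSTM variant (lstmlast is reported to train better)",
}


def train_command(model, batch_dir="./sampleBatchDir", models_dir="./sample_models/"):
    """Build the step-8 training command for the chosen model."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(
            f"unknown model {model!r}; choose from {sorted(SUPPORTED_MODELS)}")
    # Assumption: --model is accepted by train_and_validate.lua.
    return ["th", "train_and_validate.lua", "--task=train",
            f"--model={model}",
            f"--input_batch_dir={batch_dir}",
            f"--save_models_dir={models_dir}"]
```

Rejecting unknown names up front avoids launching a long training run with a mistyped model.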
#Synthetic Input for testing the package
You can use the following to create synthetic numpy arrays to test the package:
python create_synthetic_data.py --outdir ./sample_python_data --N 6000 --D 15 --T 48 --O 20
This code will create 3 datasets (train, test, valid) in the ./sample_python_data directory, with dimensions of 15 x 2000 x 48 for each input x (xtrain, xtest, xvalid) and 20 x 2000 x 48 for each outcome set y. This synthetic data corresponds to Format 1 above. Follow steps 1-9 in the Usage section above to test with this data, and feel free to test with other synthetic datasets.
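The arithmetic behind these dimensions can be sketched as follows. This is not the code in create_synthetic_data.py; it assumes that `--N` people are split evenly across the three datasets and that `--D`, `--T`, and `--O` give the number of labs, time steps, and outcomes. `expected_shapes` is a hypothetical helper.

```python
SPLITS = ("train", "test", "valid")


def expected_shapes(n_people, n_labs, n_time, n_outcomes):
    """Return the expected array shape for each x/y dataset (Format 1 layout)."""
    if n_people % len(SPLITS) != 0:
        raise ValueError("n_people must divide evenly across train/test/valid")
    per_split = n_people // len(SPLITS)  # assumption: an even three-way split
    shapes = {}
    for split in SPLITS:
        shapes[f"x{split}"] = (n_labs, per_split, n_time)
        shapes[f"y{split}"] = (n_outcomes, per_split, n_time)
    return shapes


if __name__ == "__main__":
    # Values from the command above: --N 6000 --D 15 --T 48 --O 20
    for name, shape in expected_shapes(6000, 15, 48, 20).items():
        print(name, shape)
```

Comparing these shapes with the `.shape` of the generated arrays is a quick sanity check before running steps 1-9.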
#Citation:
@article{razavian2016temporal,
title={Multi-task Prediction of Disease Onsets from Longitudinal Lab Tests},
author={Razavian, Narges and Marcus, Jake and Sontag, David},
journal={1st Conference on Machine Learning and Health Care (MLHC)},
year={2016}
}
@article{razavian2015temporal,
title={Temporal Convolutional Neural Networks for Diagnosis from Lab Tests},
author={Razavian, Narges and Sontag, David},
journal={arXiv preprint arXiv:1511.07938},
year={2015}
}
#Bug reports, questions, and Contact:
For any questions, please contact Narges Razavian: [email protected] or https://github.com/narges-rzv/