Benchmarking on MIMIC-III Dataset
Reference
Sanjay Purushotham*, Chuizheng Meng*, Zhengping Che, and Yan Liu. "Benchmarking Deep Learning Models on Large Healthcare Datasets." Journal of Biomedical Informatics (JBI). 2018.
An earlier version is available as an arXiv preprint (arXiv:1710.08531).
Requirements
Database
You must have the database MIMIC-III running on your local machine or on a server. To construct the database locally, see the official guidance.
You need to fill in the credentials required to connect to the database:
cp preprocessing/config/connection_template.json preprocessing/config/connection.json
# Fill all keys listed under "mimiciii" in connection.json.
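As a quick sanity check that the credentials work, here is a minimal connection sketch. It assumes MIMIC-III is hosted in PostgreSQL with the standard mimiciii schema and that psycopg2 is installed; the key names read from connection.json below are placeholders, so use whatever connection_template.json actually lists.

```python
# Minimal connection check (illustrative sketch; the key names under "mimiciii"
# are placeholders -- follow connection_template.json for the real ones).
import json
import psycopg2  # assumes the database is hosted in PostgreSQL

with open("preprocessing/config/connection.json") as f:
    cfg = json.load(f)["mimiciii"]

conn = psycopg2.connect(
    host=cfg.get("host", "localhost"),
    port=cfg.get("port", 5432),
    dbname=cfg.get("database", "mimic"),
    user=cfg.get("user"),
    password=cfg.get("password"),
)
with conn.cursor() as cur:
    # The mimiciii schema and admissions table come from the standard MIMIC-III build.
    cur.execute("SELECT COUNT(*) FROM mimiciii.admissions;")
    print("admissions rows:", cur.fetchone()[0])
conn.close()
```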
Packages
- For data preparation and SuperLearner:
  - Run bash install.sh.
- For Feedforward Network and Feedforward Network+LSTM:
- Anaconda 2==4.4.0
- Theano==0.9.0
- Keras==2.0.6
Prepare data for benchmarking
Generate input files
python -m preprocessing.preprocess --cachedir data --num_workers <number of processes>
The preprocessing should finish within 1 day with --num_workers 4. All input files are stored under data/ and require 36 GB of disk space.
We also provide the processed data upon request (containing only the files necessary for training the models below). Please contact Chuizheng Meng ([email protected]) for the processed data.
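The .npz input files referenced by the commands below can be inspected directly with NumPy, for example (a sketch; the path shown is hypothetical and the stored keys depend on the preprocessing output):

```python
# Peek at one of the generated .npz input files (illustrative; adjust the path
# to wherever the preprocessing wrote its output on your machine).
import numpy as np

path = "data/admdata_17f/24hrs/series/imputed-normed-ep_1_24.npz"  # hypothetical path
data = np.load(path)  # add allow_pickle=True on newer NumPy if object arrays are stored
for key in data.files:
    arr = data[key]
    print(key, getattr(arr, "shape", type(arr)))
```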
Evaluate performance
SuperLearner (Python version)
By default, the path to the main program of SuperLearner (Python version) is [RD]/Codes/SuperLearnerPyVer/python/superlearner_pyver.py.
Use the following command to run SuperLearner (Python version):
python [path to the main program('superlearner_pyver.py') of SuperLearner(Python version)] [path to 'non_series' folder of a dataset] [path to 'series' folder of a dataset] [task name, 'mor'/'los'/'icd9'] [rank of label] [name of subset, 'all'/'cv'/'mv'] [method name, 'sl1'/'sl2']
For example, to run SuperLearner-II with 17 raw features from the MIMIC-III full dataset on the length-of-stay prediction task:
python superlearner_pyver.py ../../../Data/admdata_17f/24hrs_raw/non_series ../../../Data/admdata_17f/24hrs_raw/series los 0 all sl2
The result is saved as pyslresults-[task name]-[subset name]-[method name].npz under the 'non_series' folder of the dataset. You can use 13_metrics_from_saved_results.ipynb to calculate the metrics.
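If you prefer a quick scripted check instead of the notebook, the saved results file can be loaded directly. This is only a sketch: the array keys used below are assumptions, so print results.files first to see what the file actually contains.

```python
# Quick scripted look at a saved result file (illustrative; the key names are
# hypothetical -- inspect `results.files` before relying on them).
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

results = np.load("pyslresults-mor-all-sl2.npz")  # e.g., a mortality run
print("stored arrays:", results.files)

# Hypothetical keys for a binary task: true labels and predicted scores.
if "y_true" in results.files and "y_score" in results.files:
    y_true, y_score = results["y_true"], results["y_score"]
    print("AUROC:", roc_auc_score(y_true, y_score))
    print("AUPRC:", average_precision_score(y_true, y_score))
```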
FFN: Feedforward Network
By default, the path to the main program is [RD]/Codes/DeepLearningModels/python/betterlearner.py.
Use the following command to run the feedforward network on a specific dataset with fine-tuned hyperparameters:
python [path to the main program('betterlearner.py')] [name of dataset] [task name] 2 [name of imputed data] [name of fold data] [name of stats of imputed data] --label_type [label type] --static_features_path [path to static features, 'input.csv'] --static_hidden_dim [2048 for 136 raw features, do not set this for other feature sets] --static_ffn_depth 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate 0.001
For example, to run the Feedforward Network with 136 raw features from the MIMIC-III full dataset on the in-hospital mortality prediction task:
python betterlearner.py mimic3_99p_raw_24h mor 2 imputed-normed-ep_1_24.npz 5-folds.npz imputed-normed-ep_1_24-stdized.npz --label_type 0 --static_features_path ../../../Data/admdata_99p/24hrs_raw/non_series/input.csv --static_hidden_dim 2048 --static_ffn_depth 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate 0.001
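For orientation, here is a rough Keras 2.0.x sketch of the kind of model those flags describe. It is not the code in betterlearner.py; the layer widths simply mirror --static_hidden_dim 2048 and --static_ffn_depth 2, and ModelCheckpoint stands in for the 'True_BestWeight' early-stopping option.

```python
# Illustrative only: a feedforward classifier roughly matching the flags above.
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

def build_ffn(input_dim, hidden_dim=2048, depth=2, n_classes=2):
    """Stack of `depth` dense + batch-norm blocks followed by a softmax output."""
    model = Sequential()
    model.add(Dense(hidden_dim, activation="relu", input_dim=input_dim))
    model.add(BatchNormalization())
    for _ in range(depth - 1):
        model.add(Dense(hidden_dim, activation="relu"))
        model.add(BatchNormalization())
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer=Adam(lr=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Callbacks approximating --early_stopping True_BestWeight / --early_stopping_patience 20:
callbacks = [
    EarlyStopping(patience=20),
    ModelCheckpoint("ffn_best_weights.h5", save_best_only=True),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           batch_size=100, epochs=250, callbacks=callbacks)
```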
LSTM: LSTM only
By default, the path to the main program is [RD]/Codes/DeepLearningModels/python/betterlearner.py.
Use the following command to run the LSTM on a specific dataset with fine-tuned hyperparameters:
python [path to the main program('betterlearner.py')] [name of dataset] [task name] 1 [name of imputed data] [name of fold data] [name of stats of imputed data] --label_type 0 --without_static --output_dim 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate [0.001 for mortality and icd9 prediction and 0.005 for length of stay prediction] --dropout 0.1
For example, to run the LSTM with 17 processed features from the MIMIC-III full dataset on the in-hospital mortality prediction task:
python betterlearner.py mimic3_17f_24h mor 1 imputed-normed-ep_1_24.npz 5-folds.npz imputed-normed-ep_1_24-stdized.npz --label_type 0 --without_static --output_dim 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate 0.001 --dropout 0.1
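Similarly, a rough sketch of an LSTM-only classifier with the hyperparameters above (again not the repository implementation; the hidden size, sequence length, and feature count are placeholders):

```python
# Illustrative only: an LSTM classifier over the temporal features alone.
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.optimizers import Adam

def build_lstm(timesteps, n_features, n_classes=2):
    model = Sequential()
    # Hidden size 64 is a placeholder; dropout mirrors --dropout 0.1.
    model.add(LSTM(64, input_shape=(timesteps, n_features), dropout=0.1))
    model.add(Dense(n_classes, activation="softmax"))
    model.compile(optimizer=Adam(lr=0.001),  # 0.005 for length-of-stay prediction
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```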
MMDL: Feedforward Network+LSTM
By default, the path to the main program is [RD]/Codes/DeepLearningModels/python/betterlearner.py.
Use the following command to run Feedforward Network+LSTM on a specific dataset with fine-tuned hyperparameters:
python [path to the main program('betterlearner.py')] [name of dataset] [task name] 1 [name of imputed data] [name of fold data] [name of stats of imputed data] --label_type 0 --ffn_depth 1 --merge_depth 0 --output_dim 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate [0.001 for mortality and icd9 prediction and 0.005 for length of stay prediction] --dropout 0.1
For example, to run Feedforward Network+LSTM with 17 processed features from the MIMIC-III full dataset on the in-hospital mortality prediction task:
python betterlearner.py mimic3_17f_24h mor 1 imputed-normed-ep_1_24.npz 5-folds.npz imputed-normed-ep_1_24-stdized.npz --label_type 0 --ffn_depth 1 --merge_depth 0 --output_dim 2 --batch_size 100 --nb_epoch 250 --early_stopping True_BestWeight --early_stopping_patience 20 --batch_normalization True --learning_rate 0.001 --dropout 0.1
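Finally, a rough functional-API sketch of how a feedforward branch over static features and an LSTM branch over the time series can be merged into a single model, which is the structure the MMDL flags suggest (illustrative only; shapes, hidden sizes, and the merge point are assumptions, not the betterlearner.py implementation):

```python
# Illustrative only: merge a static-feature FFN branch with an LSTM branch.
from keras.models import Model
from keras.layers import Input, Dense, LSTM, concatenate
from keras.optimizers import Adam

def build_mmdl(timesteps, n_temporal, n_static, n_classes=2):
    # Temporal branch: LSTM over the time-series features (dropout mirrors --dropout 0.1).
    seq_in = Input(shape=(timesteps, n_temporal))
    seq_feat = LSTM(64, dropout=0.1)(seq_in)           # hidden size is a placeholder

    # Static branch: one dense layer, mirroring --ffn_depth 1.
    static_in = Input(shape=(n_static,))
    static_feat = Dense(64, activation="relu")(static_in)

    # Merge and classify (--merge_depth 0: no extra dense layers after the merge).
    merged = concatenate([seq_feat, static_feat])
    out = Dense(n_classes, activation="softmax")(merged)

    model = Model(inputs=[seq_in, static_in], outputs=out)
    model.compile(optimizer=Adam(lr=0.001),  # 0.005 for length-of-stay prediction
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```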
Score methods (SAPS-II, Modified SAPS-II, and SOFA)
Run 13_get_score-results_firstXhrs_17-features-processed.ipynb to calculate the metrics for the score methods.