  • Language
    Python
  • License
    MIT License
  • Created almost 7 years ago
  • Updated over 5 years ago

Repository Details

Recurrent (conditional) generative adversarial networks for generating real-valued time series data.

RGAN

This repository contains code for the paper, Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs, by Stephanie L. Hyland* (@corcra), Cristóbal Esteban* (@cresteban), and Gunnar Rätsch (@ratsch), from the Ratschlab, also known as the Biomedical Informatics Group at ETH Zurich.

*Contributed equally, can't decide on name ordering

Paper Overview

Idea: As the title suggests, use generative adversarial networks (GANs) to generate real-valued time series for medical purposes. The GAN is called RGAN because it uses recurrent neural networks (specifically LSTMs) for both the generator and the discriminator.

What does this have to do with medicine?

We aim to generate time series from ICU patients, using the open-access eICU dataset. However, we also generate some non-medical time-series, like sine waves and smooth functions sampled from Gaussian Processes, and MNIST digits (imagined as a time series).
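
To make the toy data concrete, here is a minimal sketch of sampling sine waves with random frequency, amplitude and phase. The function name `sample_sine_waves`, the parameter ranges and the sequence length are illustrative assumptions, not the settings used in this repository or in the paper.

```python
import math
import random


def sample_sine_waves(num_samples, seq_length=30, freq_range=(1.0, 5.0),
                      amp_range=(0.1, 0.9), seed=None):
    """Return num_samples sine waves, each a list of seq_length floats.

    Hypothetical helper: the ranges and defaults are assumptions chosen
    for illustration only.
    """
    rng = random.Random(seed)
    waves = []
    for _ in range(num_samples):
        freq = rng.uniform(*freq_range)         # cycles over the whole sequence
        amp = rng.uniform(*amp_range)           # peak amplitude
        phase = rng.uniform(-math.pi, math.pi)  # random phase offset
        wave = [amp * math.sin(2 * math.pi * freq * (t / seq_length) + phase)
                for t in range(seq_length)]
        waves.append(wave)
    return waves
```

Calling `sample_sine_waves(4, seq_length=10, seed=0)` returns four sequences of ten values each, every value bounded in magnitude by the largest allowed amplitude.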

Why generate data at all?

Sharing medical data is hard, because it comes from real people, and is naturally highly sensitive (not to mention legally protected). One workaround for this difficulty would be to create sufficiently realistic synthetic data. This synthetic data could then be used to reproducibly develop and train machine learning models, enabling better science, and ultimately better models for medicine.

When is data 'sufficiently realistic'?

We claim in this paper that synthetic data is useful when it can be used to train a model that performs well on real data. So, as a measure of the quality of the data, we use the performance of a classifier trained on the synthetic data and then tested on real data. We call this the "TSTR score" (train on synthetic, test on real). This is a way of evaluating the output of a GAN without relying on human perceptual judgements of individual samples.
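
The TSTR idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a nearest-centroid classifier as a stand-in for whatever classifier is actually trained, and plain accuracy as the score. The function names (`fit_centroids`, `predict`, `tstr_score`) are hypothetical and do not come from this repository.

```python
def fit_centroids(samples, labels):
    """Compute the mean vector of each class (a nearest-centroid classifier)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))


def tstr_score(synthetic_x, synthetic_y, real_x, real_y):
    """Train on synthetic data, test on real data, and return the accuracy."""
    centroids = fit_centroids(synthetic_x, synthetic_y)
    correct = sum(predict(centroids, x) == y for x, y in zip(real_x, real_y))
    return correct / len(real_y)
```

Here each time series is treated as a flat list of floats. A high score suggests that the synthetic data carries enough of the real data's class structure to train a useful model.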

Differential privacy

We also include the case where the GAN is trained in a differentially private manner, to provide stronger privacy guarantees for the training data. We mostly just use the differentially private SGD optimiser and the moments accountant from here (with some minor modifications).
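
As a rough illustration of what a differentially private SGD step involves, the sketch below clips each per-example gradient to a fixed L2 norm and adds Gaussian noise before averaging. This is a generic, simplified version of the technique and not the optimiser used in this repository. The names `clip_gradient` and `dp_sgd_step` and all parameters are hypothetical, and the moments accountant (which tracks the privacy budget spent) is not shown.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Apply one noisy SGD update computed from per-example gradients."""
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    batch_size = len(clipped)
    noisy_mean = []
    for i in range(len(params)):
        total = sum(g[i] for g in clipped)
        # Gaussian noise with std noise_multiplier * clip_norm on the summed gradient.
        total += rng.gauss(0.0, noise_multiplier * clip_norm)
        noisy_mean.append(total / batch_size)
    return [p - lr * g for p, g in zip(params, noisy_mean)]
```

Clipping bounds how much any single training example can influence the update, and the noise scale is tied to that bound. Larger values of `noise_multiplier` give stronger privacy at the cost of noisier training.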

Code Quickstart

Primary dependencies: tensorflow, scipy, numpy, pandas

Note: This code is written in Python3!

Simplest route to running code (Linux/Mac):

git clone git@github.com:ratschlab/RGAN.git
cd RGAN
python experiment.py --settings_file test

Note: the test settings file is a dummy to demonstrate which options exist, and may not produce reasonable looking output.

Expected Directory Structure

See the directories in this folder: https://github.com/ratschlab/RGAN/tree/master/experiments

Files in this Repository

The main script is experiment.py - this parses many options, loads and preprocesses data as needed, trains a model, and does evaluation. It does this by calling on some helper scripts:

  • data_utils.py: utilities pertaining to data: generating toy data (e.g. sine waves, GP samples), loading MNIST and eICU data, doing test/train split, normalising data, generating synthetic data to use in TSTR experiments
  • model.py: the TensorFlow core of the project. Defines the generator and discriminator, the update steps, and functions for sampling from the model and 'inverting' points to find their latent-space representations
  • plotting.py: visualisation scripts using matplotlib
  • mmd.py: for maximum-mean discrepancy calculations, mostly taken from https://github.com/dougalsutherland/opt-mmd
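
For readers unfamiliar with maximum mean discrepancy, the following is a minimal sketch of the biased estimate of squared MMD between two sample sets, using a Gaussian (RBF) kernel. It is not the code in mmd.py. The function names and the fixed `bandwidth` default are assumptions made for illustration, and each time series is treated as a flat vector.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    k_xx = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf_kernel(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf_kernel(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy
```

The estimate is close to zero when the two sample sets look alike under the kernel and grows as they move apart, which is why it can serve as a distance between real and generated data.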

Other scripts in the repo:

  • eICU_synthetic_dataset_generation.py: essentially self-contained script for training the RCGAN to generate synthetic eICU data
  • eICU_task.py: script to help identify a doable task in eICU and generate the training data - feel free to experiment with different, harder tasks!
  • eICU_tstr_evaluation.py: for running the TSTR evaluation using pre-generated synthetic dataset
  • eugenium_mmd.py: code for doing MMD 3-sample tests, from https://github.com/eugenium/mmd
  • eval.py: functions for evaluating the RGAN/generated data, like testing if the RGAN has memorised the training data, comparing two models, getting reconstruction errors, and generating data for visualisations of things like varying the latent dimensions and interpolating between input samples
  • mod_core_rnn_cell_impl.py: a modified copy of the TensorFlow script of the same name, changed to allow us to initialise the bias in the LSTM (required for saving/loading models)
  • kernel.py: some playing around with kernels on time series
  • tf_ops.py: required by eugenium_mmd.py

There are plenty of functions in many of these files that weren't used for the manuscript.

Command line options

TODO

Data sources

MNIST

Get MNIST as CSVs here: https://pjreddie.com/projects/mnist-in-csv/

eICU

eICU is access-restricted, and must be applied for. For more information: http://eicu-crd.mit.edu/about/eicu/

TODO: describe how we preprocess eICU/upload script for doing it
