

A toolbox for differentially private data generation

Private Data Generation Toolbox

The goal of this toolbox is to make private generation of synthetic data samples accessible to machine learning practitioners. It currently implements 5 state-of-the-art generative models that can generate differentially private synthetic data. We evaluate the models on 4 public datasets from domains where privacy of sensitive data is paramount. Users can benchmark the models on the existing datasets, or feed in a new sensitive dataset and get back a synthetic dataset that can be distributed to third parties with strong differential privacy guarantees.

Models :

PATE-GAN : PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees. ICLR 2019

DP-WGAN : Implementation of private Wasserstein GAN using noisy gradient descent and moments accountant.

RON-GAUSS : Enhancing Utility in Non-Interactive Private Data Release, Proceedings on Privacy Enhancing Technologies (PETS), vol. 2019, no. 1, 2018

Private IMLE : Implementation of private Implicit Maximum Likelihood Estimation using noisy gradient descent and moments accountant.

Private PGM : Graphical-model based estimation and inference for differential privacy. Proceedings of the 36th International Conference on Machine Learning. 2019.

NOTE : Private IMLE code is released separately from this toolbox and can be found here : https://github.com/BorealisAI/IMLE. To run IMLE, do the following first:

git clone https://github.com/BorealisAI/IMLE.git  
cp -r IMLE <root>/models

Also make sure to follow the build instructions in <root>/models/IMLE/dci_code/Makefile

Dataset description :

Adult Census : The dataset comprises census attributes such as age, gender and native country, and the goal is to predict whether a person earns more than $50k a year. https://archive.ics.uci.edu/ml/datasets/adult

NHANES Diabetes : National Health and Nutrition Examination Survey (NHANES) questionnaire is used to predict the onset of type II diabetes. https://github.com/semerj/NHANES-diabetes/tree/master/data

Give Me Some Credit : Historical data are provided on 250,000 borrowers and the task is to help in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. https://www.kaggle.com/c/GiveMeSomeCredit/data

Home Credit Default Risk : Home Credit makes use of a variety of alternative data including telco and transactional information along with the client's past financial record to predict their clients' repayment abilities. https://www.kaggle.com/c/home-credit-default-risk/data

Adult Categorical : This dataset is the same as the Adult Census dataset, but the feature values for continuous attributes are put in buckets. We evaluate Private-PGM's performance on this dataset. https://github.com/ryan112358/private-pgm/tree/master/data

The datasets can be downloaded to the /data folder using download_datasets.sh and preprocessed using the scripts in the /preprocessing folder. Preprocessing is dataset-specific and mostly involves handling missing values, normalization, encoding of attribute values, and splitting the data into train and test sets.

Example :
sh download_datasets.sh adult
python preprocessing/preprocess_adult.py
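The kind of steps a preprocessing script performs can be sketched as below. This is a minimal illustration, not the code in preprocess_adult.py: the function name `preprocess`, the fixed fill value and the user-supplied per-column bounds are all assumptions made for the example. Using a fixed fill value and fixed bounds keeps these steps independent of the data.

```python
import random

def preprocess(rows, bounds, default=0.0, test_fraction=0.2, seed=0):
    """Hypothetical helper: fill missing values, scale to [0, 1], split train/test.

    rows   : list of rows, each a list of floats (None marks a missing value)
    bounds : list of (low, high) pairs, one per column, chosen by the user
    """
    cleaned = []
    for row in rows:
        new_row = []
        for value, (low, high) in zip(row, bounds):
            if value is None:
                value = default                     # fixed, data-independent fill
            value = min(max(value, low), high)      # clip to the given bounds
            new_row.append((value - low) / (high - low))  # min-max scaling
        cleaned.append(new_row)

    rng = random.Random(seed)
    rng.shuffle(cleaned)                            # shuffle before splitting
    n_test = int(len(cleaned) * test_fraction)
    return cleaned[n_test:], cleaned[:n_test]       # (train, test)


rows = [[1.0, None], [3.0, 10.0], [5.0, 30.0], [2.0, 4.0], [0.0, 0.0]]
train, test = preprocess(rows, bounds=[(0.0, 4.0), (0.0, 20.0)])
print(len(train), len(test))
```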

Downstream classifiers :

Classifiers used are Logistic Regression, Multi-layer Perceptron, Gaussian Naive Bayes, Random Forests and Gradient Boosting with default settings from sklearn.

Data Format :

The data needs to be in csv format and has to be partitioned into train and test sets before feeding it to the models. The generative models are learned using the training data. The downstream classifiers are trained using either the real train data or synthetic data generated by the models. The classifiers are evaluated on the held-out test data.

Currently only two attribute types are supported :

  1. All attributes are continuous : supported models are ron-gauss, pate-gan, dp-wgan, imle

  2. All attributes are categorical : supported model is private-pgm . The categorical attribute values should be between 0 and max_category - 1.

In case the data has both kinds of attributes, it needs to be pre-processed (discretization for continuous values / encoding for categorical attributes) to use one of the models. Missing values are not supported and need to be replaced appropriately by the user before usage.
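The two conversions mentioned above can be sketched as follows. Both helper names are hypothetical and not part of the toolbox; the bucket range is assumed to be supplied by the user, and the encoder assumes the set of category values is not itself sensitive.

```python
def encode_categories(values):
    """Map distinct values to integer codes 0 .. max_category - 1."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def discretize(values, low, high, num_buckets):
    """Put continuous values into equal-width buckets over [low, high]."""
    width = (high - low) / num_buckets
    buckets = []
    for v in values:
        v = min(max(v, low), high)                    # clip to the range
        index = int((v - low) / width)
        buckets.append(min(index, num_buckets - 1))   # v == high goes to the last bucket
    return buckets


encoded, codebook = encode_categories(["b", "a", "b", "c"])
print(encoded)                                 # [1, 0, 1, 2]
print(discretize([17, 35, 100], 0, 100, 10))   # [1, 3, 9]
```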

NOTE : Some imputation methods compute statistics using other data samples to fill missing values. Care needs to be taken to make the computed statistics differentially private and the cost must be added to the generative modeling privacy cost to compute the total privacy cost.
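As an illustration of the note above, here is a minimal sketch of a differentially private mean that could be used as a fill value, together with the bookkeeping for the total cost. It assumes the Laplace mechanism, a publicly known dataset size, user-chosen clipping bounds and basic sequential composition (epsilons and deltas add up); the function names are hypothetical and tighter accounting methods exist.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_mean(values, low, high, epsilon, rng):
    """Epsilon-DP mean of the observed values, clipped to [low, high]."""
    clipped = [min(max(v, low), high) for v in values]
    # Replacing one record changes the mean by at most (high - low) / n.
    sensitivity = (high - low) / len(clipped)
    return sum(clipped) / len(clipped) + laplace_noise(sensitivity / epsilon, rng)

def total_privacy_cost(costs):
    """Basic sequential composition over (epsilon, delta) pairs."""
    return sum(e for e, _ in costs), sum(d for _, d in costs)


rng = random.Random(0)
fill_value = private_mean([23.0, 41.0, 35.0, 58.0], low=0.0, high=100.0,
                          epsilon=0.5, rng=rng)
# Imputation cost (0.5, 0) plus generative modeling cost (5, 1e-5)
print(total_privacy_cost([(0.5, 0.0), (5.0, 1e-5)]))
```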

The first line of the csv data file is assumed to contain the column names and the target column (labels) needs to be specified using the --target-variable flag when running the evaluation script as shown below.
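The expected file layout can be illustrated with a short sketch that reads a csv with a header row and separates the target column from the features. The helper name and the sample column names are made up for the example; this is not how evaluate.py itself loads the data.

```python
import csv
import io

def split_features_and_target(csv_text, target_variable):
    """Read csv text with a header row; return (feature_names, features, labels)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        labels.append(float(row.pop(target_variable)))      # the target column
        features.append([float(v) for v in row.values()])   # remaining columns
    feature_names = [n for n in reader.fieldnames if n != target_variable]
    return feature_names, features, labels


sample = "age,hours,income\n0.3,0.5,1\n0.7,0.2,0\n"
names, X, y = split_features_and_target(sample, "income")
print(names, X, y)
```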

How to:

python evaluate.py --target-variable=<> --train-data-path=<> --test-data-path=<> <model_name> --enable-privacy --target-epsilon=5 --target-delta=1e-5

Model names can be real-data, pate-gan, dp-wgan, ron-gauss, imle or private-pgm.
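To benchmark several models in a row, the command line above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the toolbox; it only uses the flags shown in this README and keeps the same argument order (general flags, model name, privacy flags).

```python
def build_evaluate_command(model_name, target_variable, train_path, test_path,
                           target_epsilon=None, target_delta=None):
    """Build the argument list for evaluate.py (hypothetical helper)."""
    cmd = [
        "python", "evaluate.py",
        f"--target-variable={target_variable}",
        f"--train-data-path={train_path}",
        f"--test-data-path={test_path}",
        model_name,
    ]
    if target_epsilon is not None:
        # Privacy flags follow the model name, as in the command above.
        cmd += ["--enable-privacy", f"--target-epsilon={target_epsilon}"]
        if target_delta is not None:
            cmd.append(f"--target-delta={target_delta}")
    return cmd


for model in ["pate-gan", "ron-gauss"]:
    print(" ".join(build_evaluate_command(
        model, "income",
        "./data/adult_processed_train.csv", "./data/adult_processed_test.csv",
        target_epsilon=5, target_delta=1e-5)))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root.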

Example:

After preprocessing the Adult data using preprocess_adult.py, we can train a differentially private Wasserstein GAN on it and evaluate the quality of the synthetic dataset using the command below :

python evaluate.py --target-variable='income' --train-data-path=./data/adult_processed_train.csv --test-data-path=./data/adult_processed_test.csv --normalize-data dp-wgan --enable-privacy --sigma=0.8 --target-epsilon=8

Example Output:

AUC scores of downstream classifiers on test data :
----------------------------------------
LR: 0.7411981709396546
----------------------------------------
Random Forest: 0.7540559254517339
----------------------------------------
Neural Network: 0.7311882809628891
----------------------------------------
GaussianNB: 0.7580265076488256
----------------------------------------
GradientBoostingClassifier: 0.747129484720164
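For readers unfamiliar with the metric, the AUC reported above is the probability that a classifier scores a randomly chosen positive example higher than a randomly chosen negative one. A minimal pure-Python sketch of that definition follows (the function name is illustrative; in practice sklearn.metrics.roc_auc_score computes the same quantity more efficiently):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```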

Synthetic data can be saved in the /data folder using the flag --save-synthetic

Some useful user args:

General args:

--downstream-task : classification or regression

--normalize-data : Apply sigmoid function to each value in the data

--categorical : If all attributes of the data are categorical

--target-variable : Attribute name denoting the target
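As a reference for what --normalize-data does to each value, here is a small sketch of the sigmoid (logistic) function, which maps any real number into the interval (0, 1). The numerically stable form shown here is a choice made for the example and may differ from the toolbox's own implementation.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)          # x < 0, so exp(x) cannot overflow
    return z / (1 + z)


row = [-3.0, 0.0, 2.5]
print([sigmoid(v) for v in row])
```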

Privacy args:

--enable-privacy : Enables private data generation. Non-private mode can only be used for DP-WGAN and IMLE.

--target-epsilon : epsilon parameter of differential privacy

--target-delta : delta parameter of differential privacy

For more details refer to https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf
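For quick reference, the standard definition behind the two parameters is: a randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if, for all datasets $D$ and $D'$ differing in a single record and for every set of outputs $S$,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller values of $\varepsilon$ and $\delta$ give a stronger guarantee.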

Noisy gradient descent args:

--sigma : Gaussian noise variance multiplier. For the same privacy budget, a larger sigma lets the model train for more epochs

--clip-coeff : The coefficient to clip the gradients to before adding noise for private SGD training

--micro-batch-size : Parameter to trade off speed vs efficiency. Gradients are averaged over a micro-batch and then clipped before noise is added
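How these three arguments interact can be sketched with one noisy gradient step in the style of standard DP-SGD. This is an illustration under assumptions, not the toolbox's training loop: the function name is hypothetical, gradients are plain lists of floats, clipping uses the L2 norm, and the noise standard deviation is taken to be sigma times the clipping coefficient.

```python
import math
import random

def noisy_gradient(per_example_grads, clip_coeff, sigma, micro_batch_size, rng):
    """Average per micro-batch, clip, add Gaussian noise, then average."""
    # 1. Average the gradients inside each micro-batch.
    micro = []
    for i in range(0, len(per_example_grads), micro_batch_size):
        chunk = per_example_grads[i:i + micro_batch_size]
        micro.append([sum(col) / len(chunk) for col in zip(*chunk)])

    # 2. Clip each micro-batch gradient to L2 norm at most clip_coeff.
    clipped = []
    for g in micro:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_coeff / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])

    # 3. Sum, add noise with std sigma * clip_coeff per coordinate, average.
    dim = len(clipped[0])
    total = [sum(g[j] for g in clipped) for j in range(dim)]
    return [(t + rng.gauss(0.0, sigma * clip_coeff)) / len(clipped) for t in total]


grads = [[3.0, 4.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
print(noisy_gradient(grads, clip_coeff=1.0, sigma=0.8, micro_batch_size=2,
                     rng=random.Random(0)))
```

A micro-batch size of 1 clips every example separately, while larger micro-batches mean fewer clipping operations per step.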

Model specific args:

PATE-GAN:

--lap-scale : Inverse Laplace noise scale multiplier. A larger lap_scale will reduce the noise that is added per iteration of training

--num-teachers : Number of teacher discriminators

--teacher-iters : Teacher iterations during training per generator iteration

--student-iters : Student iterations during training per generator iteration

--num-moments : Number of higher moments to use for epsilon calculation
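The role of --lap-scale and --num-teachers can be illustrated with the noisy vote aggregation used in PATE-style training: each teacher discriminator votes real (1) or fake (0) on a sample, Laplace noise is added to both vote counts, and the student sees only the noisy majority. The sketch below assumes the noise scale is 1 / lap_scale, as the description of --lap-scale suggests; the function name is hypothetical and the toolbox's own code may differ in detail.

```python
import random

def noisy_teacher_vote(teacher_votes, lap_scale, rng):
    """Return the noisy majority label (0 or 1) of the teachers' votes."""
    def laplace(scale):
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

    votes_real = sum(teacher_votes)
    votes_fake = len(teacher_votes) - votes_real
    noisy_real = votes_real + laplace(1 / lap_scale)   # larger lap_scale, less noise
    noisy_fake = votes_fake + laplace(1 / lap_scale)
    return 1 if noisy_real > noisy_fake else 0


rng = random.Random(0)
print(noisy_teacher_vote([1, 1, 1, 0, 0, 1, 0, 1, 1, 1], lap_scale=0.1, rng=rng))
```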

IMLE:

--decay-step : Learning rate decay step

--decay-rate : Learning rate decay rate

--staleness : Number of iterations after which new synthetic samples are generated

--num-samples-factor : Number of synthetic samples generated per real data point
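For context on --staleness and --num-samples-factor: IMLE draws a pool of synthetic samples (here num_samples_factor per real data point), matches every real point to its nearest synthetic sample, and trains the generator to shrink those distances; the pool is redrawn every `staleness` iterations. The matching step can be sketched as below. This brute-force search with a hypothetical function name is only an illustration; the released IMLE code uses its own nearest-neighbour search (see dci_code).

```python
def nearest_sample_indices(real_points, synthetic_samples):
    """For each real point, return the index of its closest synthetic sample."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(synthetic_samples)),
            key=lambda j: squared_distance(point, synthetic_samples[j]))
        for point in real_points
    ]


real = [[0.0, 0.0], [1.0, 1.0]]
synthetic = [[0.9, 1.1], [0.1, -0.1], [5.0, 5.0]]
print(nearest_sample_indices(real, synthetic))  # [1, 0]
```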

DP-WGAN:

--clamp-lower : Lower clamp parameter for the weights of the NN in Wasserstein GAN

--clamp-upper : Upper clamp parameter for the weights of the NN in Wasserstein GAN
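Weight clamping in a Wasserstein GAN restricts every critic weight to a fixed interval after each update. A minimal sketch on a plain list of weights is shown below; the function name and the example bounds are illustrative, and with PyTorch tensors the same effect is obtained with clamp_ on each parameter.

```python
def clamp_weights(weights, clamp_lower, clamp_upper):
    """Restrict each weight to the interval [clamp_lower, clamp_upper]."""
    return [min(max(w, clamp_lower), clamp_upper) for w in weights]


print(clamp_weights([-0.5, 0.003, 0.2], clamp_lower=-0.01, clamp_upper=0.01))
# [-0.01, 0.003, 0.01]
```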
