

Repository Details

Federated Learning Benchmark - Federated Learning on Non-IID Data Silos: An Experimental Study (ICDE 2022)

NIID-Bench

This is the code of the paper Federated Learning on Non-IID Data Silos: An Experimental Study.

This code runs a benchmark for federated learning algorithms under non-IID data distribution scenarios. Specifically, we implement 4 federated learning algorithms (FedAvg, FedProx, SCAFFOLD & FedNova), 3 types of non-IID settings (label distribution skew, feature distribution skew & quantity skew) and 9 datasets (MNIST, Cifar-10, Fashion-MNIST, SVHN, Generated 3D dataset, FEMNIST, adult, rcv1, covtype).

Updates on NIID-Bench

We implement partition.py to divide tabular datasets (CSV format) into multiple files using our non-IID partitioning strategies. The column Class in the header is recognized as the label. See a running example in partition_to_file.sh. The example dataset is Credit Card Fraud Detection.

To adapt partition.py to your own tabular dataset, you need the following steps:

  1. Load your own dataset into arrays, replacing Lines 117-126.
  2. The whole tabular dataset is stored in dataset, and the label column ID is stored in class_id. Change Line 130 to your own label identifier.
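
A minimal sketch of these two steps for a hypothetical CSV file is shown below; the file name, the pandas-based loading, and the variable assignments are assumptions, not the exact code at Lines 117-130 of partition.py.

import pandas as pd

# Hypothetical example: load your own CSV into arrays (stands in for Lines 117-126).
df = pd.read_csv('my_tabular_dataset.csv')
dataset = df.to_numpy()
# Label column ID (stands in for Line 130); the header column named Class is the label.
class_id = df.columns.get_loc('Class')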

If your dataset is an image dataset, partition.py is no longer applicable. You can refer to our function partition_data in utils.py. You need to design your own dataloader like Lines 183-198. For example, following load_mnist_data (Line 40), you need to write a dataloader that returns your dataset as a tuple (X_train, y_train, X_test, y_test). In terms of the dataloader format, you can refer to class MNIST_truncated (Line 60 in dataset.py). After you get (X_train, y_train, X_test, y_test), the partition_data function will return the net_dataidx_map.
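
A minimal sketch of such a dataloader follows. The function name, file paths, and .npz layout are assumptions; the only contract partition_data needs is that the loader returns (X_train, y_train, X_test, y_test).

import os
import numpy as np

def load_my_image_data(datadir):
    # Hypothetical layout: train.npz / test.npz holding 'images' and 'labels' arrays.
    train = np.load(os.path.join(datadir, 'train.npz'))
    test = np.load(os.path.join(datadir, 'test.npz'))
    X_train, y_train = train['images'], train['labels']
    X_test, y_test = test['images'], test['labels']
    return X_train, y_train, X_test, y_test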

To support more settings and facilitate future research, we now integrate MOON. We also add CIFAR-100 and Tiny-ImageNet.

Tiny-ImageNet

You can download Tiny-ImageNet here. Then, you can follow the instructions to reformat the validation folder.
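
One common way to do the reformatting, assuming the standard Tiny-ImageNet layout (val/images plus val/val_annotations.txt) and a hypothetical data directory, is sketched below; this is an assumption-based sketch, not the linked instructions themselves.

import os
import shutil

val_dir = './data/tiny-imagenet-200/val'   # assumed download location
with open(os.path.join(val_dir, 'val_annotations.txt')) as f:
    for line in f:
        # Each line starts with: <image file name> <tab> <class label>
        fname, label = line.split('\t')[:2]
        os.makedirs(os.path.join(val_dir, label, 'images'), exist_ok=True)
        shutil.move(os.path.join(val_dir, 'images', fname),
                    os.path.join(val_dir, label, 'images', fname))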

Non-IID Settings

Label Distribution Skew

  • Quantity-based label imbalance: each party owns data samples of a fixed number of labels.
  • Distribution-based label imbalance: each party is allocated a proportion of the samples of each label according to Dirichlet distribution.
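
A minimal sketch of the distribution-based setting (an illustration only, not the repository's exact partition_data code) is:

import numpy as np

def dirichlet_label_partition(y, n_parties, beta=0.5, seed=0):
    # Split each label's samples across parties with proportions drawn from Dirichlet(beta).
    rng = np.random.default_rng(seed)
    net_dataidx_map = {i: [] for i in range(n_parties)}
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(np.repeat(beta, n_parties))
        split_points = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for party, part in enumerate(np.split(idx, split_points)):
            net_dataidx_map[party].extend(part.tolist())
    return net_dataidx_map

Smaller beta concentrates each label on fewer parties, while larger beta approaches an even split.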

Feature Distribution Skew

  • Noise-based feature imbalance: We first divide the whole dataset into multiple parties randomly and equally. For each party, we add different levels of Gaussian noise (a minimal sketch follows after this list).
  • Synthetic feature imbalance: For the generated 3D dataset, we allocate two parts which are symmetric about (0,0,0) to a subset for each party.
  • Real-world feature imbalance: For FEMNIST, we divide and assign the writers (and their characters) into each party randomly and equally.
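
The noise-based setting can be sketched as below; the linear scaling of the noise level with the party ID is an assumption for illustration and may differ from the exact scaling used in the repository.

import numpy as np

def add_party_noise(X_party, party_id, n_parties, max_noise=0.1, seed=0):
    # Party i receives zero-mean Gaussian noise whose level grows with its ID, up to max_noise.
    rng = np.random.default_rng(seed + party_id)
    noise_level = max_noise * party_id / max(n_parties - 1, 1)
    return X_party + rng.normal(0.0, noise_level, size=X_party.shape)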

Quantity Skew

  • While the data distribution may still be consistent among the parties, the size of the local dataset varies according to a Dirichlet distribution.
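
A minimal sketch of such a quantity-skew split (illustrative only, not the repository's exact code) is:

import numpy as np

def quantity_skew_partition(n_samples, n_parties, beta=0.5, seed=0):
    # Data stays IID, but local dataset sizes follow proportions drawn from Dirichlet(beta).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    proportions = rng.dirichlet(np.repeat(beta, n_parties))
    split_points = (np.cumsum(proportions) * n_samples).astype(int)[:-1]
    return {i: part.tolist() for i, part in enumerate(np.split(idx, split_points))}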

Usage

Here is one example to run this code:

python experiments.py --model=simple-cnn \
    --dataset=cifar10 \
    --alg=fedprox \
    --lr=0.01 \
    --batch-size=64 \
    --epochs=10 \
    --n_parties=10 \
    --mu=0.01 \
    --rho=0.9 \
    --comm_round=50 \
    --partition=noniid-labeldir \
    --beta=0.5 \
    --device='cuda:0' \
    --datadir='./data/' \
    --logdir='./logs/' \
    --noise=0 \
    --sample=1 \
    --init_seed=0
Parameter    Description
model        The model architecture. Options: simple-cnn, vgg, resnet, mlp. Default = mlp.
dataset      Dataset to use. Options: mnist, cifar10, fmnist, svhn, generated, femnist, a9a, rcv1, covtype. Default = mnist.
alg          The training algorithm. Options: fedavg, fedprox, scaffold, fednova, moon. Default = fedavg.
lr           Learning rate for the local models, default = 0.01.
batch-size   Batch size, default = 64.
epochs       Number of local training epochs, default = 5.
n_parties    Number of parties, default = 2.
mu           The proximal term parameter for FedProx, default = 0.001.
rho          The parameter controlling the momentum SGD, default = 0.
comm_round   Number of communication rounds to use, default = 50.
partition    The partition way. Options: homo, noniid-labeldir, noniid-#label1 (or 2, 3, ..., which means the fixed number of labels each party owns), real, iid-diff-quantity. Default = homo.
beta         The concentration parameter of the Dirichlet distribution for heterogeneous partition, default = 0.5.
device       Specify the device to run the program, default = cuda:0.
datadir      The path of the dataset, default = ./data/.
logdir       The path to store the logs, default = ./logs/.
noise        Maximum variance of the Gaussian noise we add to each local party, default = 0.
sample       Ratio of parties that participate in each communication round, default = 1.
init_seed    The initial seed, default = 0.
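
To illustrate how the mu parameter enters FedProx, here is a minimal sketch of a proximal term added to the local objective; the function and variable names are assumptions and not the exact training loop in experiments.py.

import torch

def fedprox_loss(task_loss, local_model, global_params, mu=0.001):
    # FedProx adds (mu / 2) * ||w - w_global||^2 to the local task loss.
    prox = 0.0
    for w, w_global in zip(local_model.parameters(), global_params):
        prox = prox + ((w - w_global.detach()) ** 2).sum()
    return task_loss + (mu / 2.0) * prox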

Data Partition Map

You can call the function get_partition_dict() in experiments.py to access net_dataidx_map. net_dataidx_map is a dictionary: its keys are party IDs, and the value of each key is a list containing the indices of the data assigned to that party. For our experiments, we usually set init_seed=0. When we repeat experiments of some setting, we change init_seed to 1 or 2. The default value of noise is 0 unless stated. We list the way to get our data partitions as follows.

  • Quantity-based label imbalance: partition=noniid-#label1, noniid-#label2 or noniid-#label3
  • Distribution-based label imbalance: partition=noniid-labeldir, beta=0.5 or 0.1
  • Noise-based feature imbalance: partition=homo, noise=0.1 (actually noise does not affect net_dataidx_map)
  • Synthetic feature imbalance & Real-world feature imbalance: partition=real
  • Quantity Skew: partition=iid-diff-quantity, beta=0.5 or 0.1
  • IID Setting: partition=homo
  • Mixed skew: partition = mixed for mixture of distribution-based label imbalance and quantity skew; partition = noniid-labeldir and noise = 0.1 for mixture of distribution-based label imbalance and noise-based feature imbalance.

Here is an explanation of the parameters of the function get_partition_dict().

Parameter    Description
dataset      Dataset to use. Options: mnist, cifar10, fmnist, svhn, generated, femnist, a9a, rcv1, covtype.
partition    The partition way. Options: homo, noniid-labeldir, noniid-#label1 (or 2, 3, ..., which means the fixed number of labels each party owns), real, iid-diff-quantity.
n_parties    Number of parties.
init_seed    The initial seed.
datadir      The path of the dataset.
logdir       The path to store the logs.
beta         The concentration parameter of the Dirichlet distribution for heterogeneous partition.
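
As a usage sketch, a call with these parameters passed as keyword arguments might look like the following; the exact signature should be checked against experiments.py, so treat the argument names below as assumptions.

from experiments import get_partition_dict

# Assumed keyword-argument call matching the parameters documented above.
net_dataidx_map = get_partition_dict(
    dataset='cifar10',
    partition='noniid-labeldir',
    n_parties=10,
    init_seed=0,
    datadir='./data/',
    logdir='./logs/',
    beta=0.5,
)
party0_indices = net_dataidx_map[0]  # data indices assigned to party 0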

Leader Board

Note that the reported accuracy is the average of three experiments, while the training curve is based on only one experiment, so there may be some difference. We show the training curve to compare the convergence rates of different algorithms.

Quantity-based label imbalance

  • Cifar-10, 10 parties, sample rate = 1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
noniid-#label2                   simple-cnn   50      FedProx (mu=0.01)     50.7%
noniid-#label2                   simple-cnn   50      FedAvg                49.8%
noniid-#label2                   simple-cnn   50      SCAFFOLD              49.1%
noniid-#label2                   simple-cnn   50      FedNova               46.5%


  • Cifar-10, 100 parties, sample rate = 0.1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
noniid-#label2                   simple-cnn   500     FedNova               48.0%
noniid-#label2                   simple-cnn   500     FedAvg                45.3%
noniid-#label2                   simple-cnn   500     FedProx (mu=0.001)    39.3%
noniid-#label2                   simple-cnn   500     SCAFFOLD              10.0%


Distribution-based label imbalance

  • Cifar-10, 10 parties, sample rate = 1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
noniid-labeldir with beta=0.5    simple-cnn   50      SCAFFOLD              69.8%
noniid-labeldir with beta=0.5    simple-cnn   50      FedAvg                68.2%
noniid-labeldir with beta=0.5    simple-cnn   50      FedProx (mu=0.001)    67.9%
noniid-labeldir with beta=0.5    simple-cnn   50      FedNova               66.8%


Partition                        Model        Round   Algorithm             Accuracy
noniid-labeldir with beta=0.1    vgg          100     SCAFFOLD              85.5%
noniid-labeldir with beta=0.1    vgg          100     FedNova               84.4%
noniid-labeldir with beta=0.1    vgg          100     FedProx (mu=0.01)     84.4%
noniid-labeldir with beta=0.1    vgg          100     FedAvg                84.0%


  • Cifar-10, 100 parties, sample rate = 0.1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
noniid-labeldir with beta=0.5    simple-cnn   500     FedNova               60.0%
noniid-labeldir with beta=0.5    simple-cnn   500     FedAvg                59.4%
noniid-labeldir with beta=0.5    simple-cnn   500     FedProx (mu=0.001)    58.8%
noniid-labeldir with beta=0.5    simple-cnn   500     SCAFFOLD              10.0%


Noise-based feature imbalance

  • Cifar-10, 10 parties, sample rate = 1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
homo with noise=0.1              simple-cnn   50      SCAFFOLD              70.1%
homo with noise=0.1              simple-cnn   50      FedProx (mu=0.01)     69.3%
homo with noise=0.1              simple-cnn   50      FedAvg                68.9%
homo with noise=0.1              simple-cnn   50      FedNova               68.5%


Partition                        Model        Round   Algorithm             Accuracy
homo with noise=0.1              resnet       100     SCAFFOLD              90.2%
homo with noise=0.1              resnet       100     FedNova               89.4%
homo with noise=0.1              resnet       100     FedProx (mu=0.01)     89.2%
homo with noise=0.1              resnet       100     FedAvg                89.1%


Quantity Skew

  • Cifar-10, 10 parties, sample rate = 1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
iid-diff-quantity with beta=0.5  simple-cnn   50      FedAvg                72.0%
iid-diff-quantity with beta=0.5  simple-cnn   50      FedProx (mu=0.01)     71.2%
iid-diff-quantity with beta=0.5  simple-cnn   50      SCAFFOLD              62.4%
iid-diff-quantity with beta=0.5  simple-cnn   50      FedNova               10.0%


IID Setting

  • Cifar-10, 100 parties, sample rate = 0.1, batch size = 64, learning rate = 0.01
Partition                        Model        Round   Algorithm             Accuracy
homo                             simple-cnn   500     FedNova               66.1%
homo                             simple-cnn   500     FedProx (mu=0.01)     66.0%
homo                             simple-cnn   500     FedAvg                65.6%
homo                             simple-cnn   500     SCAFFOLD              10.0%


Citation

If you find this repository useful, please cite our paper:

@inproceedings{li2022federated,
      title={Federated Learning on Non-IID Data Silos: An Experimental Study},
      author={Li, Qinbin and Diao, Yiqun and Chen, Quan and He, Bingsheng},
      booktitle={IEEE International Conference on Data Engineering},
      year={2022}
}

More Repositories

  1. thundersvm - ThunderSVM: A Fast SVM Library on GPUs and CPUs (C++, 1,564 stars)
  2. thundergbm - ThunderGBM: Fast GBDTs and Random Forests on GPUs (C++, 692 stars)
  3. FedTree - A tree-based federated learning system (MLSys 2023) (C++, 142 stars)
  4. ThunderGP - HLS-based Graph Processing Framework on FPGAs (C++, 135 stars)
  5. Medusa - Medusa: Building GPU-based Parallel Sparse Graph Applications with Sequential C/C++ Code (Cuda, 61 stars)
  6. Awesome-Literature-ILoGs - Awesome literature on imbalanced learning on graphs (58 stars)
  7. G3 - G3: A Programmable GNN Training System on GPU (Cuda, 42 stars)
  8. briskstream - A Multicore, NUMA Optimised Data Stream Processing System (Java, 39 stars)
  9. PyOE - Python library for data stream learning (Python, 28 stars)
  10. ThunderRW - Source code of "ThunderRW: An In-Memory Graph Random Walk Engine" published in VLDB'2021 - By Shixuan Sun, Yuhang Chen, Shengliang Lu, Bingsheng He and Yuchen Li (C++, 26 stars)
  11. FedSim - A coupled vertical federated learning framework that boosts the model performance with record similarities (NeurIPS 2022) (Python, 23 stars)
  12. PrivML (20 stars)
  13. SOFF (Python, 19 stars)
  14. ConsisGAD (Python, 18 stars)
  15. SimFL - Practical Federated Gradient Boosting Decision Trees (AAAI 2020) (C++, 18 stars)
  16. ForkGraph (C++, 16 stars)
  17. ReGraph - Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines (C++, 16 stars)
  18. ThundeRiNG - Fast Multiple Independent Random Number Sequences Generation on FPGAs (C++, 14 stars)
  19. hacc_demo (Shell, 14 stars)
  20. FedOV - Towards Addressing Label Skews in One-Shot Federated Learning (ICLR 2023) (Python, 14 stars)
  21. Vine - Accelerating Exact Constrained Shortest Paths on GPUs (C++, 14 stars)
  22. PathEnum - Source code of "PathEnum: Towards Real-Time Hop-Constrained s-t Path Enumeration", published in SIGMOD'2021 - By Shixuan Sun, Yuhang Chen, Bingsheng He, and Bryan Hooi (C++, 14 stars)
  23. OEBench - OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams (VLDB 2024) (Python, 13 stars)
  24. VertiBench - Feature partitioner by imbalance or correlation (ICLR 2024) (Jupyter Notebook, 9 stars)
  25. omniDB - General query processing engine (C++, 7 stars)
  26. LightRW (C++, 6 stars)
  27. HashjoinOnHARP - The MAIN project of the paper "Is FPGA useful for Hash Joins?" (C++, 5 stars)
  28. PMP (Python, 5 stars)
  29. RUSH - A fast library for real-time burst subgraph detection (Python, 4 stars)
  30. On-the-fly-data-shuffling-for-OpenCL-based-FPGAs (JavaScript, 4 stars)
  31. DeltaBoost - GBDT-based model with efficient unlearning (SIGMOD 2023) (C++, 4 stars)
  32. ModelGo (TeX, 4 stars)
  33. Pyper (3 stars)
  34. KGraph - Concurrent Graph Query Processing with Memoization on Graph (3 stars)
  35. Awesome-Prompt-For-Research - Awesome prompts for computer science research including paper editing and code debugging (2 stars)
  36. Melia (C, 2 stars)
  37. Query_on_OpenCL_FPGA (C++, 1 star)
  38. FedGMA - Communication-Efficient Generalized Neuron Matching for Federated Learning (ICPP'23) (Python, 1 star)
  39. HashJoin_HMA - A hash join implementation optimized for many-core processors with die-stacked HBMs (C++, 1 star)
  40. Clementi - Clementi: A Scalable Multi-FPGA Graph Processing Framework (C++, 1 star)