MedPerf

MedPerf is an open benchmarking platform for medical artificial intelligence using Federated Evaluation.

What's included here

Inside this repo you can find all the important pieces for running MedPerf. In its current state, it includes:

  • MedPerf Server:

    Backend server implemented in Django. It can be found inside the server folder.
  • MedPerf CLI:

    Command Line Interface for interacting with the server. It can be found inside the cli folder. A hypothetical example of talking to the server's API is sketched after this list.
  • Results from our Pilots:

    The examples folder contains the results from the pilot studies described below.
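
For orientation, here is a minimal sketch of how a client could query the backend over HTTP. The base URL, port, /benchmarks/ path, and token header are assumptions for illustration, not the documented API; consult the server's own API reference for the real routes and authentication scheme.

    # A minimal sketch of querying a locally running MedPerf server.
    # BASE_URL, the /benchmarks/ endpoint, and the token header are
    # assumptions for illustration; check the server's API docs for
    # the real interface before using this.
    import requests

    BASE_URL = "http://localhost:8000"  # assumed Django dev-server default

    def list_benchmarks(token=None):
        """Fetch benchmarks from a hypothetical /benchmarks/ endpoint."""
        headers = {"Authorization": f"Token {token}"} if token else {}
        resp = requests.get(f"{BASE_URL}/benchmarks/", headers=headers, timeout=10)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for benchmark in list_benchmarks():
            print(benchmark.get("id"), benchmark.get("name"))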

Publications

If you use MedPerf, please cite our main paper: Karargyris, A., Umeton, R., Sheller, M.J. et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nature Machine Intelligence 5, 799–810 (2023). https://www.nature.com/articles/s42256-023-00652-2

Additionally, here you can see how others have already used MedPerf: https://scholar.google.com/scholar?q="medperf".

Experiments

FeTS Challenge

🎯 We pressure-tested the MedPerf open benchmarking framework by supporting the largest real-world federated study to date, the FeTS challenge, which benchmarked 41 medical AI models across 32 treatment and research sites on 6 continents 🌎, on both 🔓 public and 🔒 private data, through medical research efforts across Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health, University of Pennsylvania, Penn Medicine, University of Pennsylvania Health System, University of Strasbourg, Institute of Image-Guided Surgery (IHU Strasbourg), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, and others, together with the pilot studies below.


Pilot Studies

MedPerf was also further utilized to support academic medical research on both public and private data through efforts across Dana-Farber Cancer Institute, Harvard T.H. Chan School of Public Health, University of Pennsylvania, Penn Medicine, University of Pennsylvania Health System, University of Strasbourg, Institute of Image-Guided Surgery (IHU Strasbourg), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, University of California San Francisco, and other academic institutions. The figure below displays the data provider locations used in all pilot experiments. 🟢: Pilot 1 - Brain Tumor Segmentation; 🔴: Pilot 2 - Pancreas Segmentation; 🔵: Pilot 3 - Surgical Workflow Phase Recognition. Pilot 4 (Cloud Experiments) used data and processes from Pilots 1 and 2.

[Figure: data provider locations across all pilot experiments]

Pilot 1 - Brain Tumor Segmentation

Participating institutions

  • University of Pennsylvania, Philadelphia, USA
  • Perelman School of Medicine, Philadelphia, USA
  • University of California, San Francisco, San Francisco, CA, USA

Task

Gliomas are highly heterogeneous across their molecular, phenotypical, and radiological landscape. Their radiological appearance is described by different sub-regions comprising 1) the “enhancing tumor” (ET), 2) the gross tumor, also known as the “tumor core” (TC), and 3) the complete tumor extent, also referred to as the “whole tumor” (WT). ET is described by areas that show hyper-intensity in T1Gd when compared to T1, but also when compared to “healthy” white matter in T1Gd. The TC describes the bulk of the tumor, which is what is typically resected. The TC entails the ET, as well as the necrotic (fluid-filled) parts of the tumor. The appearance of the necrotic (NCR) tumor core is typically hypo-intense in T1Gd when compared to T1. The WT describes the complete extent of the disease, as it entails the TC and the peritumoral edematous/invaded tissue (ED), which is typically depicted by hyper-intense signal in T2-FLAIR. These scans, with accompanying labels for these sub-regions manually approved by expert neuroradiologists, are provided in the International Brain Tumor Segmentation (BraTS) challenge data.
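
Because the sub-regions nest (TC = ET + NCR, WT = TC + ED), all three masks can be derived from a single label map. Below is a minimal sketch assuming the common BraTS label convention (1 = NCR, 2 = ED, 4 = ET); this encoding is an assumption here, so verify it against your own data.

    # A minimal sketch of deriving the nested BraTS sub-region masks from
    # a label map. The label values (1 = NCR, 2 = ED, 4 = ET) follow the
    # common BraTS convention but are an assumption; confirm them first.
    import numpy as np

    def brats_subregions(label_map):
        et = label_map == 4        # enhancing tumor
        ncr = label_map == 1       # necrotic tumor core
        ed = label_map == 2        # peritumoral edematous/invaded tissue
        tc = et | ncr              # tumor core = ET + NCR
        wt = tc | ed               # whole tumor = TC + ED
        return {"ET": et, "TC": tc, "WT": wt}

    # Toy 1-D "volume" with all three labels present.
    toy = np.array([0, 1, 2, 4, 4, 0])
    masks = brats_subregions(toy)
    assert masks["TC"].sum() == 3 and masks["WT"].sum() == 4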

Data

The BraTS 2020 challenge dataset is a retrospective collection of 2,640 brain glioma multi-parametric magnetic resonance imaging (mpMRI) scans, from 660 patients, acquired at 23 geographically-distinct institutions under routine clinical conditions, i.e., with varying equipment and acquisition protocols. The exact mpMRI scans included in the BraTS challenge dataset are a) native (T1), b) post-contrast T1-weighted (T1Gd), c) T2-weighted (T2), and d) T2-weighted Fluid Attenuated Inversion Recovery (T2-FLAIR). Notably, the BraTS 2020 dataset was utilized in the first ever federated learning challenge, namely the Federated Tumor Segmentation (FeTS) 2021 challenge (https://miccai.fets.ai/), which ran in conjunction with the Medical Image Computing and Computer Assisted Interventions (MICCAI) conference. Standardized pre-processing has been applied to all the BraTS mpMRI scans. This includes conversion of the DICOM files to the NIfTI file format, co-registration to the same anatomical template (SRI24), resampling to a uniform isotropic resolution (1 mm³), and finally skull-stripping. The pre-processing pipeline is publicly available through the Cancer Imaging Phenomics Toolkit (CaPTk).
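
As an illustration of one of these steps, here is a minimal sketch of resampling a volume to 1 mm isotropic resolution with SimpleITK. This is a stand-in under stated assumptions, not a reproduction of the actual CaPTk pipeline.

    # A minimal sketch of one pre-processing step: resampling an MRI volume
    # to 1 mm isotropic resolution with SimpleITK. Illustrative only; the
    # BraTS data were processed with the CaPTk pipeline, not this code.
    import SimpleITK as sitk

    def resample_isotropic(image, spacing_mm=1.0):
        old_spacing = image.GetSpacing()
        old_size = image.GetSize()
        # Preserve the physical extent: new_size = old_size * old_spacing / new_spacing
        new_size = [int(round(sz * sp / spacing_mm))
                    for sz, sp in zip(old_size, old_spacing)]
        return sitk.Resample(
            image, new_size, sitk.Transform(), sitk.sitkLinear,
            image.GetOrigin(), (spacing_mm,) * 3, image.GetDirection(),
            0, image.GetPixelID(),
        )

    # Usage (the path is hypothetical):
    # iso = resample_isotropic(sitk.ReadImage("scan_t1.nii.gz"))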

Code

github.com/mlcommons/medperf/tree/main/examples/BraTS

Pilot 2 - Pancreas Segmentation

Participating institutions

  • Harvard T.H. Chan School of Public Health, Boston, USA
  • Dana-Farber Cancer Institute, Boston, USA

Task

Precise organ segmentation using computed tomography (CT) images is an important step for medical image analysis and treatment planning. Pancreas segmentation poses particular challenges due to the small volume and irregular shape of the region of interest. Our goal is to perform federated evaluation across two different sites using MedPerf for the task of pancreas segmentation, to test the generalizability of a model trained on only one of these sites.
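
Segmentation quality in studies like this is commonly summarized with the Dice similarity coefficient; a minimal sketch follows. Using Dice is an illustrative assumption on our part, since the text does not state which metrics the pilot reported.

    # A minimal sketch of the Dice similarity coefficient, a standard
    # overlap metric for segmentation. The metric choice is an assumption;
    # the pilot's exact reported metrics are not specified here.
    import numpy as np

    def dice(pred, truth, eps=1e-8):
        pred, truth = pred.astype(bool), truth.astype(bool)
        intersection = np.logical_and(pred, truth).sum()
        return 2.0 * intersection / (pred.sum() + truth.sum() + eps)

    # Perfect overlap -> ~1.0; disjoint masks -> 0.0.
    assert dice(np.ones((4, 4)), np.ones((4, 4))) > 0.999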

Data

We utilized two separate datasets for this pilot experiment. The first is the Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) dataset, which is publicly available through the Synapse platform. Its abdominal CT images, from 50 metastatic liver cancer patients and post-operative ventral hernia patients, were acquired at the Vanderbilt University Medical Center and registered using NiftyReg. In addition to the BTCV dataset, we also included another publicly available dataset from TCIA (The Cancer Imaging Archive) (https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT). The National Institutes of Health Clinical Center curated this dataset of 80 abdominal scans from 53 male and 27 female subjects: 17 patients were kidney donors whose scans confirmed healthy abdominal regions, and the remaining patients were selected after examination confirmed that they had neither pancreatic lesions nor any other significant abdominal abnormalities. Our visual inspection of both datasets confirmed that they were appropriate and complementary candidates for the pancreas segmentation task. Out of the 130 total cases, we used 10 subjects for inference, 5 from each dataset; we arbitrarily chose the patients with IDs 1 to 5 from each one.

Code

github.com/mlcommons/medperf/tree/main/examples/DFCI

Pilot 3 - Surgical Workflow Phase Recognition

Participating institutions

  • University Hospital of Strasbourg, France
  • Policlinico Universitario Agostino Gemelli, Rome, Italy
  • Azienda Ospedaliero-Universitaria Sant’Andrea, Rome, Italy
  • Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico, Milan, Italy
  • Monaldi Hospital, Naples, Italy

Task

Surgical phase recognition is the task of classifying each video frame from a recorded surgery into one of a set of predefined phases that give a coarse description of the surgical workflow. This task is a building block for context-aware systems that assist surgeons and improve Operating Room (OR) safety.

Data

The data we used corresponds to Multichole2022, a multicentric dataset generated by the CAMMA research group at IHU Strasbourg/University of Strasbourg, comprising videos of recorded laparoscopic cholecystectomy surgeries annotated for the task of surgical phase recognition. The dataset consists of 180 videos in total, of which 56 were used in our pilot experiment and the remaining 124 were used to train the model. The videos were taken from five (5) different hospitals: 32 videos from the University Hospital of Strasbourg, France, which are part of the public Cholec80 dataset, and 6 videos from each of the following Italian hospitals: Policlinico Universitario Agostino Gemelli, Rome; Azienda Ospedaliero-Universitaria Sant’Andrea, Rome; Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico, Milan; and Monaldi Hospital, Naples. The data is still private for now. Videos are annotated according to the Multichole2022 annotation protocol, with 6 surgical phases: Preparation, Hepatocytic Triangle Dissection, Clipping and Cutting, Gallbladder Dissection, Gallbladder Packaging, and Cleaning/Coagulation.
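
To make the evaluation concrete, here is a minimal sketch of frame-level phase labels and a per-video accuracy metric for these 6 phases. The integer encoding and the choice of plain frame accuracy are illustrative assumptions, not the pilot's documented setup.

    # A minimal sketch of frame-level labels for the 6 Multichole2022
    # phases and a simple frame-accuracy metric. The integer encoding and
    # the metric are illustrative assumptions.
    from enum import IntEnum
    import numpy as np

    class Phase(IntEnum):
        PREPARATION = 0
        HEPATOCYTIC_TRIANGLE_DISSECTION = 1
        CLIPPING_AND_CUTTING = 2
        GALLBLADDER_DISSECTION = 3
        GALLBLADDER_PACKAGING = 4
        CLEANING_COAGULATION = 5

    def frame_accuracy(pred, truth):
        """Fraction of frames whose predicted phase matches the annotation."""
        return float((np.asarray(pred) == np.asarray(truth)).mean())

    # Toy example: 5 frames, one misclassified.
    truth = [Phase.PREPARATION, Phase.PREPARATION, Phase.CLIPPING_AND_CUTTING,
             Phase.CLIPPING_AND_CUTTING, Phase.GALLBLADDER_DISSECTION]
    pred = [Phase.PREPARATION, Phase.CLIPPING_AND_CUTTING, Phase.CLIPPING_AND_CUTTING,
            Phase.CLIPPING_AND_CUTTING, Phase.GALLBLADDER_DISSECTION]
    assert abs(frame_accuracy(pred, truth) - 0.8) < 1e-9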

Code

github.com/mlcommons/medperf/tree/main/examples/SurgMLCube

Pilot 4 - Cloud Experiments

Task

We proceeded to further validate MedPerf on the cloud. Toward this, we executed various parts of the MedPerf architecture across different cloud providers. Google Cloud Platform (GCP) was used across all experiments for hosting the server. The Brain Tumor Segmentation (BraTS) Benchmark (Pilot 1), as well as part of the Pancreas Segmentation Benchmark (Pilot 2), were executed inside a GCP Virtual Machine with 128 GB of RAM and an NVIDIA T4 GPU. Lastly, we created a Chest X-Ray Pathology Classification Benchmark to demonstrate the feasibility of running federated evaluation across different cloud providers. For this, the CheXpert small validation dataset was partitioned into 4 splits, which were executed inside Virtual Machines provided by AWS, Alibaba, Azure, and IBM. All results were retrieved by the MedPerf server, hosted on GCP. The figure below shows which cloud provider each MedPerf component (i.e., server, client) and dataset was executed on.
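
As an illustration of the split step, here is a minimal sketch of partitioning a set of studies into 4 roughly equal shards, one per provider. The shard count matches the text; the seed and study IDs are hypothetical, and the actual split procedure is not specified.

    # A minimal sketch of partitioning a validation set into 4 shards, one
    # per cloud provider. The seed and study IDs are hypothetical; the
    # actual split procedure used in the experiment is not specified.
    import random

    def partition(items, n_splits=4, seed=0):
        shuffled = list(items)
        random.Random(seed).shuffle(shuffled)
        return [shuffled[i::n_splits] for i in range(n_splits)]

    studies = [f"study_{i:03d}" for i in range(200)]  # hypothetical IDs
    shards = partition(studies)
    assert sum(len(s) for s in shards) == len(studies)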

Data

Here we used data and processes from Pilots 1 and 2.

Code

github.com/mlcommons/medperf/tree/main/examples/ChestXRay

Architecture

[Architecture diagram]

How to Use MedPerf

Get Started with our hands-on tutorials.

Contributing to the Documentation

If you wish to contribute to our documentation, here are the steps for successfully building and serving documentation locally:

  • Install dependencies: We are currently using mkdocs for serving documentation. You can install all the requirements by running the following command:
    pip install -r docs/requirements.txt
    
  • Serve local documentation: To run your own local server for visualizing documentation, run:
    mkdocs serve
    
  • Access local documentation: Once mkdocs is done setting up the server, you should be able to access your local documentation website by heading to http://localhost:8000 in your browser.
