  • Language
    Jupyter Notebook
  • License
    Apache License 2.0
  • Created about 7 years ago
  • Updated over 5 years ago

Repository Details

A deep learning approach to predicting breast tumor proliferation scores for the TUPAC16 challenge

Predicting Breast Cancer Proliferation Scores with TensorFlow, Keras, and Apache Spark

Note: This project is still a work in progress. There is also an experimental branch with additional files and experiments.

Overview

The Tumor Proliferation Assessment Challenge 2016 (TUPAC16) is a "Grand Challenge" that was created for the 2016 Medical Image Computing and Computer Assisted Intervention (MICCAI 2016) conference. In this challenge, the goal is to develop state-of-the-art algorithms for automatic prediction of tumor proliferation scores from whole-slide histopathology images of breast tumors.

Background

Breast cancer is the leading cause of cancer death in women in less-developed countries and the second leading cause of cancer death in developed countries, accounting for 29% of all cancers in women within the U.S. [1]. Survival rates increase as early detection increases, giving pathologists and the medical world at large an incentive to develop improved methods for even earlier detection [2]. There are many forms of breast cancer, including Ductal Carcinoma in Situ (DCIS), Invasive Ductal Carcinoma (IDC), Tubular Carcinoma of the Breast, Medullary Carcinoma of the Breast, Invasive Lobular Carcinoma, Inflammatory Breast Cancer, and several others [3]. Across all of these forms, the rate at which breast cancer cells grow (proliferation) is a strong indicator of a patient’s prognosis. Although there are many means of determining the presence of breast cancer, tumor proliferation speed has been proven to help pathologists determine the best treatment for the patient. The most common technique for determining the proliferation speed is the mitotic count (mitotic index) estimate, in which a pathologist counts the dividing cell nuclei in hematoxylin and eosin (H&E) stained slide preparations to determine the number of mitotic bodies. From this count, the pathologist produces a proliferation score of 1, 2, or 3, ranging from better to worse prognosis [4]. Unfortunately, this approach is known to have reproducibility problems due to variability in counting, as well as the difficulty of distinguishing between different grades.

References:
[1] http://emedicine.medscape.com/article/1947145-overview#a3
[2] http://emedicine.medscape.com/article/1947145-overview#a7
[3] http://emedicine.medscape.com/article/1954658-overview
[4] http://emedicine.medscape.com/article/1947145-workup#c12

Goal & Approach

In an effort to automate the process of classification, this project aims to develop a large-scale deep learning approach for predicting tumor scores directly from the pixels of whole-slide histopathology images (WSI). Our proposed approach is based on a recent research paper from Stanford [1]. Starting with 500 extremely high-resolution tumor slide images [2] with accompanying score labels, we use Apache Spark in a preprocessing step to cut and filter the images into smaller square samples, generating 4.7 million samples for a total of ~7TB of data [3]. We then use TensorFlow and Keras to train a deep convolutional neural network on these samples, applying transfer learning by fine-tuning a modified ResNet50 model [4]. Our model takes as input the pixel values of the individual samples, and is trained to predict the correct tumor score classification for each one. We also explore an alternative approach of first training a mitosis detection model [5] on an auxiliary mitosis dataset, and then applying it to the WSIs, based on an approach from Paeng et al. [6]. Ultimately, we aim to develop a model that outperforms existing approaches for the task of breast cancer tumor proliferation score classification.

References:
[1] https://web.stanford.edu/group/rubinlab/pubs/2243353.pdf
[2] http://tupac.tue-image.nl/node/3
[3] preprocess.py, breastcancer/preprocessing.py
[4] MachineLearning-Keras-ResNet50.ipynb
[5] preprocess_mitoses.py, train_mitoses.py
[6] https://arxiv.org/abs/1612.07180
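
The cut-and-filter step described above can be illustrated with a minimal, pure-Python sketch. This is not the code in preprocess.py or breastcancer/preprocessing.py; the function names, the grayscale tile representation, and the threshold values are all assumptions made for illustration. It enumerates the top-left corners of full square tiles over a slide of a given size, then keeps a tile only if enough of its pixels are darker than a near-white background.

```python
from typing import Iterator, List, Tuple


def tile_origins(width: int, height: int, tile_size: int,
                 overlap: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of full square tiles over a slide.

    Partial tiles at the right and bottom edges are dropped.
    """
    if tile_size <= 0 or not 0 <= overlap < tile_size:
        raise ValueError("need tile_size > 0 and 0 <= overlap < tile_size")
    stride = tile_size - overlap
    for y in range(0, height - tile_size + 1, stride):
        for x in range(0, width - tile_size + 1, stride):
            yield x, y


def tissue_fraction(tile: List[List[int]],
                    background_threshold: int = 220) -> float:
    """Fraction of grayscale pixels (0-255) darker than the background.

    The threshold of 220 is a hypothetical value, not taken from the project.
    """
    pixels = [p for row in tile for p in row]
    if not pixels:
        return 0.0
    return sum(p < background_threshold for p in pixels) / len(pixels)


def keep_tile(tile: List[List[int]], min_tissue: float = 0.9,
              background_threshold: int = 220) -> bool:
    """Keep a tile only if at least min_tissue of it looks like tissue."""
    return tissue_fraction(tile, background_threshold) >= min_tissue
```

In a Spark job, a function like keep_tile would typically be applied as a filter over a distributed collection of tiles, so that mostly-blank samples are discarded before training data is written out.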


Setup (All nodes unless otherwise specified):

  • System Packages:

    • openslide
  • Python packages:

    • Basics
      • pip3 install -U matplotlib numpy pandas scipy jupyter ipython scikit-learn scikit-image openslide-python
    • TensorFlow (only on driver):
      • pip3 install tensorflow-gpu (or pip3 install tensorflow for CPU-only)
    • Keras (bleeding-edge; only on driver):
      • pip3 install git+https://github.com/fchollet/keras.git
  • Spark 2.x (ideally bleeding-edge)

  • Add the following to the data folder (same location on all nodes):

    • training_image_data folder with the training slides.
    • testing_image_data folder with the testing slides.
    • training_ground_truth.csv file containing the tumor & molecular scores for each slide.
    • mitoses folder with the following from the mitosis detection auxiliary dataset:
      • mitoses_test_image_data folder with the folders of testing images
      • mitoses_train_image_data folder with the folders of training images
      • mitoses_train_ground_truth folder with the folders of training csv files
  • Layout:

    - MachineLearning-Keras-ResNet50.ipynb
    - breastcancer/
      - preprocessing.py
      - visualization.py
    - ...
    - data/
      - mitoses
        - mitoses_test_image_data
          - 01
            - 01.tif
          - 02
            - 01.tif
          ...
        - mitoses_train_ground_truth
          - 01
            - 01.csv
            - 02.csv
            ...
          - 02
            - 01.csv
            - 02.csv
            ...
          ...
        - mitoses_train_image_data
          - 01
            - 01.tif
            - 02.tif
            ...
          - 02
            - 01.tif
            - 02.tif
            ...
          ...
      - training_ground_truth.csv
      - training_image_data
        - TUPAC-TR-001.svs
        - TUPAC-TR-002.svs
        - ...
      - testing_image_data
        - TUPAC-TE-001.svs
        - TUPAC-TE-002.svs
        - ...
    - preprocess.py
    - preprocess_mitoses.py
    - train_mitoses.py
    
  • Adjust the Spark settings in $SPARK_HOME/conf/spark-defaults.conf using the following examples, depending on the job being executed:

    • All jobs:

      # Use most of the driver memory.
      spark.driver.memory 70g
      # Remove the max result size constraint.
      spark.driver.maxResultSize 0
      # Increase the message size.
      spark.rpc.message.maxSize 128
      # Extend the network timeout threshold.
      spark.network.timeout 1000s
      # Set up some extra Java options for performance.
      spark.driver.extraJavaOptions -server -Xmn12G
      spark.executor.extraJavaOptions -server -Xmn12G
      # Set up local directories on separate disks for intermediate read/write performance, if running
      # on Spark Standalone clusters.
      spark.local.dir /disk2/local,/disk3/local,/disk4/local,/disk5/local,/disk6/local,/disk7/local,/disk8/local,/disk9/local,/disk10/local,/disk11/local,/disk12/local
      
    • Preprocessing:

      # Save 1/2 executor memory for Python processes
      spark.executor.memory 50g
      
  • To execute the WSI preprocessing script, use spark-submit as follows (YARN in client mode also works, with --master yarn --deploy-mode client):

    PYSPARK_PYTHON=python3 spark-submit --master spark://MASTER_URL:7077 preprocess.py
    
  • To execute the mitoses preprocessing script, use the following:

    python3 preprocess_mitoses.py --help
    
  • To execute the mitoses training script, use the following:

    python3 train_mitoses.py --help
    
  • To use the Jupyter notebooks, start Jupyter as usual with jupyter notebook and run the desired notebook.
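
Because the scripts expect the data folder to have the same contents on every node, a quick pre-flight check can catch a missing folder before a long Spark job starts. The sketch below is not part of the repository; the function name missing_entries and the two constant lists are hypothetical, with the paths copied from the layout listed above.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the data folder layout described in the setup steps.
EXPECTED_DIRS = [
    "training_image_data",
    "testing_image_data",
    "mitoses/mitoses_test_image_data",
    "mitoses/mitoses_train_image_data",
    "mitoses/mitoses_train_ground_truth",
]
EXPECTED_FILES = ["training_ground_truth.csv"]


def missing_entries(data_dir: str) -> List[str]:
    """Return the expected folders and files that are absent under data_dir."""
    root = Path(data_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling missing_entries("data") from the repository root returns an empty list when the layout is complete, and otherwise lists the relative paths that still need to be added. It only checks that the top-level entries exist, not that the slides or CSV files inside them are valid.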

Create a Histopath slide “lab” to view the slides (just driver):

  • git clone https://github.com/openslide/openslide-python.git
  • Host locally:
    • python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 path/to/data/
  • Host on server:
    • python3 path/to/openslide-python/examples/deepzoom/deepzoom_multiserver.py -Q 100 -l HOSTING_URL_HERE path/to/data/
    • Open local browser to HOSTING_URL_HERE:5000.
