SIIM-ACR Pneumothorax Segmentation
If you use this code in research, please cite the following paper:
@misc{Aimoldin2019,
author = {Aimoldin Anuar},
title = {{SIIM-ACR} {P}neumothorax {S}egmentation},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/sneddy/pneumothorax-segmentation}},
}
First place solution
Video with short explanation: https://youtu.be/Wuf0wE3Mrxg
Presentation with short explanation: https://yadi.sk/i/oDYnpvMhqi8a7w
Competition: https://kaggle.com/c/siim-acr-pneumothorax-segmentation
Model Zoo
- AlbuNet (resnet34) from [ternausnets]
- Resnet50 from [selim_sef SpaceNet 4]
- SCSEUnet (seresnext50) from [selim_sef SpaceNet 4]
Main Features
Triplet scheme of inference and validation
Suppose the segmentation model outputs a mask of per-pixel pneumothorax probabilities. I will call this the basic sigmoid mask. I used a triplet of thresholds: (top_score_threshold, min_contour_area, bottom_score_threshold).
The decision rule is based on the doublet (top_score_threshold, min_contour_area). I used it instead of a separate pneumothorax/non-pneumothorax classification.
- top_score_threshold is a simple binarization threshold that transforms the basic sigmoid mask into a discrete mask of zeros and ones.
- min_contour_area is the minimum required number of pixels with a value greater than top_score_threshold.
Images that did not pass this doublet of thresholds were counted as non-pneumothorax images.
For the remaining pneumothorax images, we binarize the basic sigmoid mask using bottom_score_threshold (another binarization threshold, lower than top_score_threshold). You may notice that most participants used the same scheme under the assumption that bottom_score_threshold = top_score_threshold.
The simplified version of this scheme:
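import numpy as np

def apply_triplet(predicted, top_score_threshold, min_contour_area, bottom_score_threshold):
    # predicted: sigmoid probabilities with shape (N, C, H, W)
    classification_mask = predicted > top_score_threshold
    mask = predicted.copy()
    # zero out images with too few confident pixels (counted as non-pneumothorax)
    mask[classification_mask.sum(axis=(1, 2, 3)) < min_contour_area, :, :, :] = np.zeros_like(predicted[0])
    return mask > bottom_score_threshold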
Search for the best triplet thresholds during validation
- Best triplet on validation: (0.75, 2000, 0.3).
- Best triplet on Public Leaderboard: (0.7, 600, 0.3)
For my final submissions I chose something between these triplets.
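The search itself can be sketched as a plain grid search over candidate triplets. This is a minimal sketch and not the repository's code: the function names `binarize_with_triplet`, `mean_dice` and `search_triplet` are hypothetical, the candidate grids are up to the caller, and masks are assumed to have shape (N, H, W). The score is a per-image Dice in which an empty prediction for an empty ground truth counts as 1, as in the competition metric.

```python
import itertools
import numpy as np


def binarize_with_triplet(predicted, top, min_area, bottom):
    """Binarize sigmoid masks of shape (N, H, W) with the triplet rule."""
    confident_pixels = (predicted > top).sum(axis=(1, 2))
    keep = confident_pixels >= min_area  # images kept as pneumothorax
    return (predicted > bottom) & keep[:, None, None]


def mean_dice(pred_masks, true_masks):
    """Mean per-image Dice; empty prediction vs. empty target scores 1."""
    scores = []
    for pred, true in zip(pred_masks, true_masks):
        total = pred.sum() + true.sum()
        if total == 0:
            scores.append(1.0)
        else:
            scores.append(2.0 * np.logical_and(pred, true).sum() / total)
    return float(np.mean(scores))


def search_triplet(predicted, true_masks, tops, areas, bottoms):
    """Return (best_score, best_triplet) over all candidate combinations."""
    best = (-1.0, None)
    for top, area, bottom in itertools.product(tops, areas, bottoms):
        if bottom > top:  # the bottom threshold should not exceed the top one
            continue
        masks = binarize_with_triplet(predicted, top, area, bottom)
        score = mean_dice(masks, true_masks)
        if score > best[0]:
            best = (score, (top, area, bottom))
    return best
```

Calling `search_triplet(val_probs, val_masks, tops, areas, bottoms)` on held-out predictions returns the best score and the triplet that produced it; `val_probs` and `val_masks` are placeholder names for validation arrays.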
Combo loss
I used a [combo loss]: a weighted combination of BCE, dice and focal losses. In the best experiments the weights of (BCE, dice, focal) were:
- (3,1,4) for albunet_valid and seunet;
- (1,1,1) for albunet_public;
- (2,1,2) for resnet50.
Why exactly these weights?
In the beginning, I trained using only the 1-1-1 scheme, and this is how I got my best public score.
I noticed that in later epochs the Dice loss is about 10 times higher than the other two.
To balance them I decided to use a 3-1-4 scheme, and it gave me the best validation score.
As a compromise I chose the 2-1-2 scheme for resnet50.
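A weighted sum of this kind can be sketched in NumPy as follows. This is a minimal illustration rather than the loss used in the repository: the function names, the focal `gamma=2.0`, the Dice smoothing constant and the clipping epsilon are all assumptions, and the inputs are assumed to be sigmoid probabilities and binary targets of the same shape.

```python
import numpy as np

EPS = 1e-7  # numerical guard for log(); the value is an assumption


def bce_loss(prob, target):
    prob = np.clip(prob, EPS, 1 - EPS)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def dice_loss(prob, target, smooth=1.0):
    intersection = np.sum(prob * target)
    return float(1 - (2 * intersection + smooth) / (np.sum(prob) + np.sum(target) + smooth))


def focal_loss(prob, target, gamma=2.0):
    prob = np.clip(prob, EPS, 1 - EPS)
    p_t = np.where(target == 1, prob, 1 - prob)  # probability of the true class
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))


def combo_loss(prob, target, weights=(3, 1, 4)):
    """Weighted sum of (BCE, dice, focal) with the given weights."""
    w_bce, w_dice, w_focal = weights
    return (w_bce * bce_loss(prob, target)
            + w_dice * dice_loss(prob, target)
            + w_focal * focal_loss(prob, target))
```

Printing the three terms separately during training is a simple way to see the kind of scale imbalance described above before picking weights.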
Sliding sample rate
Let's call the portion of pneumothorax images in an epoch the sample rate.
The main idea is to control this portion using the sampler of the torch dataset.
On each epoch, my sampler takes all pneumothorax images from the dataset and samples some non-pneumothorax images according to this sample rate. During training, we reduce this parameter from 0.8 at the start to 0.4 at the end.
A large sample rate at the beginning gives the learning process a quick start, whereas a small sample rate at the end provides better convergence of the network weights to the initial distribution of pneumothorax/non-pneumothorax images.
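The index selection behind such a sampler can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the linear decay from 0.8 to 0.4 is only one possible schedule (the text above fixes the endpoints, not the shape).

```python
import random


def sample_rate_for_epoch(epoch, n_epochs, start=0.8, end=0.4):
    """Linearly interpolate the sample rate (the linear shape is an assumption)."""
    if n_epochs <= 1:
        return end
    return start + (end - start) * epoch / (n_epochs - 1)


def epoch_indices(positive_idx, negative_idx, sample_rate, rng=random):
    """Take every positive image plus enough negatives to hit the sample rate."""
    # sample_rate = n_positive / (n_positive + n_negative)
    n_negative = round(len(positive_idx) * (1 - sample_rate) / sample_rate)
    n_negative = min(n_negative, len(negative_idx))
    chosen = list(positive_idx) + rng.sample(list(negative_idx), n_negative)
    rng.shuffle(chosen)
    return chosen
```

The returned list could be yielded from the `__iter__` method of a `torch.utils.data.Sampler` subclass, recomputed at the start of every epoch with the current sample rate.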
Learning Process recipes
I can't provide a fully reproducible solution because I uptrained (fine-tuned) my models A LOT during the learning process. But looking back to formalize my experiments, I can highlight 4 different parts:
- part 0 - train for 10-12 epochs from a pretrained model with a large learning rate (about 1e-3 or 1e-4), a large sample rate (0.8) and the ReduceLROnPlateau scheduler. The model can be pretrained on ImageNet or on our dataset at a lower resolution (512x512). The goal of this part is to quickly get a good enough model with a validation score of about 0.835.
- part 1 - uptrain the best model from the previous step with a normal learning rate (~1e-5), a large sample rate (0.6) and the CosineAnnealingLR or CosineAnnealingWarmRestarts scheduler. Repeat until best convergence.
- part 2 - uptrain the best model from the previous step with a normal learning rate (~1e-5), a small sample rate (0.4) and the CosineAnnealingLR or CosineAnnealingWarmRestarts scheduler. Repeat until best convergence.
- second stage - simple uptrain with a relatively small learning rate (1e-5 or 1e-6), a small sample rate (0.5) and the CosineAnnealingLR or CosineAnnealingWarmRestarts scheduler.
All these parts are presented in the corresponding experiment folder.
Augmentations
I used the following transforms from [albumentations]:
import albumentations as albu

albu.Compose([
    albu.HorizontalFlip(),
    albu.OneOf([
        albu.RandomContrast(),
        albu.RandomGamma(),
        albu.RandomBrightness(),
    ], p=0.3),
    albu.OneOf([
        albu.ElasticTransform(alpha=120, sigma=120 * 0.05, alpha_affine=120 * 0.03),
        albu.GridDistortion(),
        albu.OpticalDistortion(distort_limit=2, shift_limit=0.5),
    ], p=0.3),
    albu.ShiftScaleRotate(),
    albu.Resize(img_size, img_size, always_apply=True),
])
Uptrain from lower resolution
All experiments (except resnet50) were uptrained at 1024x1024 after training at 512x512, with the encoder frozen during the early epochs.
Second stage uptrain
All chosen experiments were uptrained on the second-stage data.
Checkpoints averaging
At inference, the top-3 checkpoints from each fold of each pipeline are averaged.
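One way to realize this is to average the predicted probability masks of the selected checkpoints. The sketch below assumes that form of averaging (prediction averaging rather than weight averaging, which the text above does not specify); `average_predictions` and the `predict_fns` callables are hypothetical names, each callable standing for one loaded checkpoint.

```python
import numpy as np


def average_predictions(predict_fns, images):
    """Average the sigmoid masks produced by several checkpoints."""
    total = None
    for predict in predict_fns:
        probs = np.asarray(predict(images), dtype=np.float64)
        total = probs if total is None else total + probs
    if total is None:
        raise ValueError("need at least one checkpoint")
    return total / len(predict_fns)
```

The averaged mask is still a probability mask, so the triplet thresholds can be applied to it unchanged.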
Small batchsize without accumulation
A batch size of 2-4 images is enough, and all my experiments were run on one (sometimes two) 1080 Ti GPUs.
Horizontal flip TTA
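Horizontal flip test-time augmentation averages the prediction on the original image with the prediction on its mirrored copy, flipped back into place. A minimal NumPy sketch, assuming batches of shape (N, C, H, W) and a hypothetical `predict` callable that returns masks in the same layout:

```python
import numpy as np


def predict_with_hflip_tta(predict, images):
    """Average predictions on the original and horizontally flipped batch."""
    plain = np.asarray(predict(images))
    flipped = np.asarray(predict(images[..., ::-1]))  # flip along the width axis
    # Flip the second prediction back so both masks align pixel for pixel.
    return (plain + flipped[..., ::-1]) / 2.0
```

Forgetting the second flip is the usual mistake here: the two masks would then be averaged in mirrored positions.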
File structure
├── unet_pipeline
│   ├── experiments
│   │   ├── some_experiment
│   │   │   ├── train_config.yaml
│   │   │   ├── inference_config.yaml
│   │   │   ├── submit_config.yaml
│   │   │   ├── checkpoints
│   │   │   │   ├── fold_i
│   │   │   │   │   ├── topk_checkpoint_from_fold_i_epoch_k.pth
│   │   │   │   │   ├── summary.csv
│   │   │   │   ├── best_checkpoint_from_fold_i.pth
│   │   │   ├── log
├── input
│   ├── dicom_train
│   │   ├── some_folder
│   │   │   ├── some_folder
│   │   │   │   ├── some_train_file.dcm
│   ├── dicom_test
│   │   ├── some_folder
│   │   │   ├── some_folder
│   │   │   │   ├── some_test_file.dcm
│   ├── new_sample_submission.csv
│   └── new_train_rle.csv
└── requirements.txt
Install
pip install -r requirements.txt
Data Preparation
You need to substitute your own input data folder names and RLE file path:
cd unet_pipeline/utils
python prepare_png.py -img_size 1024 -train_path ../../input/dicom_train -test_path ../../input/dicom_test -out_path ../../input/dataset1024 -rle_path ../../input/new_train_rle.csv -n_threads 8
Pipeline launch example
Training:
cd unet_pipeline
python Train.py experiments/albunet_valid/train_config_part0.yaml
python Train.py experiments/albunet_valid/train_config_part1.yaml
python Train.py experiments/albunet_valid/train_config_part2.yaml
python Train.py experiments/albunet_valid/train_config_2nd_stage.yaml
As output, we get checkpoints in the corresponding folder.
Inference:
cd unet_pipeline
python Inference.py experiments/albunet_valid/2nd_stage_inference.yaml
As output, we get a pickle file that maps each file name to a mask of pneumothorax probabilities.
Submit:
cd unet_pipeline
python TripletSubmit.py experiments/albunet_valid/2nd_stage_submit.yaml
As output, we get a submission file with RLE-encoded masks.
Best experiments:
- albunet_public - best model for Public Leaderboard
- albunet_valid - best resnet34 model on validation
- seunet - best seresnext50 model on validation
- resnet50 - best resnet50 model on validation
Final Submission
My best model on the Public Leaderboard was albunet_public (PL: 0.8871), and every ensemble scored worse. But I suspected this model of overfitting, so both final submissions were ensembles.
- The first ensemble trusted the Public Leaderboard scores more and used "weaker" triplet thresholds.
- The second ensemble trusted the validation scores more but used "stricter" triplet thresholds.
Private Leaderboard:
- 0.8679
- 0.8641
I suspect that the best solution would have been an ensemble that trusted the validation scores more but used the "weaker" triplet thresholds.