Cornell Birdcall Identification Competition

1st Place solution to the Cornell Birdcall Identification competition hosted on Kaggle.

Context

In this competition, you will identify a wide variety of bird vocalizations in soundscape recordings. Due to the complexity of the recordings, they contain weak labels. There might be anthropogenic sounds (e.g., airplane overflights) or other bird and non-bird (e.g., chipmunk) calls in the background, with a particular labeled bird species in the foreground. Bring your new ideas to build effective detectors and classifiers for analyzing complex soundscape recordings!

Evaluation

The hidden test_audio directory contains approximately 150 recordings in mp3 format, each roughly 10 minutes long. They will not all fit in a notebook's memory at the same time. The recordings were taken at three separate remote locations in North America. Sites 1 and 2 were labeled in 5 second increments and need matching predictions, but due to the time consuming nature of the labeling process the site 3 files are only labeled at the file level. Accordingly, site 3 has relatively few rows in the test set and needs lower time resolution predictions.
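Because the recordings do not all fit in memory, one option is to process a single file at a time and walk through it in 5-second windows. The sketch below only computes the window boundaries. The 32 kHz sample rate is taken from the training note further down, and the function name and the choice to index each window by its end second are assumptions for illustration.

```python
import math


def five_second_windows(duration_s, sample_rate=32000, window_s=5):
    """Yield (end_second, start_sample, end_sample) for each window.

    The last window is truncated if the recording length is not a
    multiple of window_s.
    """
    total_samples = int(duration_s * sample_rate)
    n_windows = math.ceil(duration_s / window_s)
    for index in range(n_windows):
        start = index * window_s * sample_rate
        end = min((index + 1) * window_s * sample_rate, total_samples)
        yield (index + 1) * window_s, start, end


# A 10-minute recording yields 120 windows.
windows = list(five_second_windows(600))
print(len(windows), windows[0], windows[-1])
```

For site 3, which is labeled at the file level, the per-window predictions would still need to be merged into a single row per file.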

Submissions were scored using the row-wise micro averaged F1 score.
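The exact scoring code is not reproduced here. One common reading of the metric, used in the sketch below, is to compute an F1 score for each row from its set of predicted and true labels and then average over rows. This reading, the space-delimited label format, the placeholder label nocall for empty rows, and the example bird codes are all assumptions.

```python
def row_f1(true_labels, pred_labels):
    """F1 score for one row of space-delimited labels."""
    true_set = set(true_labels.split())
    pred_set = set(pred_labels.split())
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


def row_wise_f1(true_rows, pred_rows):
    """Average the per-row F1 scores over all rows."""
    if len(true_rows) != len(pred_rows):
        raise ValueError("true_rows and pred_rows must have the same length")
    scores = [row_f1(t, p) for t, p in zip(true_rows, pred_rows)]
    return sum(scores) / len(scores)


# Hypothetical rows: one exact match, one false positive, one partial match.
true_rows = ["aldfly", "nocall", "amecro amegfi"]
pred_rows = ["aldfly", "aldfly", "amecro"]
print(round(row_wise_f1(true_rows, pred_rows), 4))
```

Under this reading a false positive on an empty row costs a full row, which is why reducing false positives matters so much for the final score.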

Solution

My solution follows the Sound Event Detection (SED) approach described in PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. Kaggle user hidehisaarai1213 provides a good explanation of how it works in the Kaggle kernel introduction to sound event detection.

Main Model Differences

Replaced the CNN feature extractor with a pretrained DenseNet121 model.
Replaced the torch.clamp method with torch.tanh in the attention layer.
Reduced the AttBlock size to 1024.

These changes were made to reduce overfitting, since there are at most 100 labelled samples per bird class.
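To illustrate the second change, the sketch below compares attention weights over time frames when the attention logits are clamped versus passed through tanh before the softmax. It uses plain Python rather than the repository's PyTorch code. The clamp bounds of -10 and 10 and the example logits are assumptions for illustration. Because tanh bounds each logit to the range -1 to 1, no frame can receive more than e^2 (about 7.4) times the weight of another, so the attention is spread more evenly across frames.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def clamp(value, low=-10.0, high=10.0):
    """Limit a value to [low, high]; the bounds are assumed, not from the source."""
    return max(low, min(high, value))


def attention_weights(logits, squash):
    """Apply a squashing function to each logit, then normalise over frames."""
    return softmax([squash(v) for v in logits])


# Hypothetical attention logits for four time frames of one class.
frame_logits = [-4.0, 0.5, 6.0, 1.0]
clamp_weights = attention_weights(frame_logits, clamp)
tanh_weights = attention_weights(frame_logits, math.tanh)
print("clamp:", [round(w, 3) for w in clamp_weights])
print("tanh: ", [round(w, 3) for w in tanh_weights])
```

With clamping, a single frame with a large logit dominates the pooled output. With tanh, the weights stay much flatter, which may act as a mild regulariser on small datasets.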

Data Augmentation

The audiomentations library provided an easy way to add data augmentation to the audio samples. The following augmentations were applied during training:

AddGaussianNoise
AddGaussianSNR
Gain
AddBackgroundNoise (based on pre-generated 1 minute samples of pink noise)
AddShortNoises (based on pre-generated 1 minute samples of pink noise)
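How the pink noise files were generated is not stated. As one possible approach, the sketch below approximates pink noise with the Voss-McCartney algorithm and writes it as a 16-bit mono WAV file using only the standard library. The function names, the 16 noise rows, and the 32 kHz sample rate are assumptions for illustration.

```python
import random
import struct
import wave


def pink_noise(n_samples, n_rows=16, seed=None):
    """Approximate pink noise in [-1, 1] with the Voss-McCartney algorithm."""
    rng = random.Random(seed)
    rows = [rng.uniform(-1.0, 1.0) for _ in range(n_rows)]
    samples = []
    for i in range(1, n_samples + 1):
        # Number of trailing zero bits in i: row k is refreshed once
        # every 2**(k+1) samples, so higher rows change more slowly.
        row = (i & -i).bit_length() - 1
        if row < n_rows:
            rows[row] = rng.uniform(-1.0, 1.0)
        samples.append(sum(rows) / n_rows)
    return samples


def write_wav(path, samples, sample_rate=32000):
    """Write float samples in [-1, 1] as a 16-bit mono WAV file."""
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(pcm), *pcm))
```

A 1-minute file could then be produced with write_wav("pink_000.wav", pink_noise(60 * 32000, seed=0)), where the file name is hypothetical.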

Inference

Model ensembling by voting, with thresholds on both clipwise_output and framewise_output, was key to reducing the number of false positives and maximising the F1 score.

The full ensemble consisted of 13 models:

4 fold models (without mixup)
5 fold models (without mixup)
4 fold models (with mixup)

Two submissions could be selected before the Private Leaderboard was revealed. My top ensemble accepted a bird as a valid prediction for an audio snippet if at least 4 of the 13 models predicted it, using a threshold of 0.3 for both clipwise_output and framewise_output. This was not the highest-scoring ensemble on the Public Leaderboard (3 votes scored highest there), but I selected it because I had the most confidence in it: 3 votes felt too low for 13 models. With this ensemble I jumped from 7th on the Public Leaderboard to 1st on the Private Leaderboard.

My second selection was based on 9 models (the models without mixup). My top scoring ensemble with 9 models used 3 votes, a frame threshold of 0.5 and a clip threshold of 0.3. I didn't try a frame threshold of 0.3 since I ran out of submissions (max 2 submissions per day). The selected 9 model ensemble would have put me at 3rd position on the Private Leaderboard.
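The voting rule described above can be sketched as follows. This is not the repository's inference code. It assumes that a model votes for a bird when its clipwise probability reaches the clip threshold and the maximum framewise probability inside the snippet reaches the frame threshold. The function names and the data layout are hypothetical.

```python
from collections import Counter


def model_votes(clip_prob, frame_probs, clip_threshold=0.3, frame_threshold=0.3):
    """Return True if one model votes for a bird in this snippet.

    frame_probs must hold at least one framewise probability.
    """
    return clip_prob >= clip_threshold and max(frame_probs) >= frame_threshold


def ensemble_predict(model_outputs, min_votes=4, clip_threshold=0.3, frame_threshold=0.3):
    """Return the sorted birds that receive at least min_votes votes.

    model_outputs holds one dict per model, mapping a bird code to
    (clip_prob, frame_probs) for the current audio snippet.
    """
    votes = Counter()
    for outputs in model_outputs:
        for bird, (clip_prob, frame_probs) in outputs.items():
            if model_votes(clip_prob, frame_probs, clip_threshold, frame_threshold):
                votes[bird] += 1
    return sorted(bird for bird, count in votes.items() if count >= min_votes)


# Hypothetical outputs from 13 models: 4 are confident about "aldfly",
# and all 13 pass the clip threshold for "amecro" but fail the frame threshold.
confident = {"aldfly": (0.8, [0.1, 0.9]), "amecro": (0.9, [0.1, 0.2])}
unsure = {"aldfly": (0.2, [0.9]), "amecro": (0.9, [0.1, 0.2])}
print(ensemble_predict([confident] * 4 + [unsure] * 9))  # ['aldfly']
```

Raising min_votes or either threshold trades recall for fewer false positives, which is the trade-off explored in the tables below.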

13 Models	Clip Threshold	Frame Threshold	Public Leaderboard	Private Leaderboard	Selected
4 votes	0.3	0.3	0.616	0.681	x
4 votes	0.3	0.5	0.615	0.679
5 votes	0.3	0.3	0.609	0.679
5 votes	0.3	0.5	0.606	0.676
3 votes	0.3	0.3	0.617	0.679
3 votes	0.3	0.5	0.614	0.679

9 Models	Clip Threshold	Frame Threshold	Public Leaderboard	Private Leaderboard	Selected
3 votes	0.3	0.5	0.613	0.676	x
2 votes	0.3	0.5	0.614	0.669
4 votes	0.3	0.5	0.610	0.675

Instructions

0. (Optional) GCP VM Setup

export IMAGE_FAMILY=pytorch-latest-gpu
export INSTANCE_NAME="pytorch-instance"
export ZONE="europe-west4-b"
export INSTANCE_TYPE="n1-standard-8"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-t4,count=1" \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=300GB \
        --metadata="install-nvidia-driver=True" \
        --preemptible

1. Run on first install

Requires PyTorch to be preinstalled. Skip installation of NVIDIA apex in the config if the GPU does not support mixed precision training.

. startup.sh

NOTE: In their current state the SED training models do not support mixed precision training and will output NaN values. This can be fixed by adjusting the amin value, but this change is untested and may impact model performance. USE_AMP in base_engine.py will also need to be set to True (it is currently disabled even if amp is installed).
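One plausible explanation for the NaN values, offered here as an assumption rather than a confirmed diagnosis, is that amin is the floor applied to the spectrogram power before the logarithm, and that a very small floor underflows to zero in half precision. The sketch below uses the standard library's half-precision struct format to show the effect. The value 1e-10 is an assumed default and 1e-6 is an arbitrary alternative.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def log_floor(amin):
    """log10 of amin after fp16 rounding; -inf if amin underflows to zero."""
    floored = to_fp16(amin)
    return math.log10(floored) if floored > 0.0 else float("-inf")


# The smallest positive fp16 value is about 6e-8, so 1e-10 rounds to zero.
for amin in (1e-10, 1e-6):
    print(amin, to_fp16(amin), log_floor(amin))
```

A floor that survives the cast keeps the logarithm finite, but it also changes the dynamic range of the log-mel input, which is consistent with the warning that model performance may be affected.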

2. Set Environment Variables

Both of these are optional:

export NEPTUNE_API_TOKEN=<ACCESS_TOKEN>
export SLACK_URL=<SLACK_WEBHOOK_URL>

Neptune is not required: set logger_name in the config to a different name and create a different logger.

A Slack notification containing the loss is sent at the end of each epoch during training and validation.
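For reference, a notification of this kind can be sent to a Slack incoming webhook with a JSON body containing a text field. The sketch below uses only the standard library and is not the repository's implementation. The function names and the message wording are assumptions.

```python
import json
import os
import urllib.request


def build_slack_request(webhook_url, text):
    """Build a POST request carrying a JSON payload with a text field."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_epoch(epoch, train_loss, valid_loss):
    """Send the epoch losses to Slack; do nothing if SLACK_URL is unset."""
    webhook_url = os.environ.get("SLACK_URL")
    if not webhook_url:
        return False
    text = f"Epoch {epoch}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}"
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200
```

Returning early when SLACK_URL is missing keeps the variable optional, as described above.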

3. Training

python sed_train.py --config "config_params.example_config"

Here, config is the path to the Parameter class containing the parameters for training, e.g. config_params.example_config.
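How sed_train.py resolves this path is not shown here. A minimal sketch of one way to turn a dotted path into a parameter object, assuming the module exposes a class named Parameter that can be instantiated without arguments, is:

```python
import argparse
import importlib


def parse_args(argv=None):
    """Parse the --config option from the command line."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--config", required=True,
                        help="dotted module path, e.g. config_params.example_config")
    return parser.parse_args(argv)


def load_parameters(config_path):
    """Import the config module and instantiate its Parameter class.

    Assumes the module defines a class called Parameter; this mirrors
    the description above but is not taken from the repository.
    """
    module = importlib.import_module(config_path)
    return module.Parameter()
```

With this layout, switching experiments only requires pointing --config at a different module.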

Config files used for the final solution are available in the src/config_params/final_sed and src/config_params/final_sed_5_fold folders.

Note: training was completed on the resampled 32 kHz wav equivalent of the training data provided on Kaggle, with the same folder structure. For the given configs to work, the following instructions need to be followed:

store the training data in /data
store the generated pink noise in /pinknoise
/background/data_ssw in the validation dataloader is data provided in the discussion here, with 5-30 second clips extracted where no bird call was found during the extracted time. Adding this is not required.
change the project_name in the config files to your personal Neptune project name if using a Neptune logger.