• This repository has been archived on 14/Nov/2020
• Stars: 112
• Rank: 312,240 (Top 7%)
• Language: Python
• License: MIT License
• Created over 6 years ago; updated about 4 years ago

Repository Details

8th place solution (on Kaggle) to the Freesound General-Purpose Audio Tagging Challenge (DCASE 2018 - Task 2)

This repository contains the final solution that I used for the Freesound General-Purpose Audio Tagging Challenge on Kaggle. The model achieved 8th position on the final leaderboard, with a score (MAP@3) of 0.943289.

About the competition

The objective of the competition was to distinguish among 41 different types of sounds using the provided WAV files. Sounds in the dataset include things like musical instruments, human sounds, domestic sounds, and animals. Submissions were evaluated according to Mean Average Precision @ 3 (MAP@3). For more information, refer to the Kaggle home page for this competition.

Acknowledgements

Thanks to Amlan Praharaj, Oleg Panichev and Aleksandrs Gehsbargs for the kernels they shared. These kernels helped me get started on the competition. Special thanks to @daisukelab for sharing his observations about data preprocessing, augmentation and model architectures. His insights were crucial to my solution.

Solution

Note that this document describes only the final approach. For the full list of things that I tried, please refer to this document.

Data preprocessing

Leading/trailing silence in the audio may not contain much information and is thus not useful for the model. Hence, the very first preprocessing step is to remove this silence. The librosa.effects.trim function was used to achieve this.
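For illustration, a minimal sketch of this step (the file path and sampling rate are assumptions, not the repository's exact code):

import librosa

# Load a clip and strip leading/trailing silence (librosa's default top_db=60)
y, sr = librosa.load('data/audio_train/example.wav', sr=None)  # illustrative path
y_trimmed, _ = librosa.effects.trim(y)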

Log Mel-Spectrograms

In speech recognition tasks, MFCC features are often constructed from the raw audio data. Since the current data contains non-human sounds as well, the log mel-spectrogram representation works better than MFCCs here. Log mel-spectrograms for all train and test samples were pre-computed, so that compute time can be saved during training and prediction (disk is cheaper than GPU time).
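A rough sketch of this pre-computation step: only the 64 mel bands are implied by the model's input shape, while the sampling rate, paths and remaining parameters are assumptions:

import librosa
import numpy as np

# Illustrative pre-computation of a cached log mel-spectrogram for one clip
y, sr = librosa.load('data/audio_train/example.wav', sr=44100)  # assumed path and rate
y, _ = librosa.effects.trim(y)                                  # trimming step from above
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)     # 64 mel bands, per the model input
log_mel = librosa.power_to_db(mel)                              # convert power to a log (dB) scale
np.save('data/mel/example.npy', log_mel)                        # cache to disk for later reuse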

Additional features

Inspired by a few Kaggle kernels, summary statistics of multiple spectral and time-based features were calculated. Since many of these features were correlated, they were transformed using Principal Component Analysis (PCA). The top 350 components were used for modeling (which account for ~97% of the total variance).
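A hedged sketch of this step using scikit-learn; summary_features is a hypothetical (n_samples, n_features) array holding the pre-computed summary statistics:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

summary_features = np.load('data/summary_features.npy')     # hypothetical cached array
scaled = StandardScaler().fit_transform(summary_features)   # standardize before PCA
pca = PCA(n_components=350)
pca_features = pca.fit_transform(scaled)                     # keep the top 350 components
print(pca.explained_variance_ratio_.sum())                   # ~0.97 according to the write-up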

Architecture

At its core, the model uses the MobileNetV2 architecture with a few modifications. The log mel-spectrogram input is first passed through two 2D convolution layers before being fed into MobileNetV2, so that the single-channel input is converted into a three-channel input (thanks to the fast.ai forums for this tip). The output from MobileNetV2 is then concatenated with the PCA features, and a series of Dense layers follow before the final softmax activation layer.

from keras.layers import (Input, BatchNormalization, Conv2D, Dense,
                          GlobalAveragePooling2D, concatenate)
from keras.models import Model
from keras.applications.mobilenet_v2 import MobileNetV2

# Log mel-spectrogram input: 64 mel bands, variable number of time frames, 1 channel
inp1 = Input(shape=(64, None, 1), name='mel')

x = BatchNormalization()(inp1)
# Two 1x1 convolutions map the single-channel spectrogram to the 3 channels MobileNetV2 expects
x = Conv2D(10, kernel_size=(1, 1), padding='same', activation='relu')(x)
x = Conv2D(3, kernel_size=(1, 1), padding='same', activation='relu')(x)

# MobileNetV2 backbone without its classification head; drop its own input layer
mn = MobileNetV2(include_top=False)
mn.layers.pop(0)

mn_out = mn(x)
# Global average pooling allows inputs of varying length
x = GlobalAveragePooling2D()(mn_out)

# Second input: the 350 PCA-transformed summary features
inp2 = Input(shape=(350,), name='pca')
y = BatchNormalization()(inp2)

# Concatenate CNN features with PCA features, then classify over the 41 classes
x = concatenate([x, y], axis=-1)
x = Dense(1536, activation='relu')(x)
x = BatchNormalization()(x)
x = Dense(384, activation='relu')(x)
x = BatchNormalization()(x)
x = Dense(41, activation='softmax')(x)

model = Model(inputs=[inp1, inp2], outputs=x)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
mel (InputLayer)                (None, 64, None, 1)  0                                            
__________________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 64, None, 1)  4           mel[0][0]                        
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 64, None, 10) 20          batch_normalization_1[0][0]      
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 64, None, 3)  33          conv2d_1[0][0]                   
__________________________________________________________________________________________________
mobilenetv2_1.00_224 (Model)    multiple             2257984     conv2d_2[0][0]                   
__________________________________________________________________________________________________
pca (InputLayer)                (None, 350)          0                                            
__________________________________________________________________________________________________
global_average_pooling2d_1 (Glo (None, 1280)         0           mobilenetv2_1.00_224[1][0]       
__________________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 350)          1400        pca[0][0]                        
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 1630)         0           global_average_pooling2d_1[0][0]
                                                                 batch_normalization_2[0][0]      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1536)         2505216     concatenate_1[0][0]              
__________________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 1536)         6144        dense_1[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 384)          590208      batch_normalization_3[0][0]      
__________________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 384)          1536        dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 41)           15785       batch_normalization_4[0][0]      
==================================================================================================
Total params: 5,378,330
Trainable params: 5,339,676
Non-trainable params: 38,654
__________________________________________________________________________________________________

Train data generation with augmentation

Both the train and test audio files are of varied length, and the model is designed to make use of this. Using global average pooling before the Dense layers allows the model to accept inputs of various lengths. During training, a random length between two limits is chosen at each batch generation; the 25th and 75th percentiles of the train file lengths are used as the minimum and maximum limits, respectively. Samples shorter than the chosen length are padded, while a random span of the chosen length is extracted from longer samples.
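A hedged sketch of this pad-or-crop logic (the function names are mine; min_len and max_len stand for the 25th/75th percentile limits):

import numpy as np

def sample_batch_length(min_len, max_len, rng=np.random):
    # Pick a random target length (in frames) for this batch
    return rng.randint(min_len, max_len + 1)

def pad_or_crop(mel, length, rng=np.random):
    # mel is a (64, T) log mel-spectrogram; return a (64, length) array
    t = mel.shape[1]
    if t >= length:
        start = rng.randint(0, t - length + 1)   # random span from a longer sample
        return mel[:, start:start + length]
    return np.pad(mel, ((0, 0), (0, length - t)), mode='constant')  # pad a shorter sample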

Common image-classification augmentation practices such as horizontal/vertical shift and horizontal flip were used. In addition, random erasing was used. Random erasing, or Cutout, selects a random rectangle in the image and replaces it with adjacent or random values. For more information about this data augmentation technique, refer to the original paper.
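An illustrative sketch of random erasing on a 2D spectrogram (the rectangle-size limits and the uniform random fill values are assumptions):

import numpy as np

def random_erase(mel, max_h=16, max_w=32, rng=np.random):
    # Blank out a random rectangle of a (64, T) spectrogram with random values
    h, w = mel.shape
    eh = rng.randint(1, min(max_h, h) + 1)
    ew = rng.randint(1, min(max_w, w) + 1)
    top = rng.randint(0, h - eh + 1)
    left = rng.randint(0, w - ew + 1)
    out = mel.copy()
    out[top:top + eh, left:left + ew] = rng.uniform(mel.min(), mel.max(), size=(eh, ew))
    return out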

Mixup is the final augmentation technique used while training. Mixup essentially takes pairs of data points, chosen randomly, and mixes them (both X and y) using a proportion drawn from a Beta distribution.

One intuition behind this is that by linearly interpolating between datapoints, we incentivize the network to act smoothly and kind of interpolate nicely between datapoints - without sharp transitions. (Quote from https://www.inference.vc/mixup-data-dependent-data-augmentation/)
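A minimal mixup sketch for a prepared batch (the alpha value is an assumption; the write-up does not state the exact parameter used):

import numpy as np

def mixup(x, y, alpha=0.2, rng=np.random):
    # Mix each example with a randomly chosen partner, for both inputs and labels
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]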

While I haven't run exhaustive trials to say for sure, anecdotally, each of these data augmentations has helped in improving the loss.

Training

Ten folds (stratified, since there is class imbalance) were generated. For each fold, a model of similar architecture that uses only the log mel-spectrogram data is trained first. The weights from this model are then loaded into the full model (which uses both mel and PCA features), and training continues. Attempting to train the model without this two-stage approach didn't result in as good a model.
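A hedged sketch of the two-stage, ten-fold scheme; build_mel_only_model, build_full_model and the in-memory arrays are hypothetical placeholders, not the repository's actual code:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(file_names, labels)):
    stage1 = build_mel_only_model()                             # hypothetical helper: mel-only model
    stage1.fit(mel_train[tr_idx], y_train[tr_idx])              # stage 1: train on mel data only
    stage1.save_weights(f'stage1_fold{fold}.h5')

    stage2 = build_full_model()                                 # hypothetical helper: mel + PCA model
    stage2.load_weights(f'stage1_fold{fold}.h5', by_name=True)  # transfer the shared layer weights
    stage2.fit([mel_train[tr_idx], pca_train[tr_idx]], y_train[tr_idx])  # stage 2: continue training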

Predictions on test data

Six different lengths were selected at equal intervals between the 25th and 75th percentiles of the train file lengths. To make use of the extra information present in longer samples, predictions are generated five times at each length; each time, a random span of the specified length is extracted from samples longer than that length. 10 (folds) x 6 (lengths) x 5 (tries) gives 300 sets of predictions for the test data. All of these predictions were combined using the geometric mean, and the top 3 predicted classes for each data point were selected for submission.
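A short sketch of the final combination step; all_preds is a hypothetical (300, n_clips, 41) array holding the 300 prediction sets:

import numpy as np

# Geometric mean over the 300 prediction sets, then the top-3 classes per clip
log_mean = np.log(np.clip(all_preds, 1e-8, None)).mean(axis=0)
combined = np.exp(log_mean)
top3 = np.argsort(-combined, axis=1)[:, :3]   # indices of the 3 highest-probability classes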

Reproducing the results

Download the data files and keep them in the ./data/ folder, then run the bash script ./run_all.sh. The script was tested on Ubuntu 16.04.

More Repositories

1. i3-wm-config: I3 tiling window manager configuration (Shell, 137 stars)
2. i3-wm-multi-disp-scripts: Scripts to navigate a multi-monitor setup in I3 WM (Python, 45 stars)
3. dcase2019-task5-urban-sound-tagging: 1st place solution to the DCASE 2019 - Task 5 - Urban Sound Tagging (Python, 30 stars)
4. spotify-sequential-skip-prediction: 7th place solution to the WSDM Cup 2019 - Spotify - Sequential Skip Prediction Challenge (Jupyter Notebook, 26 stars)
5. stubthat: Stubbing framework for R (R, 17 stars)
6. kaggle-carvana-image-masking-challenge: Top 15% ranked solution to the Carvana Image Masking Challenge on Kaggle (Jupyter Notebook, 15 stars)
7. mediaeval-2019-moodtheme-detection: 4th position solution to MediaEval - The 2019 Emotion and Themes in Music using Jamendo (Jupyter Notebook, 14 stars)
8. reddit2mobi: Export a reddit post into a mobi/epub via html (HTML, 10 stars)
9. emacs-spacemacs-config: Emacs (Spacemacs) configuration (Emacs Lisp, 6 stars)
10. nlp-tutorials-notes (Jupyter Notebook, 4 stars)
11. sway-wm-multi-disp-scripts: Sway version of my i3-wm-multi-disp-scripts (Python, 3 stars)
12. sainathadapa.github.io: Source code for my blog (HTML, 3 stars)
13. population-pyramid-states-india: Population pyramid charts for the States of India (R, 2 stars)
14. auto-redirect-nbviewer: Auto-Redirect Github/Gitlab ipynb urls to NBViewer - Google Chrome extension (wip) (JavaScript, 2 stars)
15. plotting-cyclones: Cyclones in The Bay of Bengal (R, 1 star)
16. swaywm-config: Sway window manager configuration (Shell, 1 star)
17. spark-course-files (1 star)
18. r-snippets: Personal R functions/scripts (R, 1 star)
19. ams-hackathon: 1st place solution during the hackathon on Image recognition by the Gemeente Amsterdam (Jupyter Notebook, 1 star)
20. nodejs-r-child-process: Running R as a child process from NodeJS (JavaScript, 1 star)
21. kaggle-invasive-species-monitoring: 6th place solution to the Kaggle - Invasive Species Monitoring (Jupyter Notebook, 1 star)