BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

[Figure: key_visual]

BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

This is a demo implementation of BYOL for Audio (BYOL-A), a self-supervised learning method for general-purpose audio representation. It includes:

  • Training code that can train models with arbitrary audio files.
  • Evaluation code that can evaluate trained models with downstream tasks.
  • Pretrained weights.

UPDATE (Dec, 2022): The v2 is now published in TASLP! We updated the BibTeX.

UPDATE (Nov, 2022): New model definitions (AudioNTT2020X, AudioNTT2020Task6X) are ready. These are for making all layer features accessible so that the weighted sum of layer features can be available in SUPERB.

UPDATE (May, 2022): We have two papers for BYOL-A. If you find BYOL-A useful in your research, please use either of the following BibTeX entries for citation. The former is the first paper from IJCNN2021 (LINK to IEEE Xplore), and the latter has been published in TASLP (LINK to arXiv).

@inproceedings{niizumi2021byol-a,
    title={BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation},
    author={Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {2021 International Joint Conference on Neural Networks (IJCNN)},
    publisher={IEEE},
    DOI={10.1109/ijcnn52387.2021.9534474},
    url={http://dx.doi.org/10.1109/IJCNN52387.2021.9534474},
    year={2021},
    month={Jul}
}
@article{niizumi2023byol-a,
    title={{BYOL for Audio}: Exploring Pre-trained General-purpose Audio Representations},
    author={Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
    publisher={Institute of Electrical and Electronics Engineers (IEEE)},
    year={2023},
    volume={31},
    pages={137--151},
    doi={10.1109/TASLP.2022.3221007},
    url={http://dx.doi.org/10.1109/TASLP.2022.3221007},
    ISSN={2329-9304}
}

What is the difference between the papers?

We've added an augmentation block and updated network architecture.

We introduced an extra augmentation block, Random Linear Fader, in the 2022 version (TASLP2023).

[Figure: aug-history]
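To make the idea of a linear fader concrete, here is a minimal sketch. It assumes the block adds a random linear ramp along the time axis of a log-mel spectrogram, which corresponds to a gradual volume change in the linear domain. The function name `random_linear_fader` and the `gain` parameter are illustrative and not taken from this README; the v2 folder holds the actual implementation.

```python
import torch


def random_linear_fader(lms: torch.Tensor, gain: float = 1.0) -> torch.Tensor:
    """Add a random linear slope along the time axis of a log-mel spectrogram.

    lms:  tensor of shape (channel, n_mels, time), in log scale.
    gain: hypothetical bound on the offset applied at each end of the clip.
    """
    # Draw the offsets for the first and last frame from U(-gain, gain).
    head, tail = (gain * (2.0 * torch.rand(2) - 1.0)).tolist()
    n_frames = lms.shape[-1]
    # Linear ramp from head to tail; it broadcasts over channel and mel axes.
    slope = torch.linspace(head, tail, n_frames, dtype=lms.dtype, device=lms.device)
    # Adding in the log domain scales the signal power in the linear domain.
    return lms + slope


torch.manual_seed(0)
lms = torch.randn(1, 64, 96)  # synthetic stand-in for a normalized log-mel spectrogram
faded = random_linear_fader(lms)
```

Because the ramp is added in the log domain, every mel bin of a frame receives the same offset, so the spectral shape of each frame is preserved while the overall level drifts over time.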

We reduced the number of convolutional blocks from three to two and added a skip connection at a new Concat block in the 2022 version.

[Figure: network-history]
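The skip connection at a Concat block can be pictured with the following sketch. It assumes that frame-wise features from the convolutional blocks pass through a small MLP and are then concatenated with the MLP output along the feature axis. The class name `ConcatSkipHead` and all layer sizes are hypothetical; please refer to the v2 folder for the real network definition.

```python
import torch
import torch.nn as nn


class ConcatSkipHead(nn.Module):
    """Hypothetical head: a per-frame MLP whose output is concatenated with its input."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) frame-wise features from the convolutional blocks.
        y = self.fc(x)
        # Skip connection: keep the convolutional features next to the MLP output.
        return torch.cat([x, y], dim=-1)


torch.manual_seed(0)
frames = torch.randn(2, 24, 128)  # synthetic frame-wise features
head = ConcatSkipHead(in_dim=128, hidden_dim=256)
out = head(frames)
print(out.shape)  # torch.Size([2, 24, 384])
```

The output dimension is the sum of the input and MLP dimensions, so downstream layers see both the lower-level and the transformed features.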

  • For IJCNN2021, the code has not been changed; please find the details in this README.
  • For TASLP2023 (2022 version), 👉 please find the code in the v2 folder.

Getting Started

  1. Download external source files, and apply a patch. Our implementation uses the following.

    curl -O https://raw.githubusercontent.com/lucidrains/byol-pytorch/2aa84ee18fafecaf35637da4657f92619e83876d/byol_pytorch/byol_pytorch.py
    patch < byol_a/byol_pytorch.diff
    mv byol_pytorch.py byol_a
    curl -O https://raw.githubusercontent.com/daisukelab/general-learning/7b31d31637d73e1a74aec3930793bd5175b64126/MLP/torch_mlp_clf.py
    mv torch_mlp_clf.py utils
  2. Install PyTorch 1.7.1, torchaudio, and the other dependencies listed in requirements.txt.

Evaluating BYOL-A Representations

Downstream Task Evaluation

The following steps perform a downstream task evaluation in a linear-probe fashion. This example uses SPCV2 (Speech Commands dataset v2).

  1. Preprocess the metadata (.csv file) and audio files. The processed files will be stored under the folder work.

    # usage: python -m utils.preprocess_ds <downstream task> <path to its dataset>
    python -m utils.preprocess_ds spcv2 /path/to/speech_commands_v0.02
  2. Run the evaluation. This first converts all .wav audio to representation embeddings, then trains a linear layer network, and finally calculates the accuracy as the result.

    python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth spcv2
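For readers unfamiliar with linear probing, the sketch below illustrates the idea on synthetic data: the embeddings stay frozen and only a single linear layer is trained on top of them. It is not the code in evaluate.py; the function `linear_probe`, its hyperparameters, and the synthetic clusters are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def linear_probe(train_x, train_y, test_x, test_y, n_classes, epochs=200, lr=0.05):
    """Train one linear layer on frozen embeddings and return the test accuracy."""
    clf = nn.Linear(train_x.shape[1], n_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(train_x), train_y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        pred = clf(test_x).argmax(dim=1)
    return (pred == test_y).float().mean().item()


def make_split(n, dim=16):
    """Synthetic stand-in for embeddings: two well-separated Gaussian clusters."""
    y = torch.randint(0, 2, (n,))
    x = torch.randn(n, dim) + 3.0 * (y.float().unsqueeze(1) - 0.5)
    return x, y


torch.manual_seed(0)
train_x, train_y = make_split(200)
test_x, test_y = make_split(100)
acc = linear_probe(train_x, train_y, test_x, test_y, n_classes=2)
print(f"linear-probe accuracy: {acc:.3f}")
```

In a real evaluation, train_x and test_x would be the embeddings produced by the pretrained encoder for the training and test splits of the downstream task.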

You can also run an evaluation multiple times and take the average result. The following evaluates on UrbanSound8K with a unit audio duration of 4.0 seconds, 10 times.

# usage: python evaluate.py <your weight> <downstream task> <unit duration sec.> <# of iteration>
python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth us8k 4.0 10

Similarly, the following evaluates on NSynth (4.0 seconds long) 10 times.

python evaluate.py pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth nsynth 4.0 10

Evaluating Representations In Your Tasks

This is an example to calculate a feature vector for an audio sample.

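import torch
import torchaudio

from byol_a.common import *
from byol_a.augmentations import PrecomputedNorm
from byol_a.models import AudioNTT2020


# Use a GPU when available; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')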
cfg = load_yaml_config('config.yaml')
print(cfg)

# ** Prepare the statistics in advance **
# You need to calculate the statistics of mean and standard deviation of the log-mel spectrogram of your dataset.
# See calc_norm_stats in evaluate.py for your reference.
stats = [-5.4919195,  5.0389895]

# Preprocessor and normalizer.
to_melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sample_rate,
    n_fft=cfg.n_fft,
    win_length=cfg.win_length,
    hop_length=cfg.hop_length,
    n_mels=cfg.n_mels,
    f_min=cfg.f_min,
    f_max=cfg.f_max,
)
normalizer = PrecomputedNorm(stats)

# Load pretrained weights.
model = AudioNTT2020(d=cfg.feature_d)
model.load_weight('pretrained_weights/AudioNTT2020-BYOLA-64x96d2048.pth', device)

# Load your audio file.
wav, sr = torchaudio.load('work/16k/spcv2/one/00176480_nohash_0.wav') # a sample from SPCV2 for now
assert sr == cfg.sample_rate, "Let's convert the audio sampling rate in advance, or do it here online."

# Convert to a log-mel spectrogram, then normalize.
lms = normalizer((to_melspec(wav) + torch.finfo(torch.float).eps).log())

# Now, convert the audio to the representation.
features = model(lms.unsqueeze(0))
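The statistics passed to the normalizer are a mean and a standard deviation over the log-mel spectrograms of your dataset. The sketch below shows one way to compute them, assuming the normalizer standardizes as (x - mean) / std. The helper names `norm_stats` and `normalize` are hypothetical, and the clips are synthetic; calc_norm_stats in evaluate.py remains the reference.

```python
import torch


def norm_stats(log_mels):
    """Return [mean, std] over every element of a collection of log-mel spectrograms."""
    # Flatten each spectrogram so that clips of different lengths can be pooled.
    values = torch.cat([lms.flatten() for lms in log_mels])
    return [values.mean().item(), values.std().item()]


def normalize(lms, stats):
    """Standardize a log-mel spectrogram with precomputed [mean, std]."""
    mean, std = stats
    return (lms - mean) / std


torch.manual_seed(0)
# Synthetic stand-ins for log-mel spectrograms of shape (1, n_mels, time).
clips = [torch.randn(1, 64, t) * 5.0 - 5.5 for t in (96, 120, 80)]
stats = norm_stats(clips)
normalized = torch.cat([normalize(c, stats).flatten() for c in clips])
print(stats)
```

After normalization, the pooled values have approximately zero mean and unit standard deviation, which is the input range the encoder is assumed to expect here.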

Training From Scratch

You can also train models. The following is an example of training on FSD50K.

  1. Convert all samples to 16 kHz. This converts all FSD50K files into the folder work/16k/fsd50k while preserving the folder structure.

    python -m utils.convert_wav /path/to/fsd50k work/16k/fsd50k
  2. Start training. This example trains with all development-set audio samples from FSD50K.

    python train.py work/16k/fsd50k/FSD50K.dev_audio

Refer to Table VI in our paper for the performance of a model trained on FSD50K.

Pretrained Weights

We include 3 pretrained weights of our encoder network.

Method  Dim.    Filename                           NSynth  US8K   VoxCeleb1  VoxForge  SPCV2/12  SPCV2  Average
BYOL-A  512-d   AudioNTT2020-BYOLA-64x96d512.pth   69.1%   78.2%  33.4%      83.5%     86.5%     88.9%  73.3%
BYOL-A  1024-d  AudioNTT2020-BYOLA-64x96d1024.pth  72.7%   78.2%  38.0%      88.5%     90.1%     91.4%  76.5%
BYOL-A  2048-d  AudioNTT2020-BYOLA-64x96d2048.pth  74.1%   79.1%  40.1%      90.2%     91.0%     92.2%  77.8%
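The Average column in the table above is consistent with an unweighted mean of the six task accuracies, rounded to one decimal place. The short check below reproduces it from the reported numbers; the helper name `mean_accuracy` is illustrative.

```python
# Accuracies (%) copied from the table above, in column order:
# NSynth, US8K, VoxCeleb1, VoxForge, SPCV2/12, SPCV2, followed by the reported average.
results = {
    "512-d": ([69.1, 78.2, 33.4, 83.5, 86.5, 88.9], 73.3),
    "1024-d": ([72.7, 78.2, 38.0, 88.5, 90.1, 91.4], 76.5),
    "2048-d": ([74.1, 79.1, 40.1, 90.2, 91.0, 92.2], 77.8),
}


def mean_accuracy(scores):
    """Unweighted mean over tasks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


computed = {dim: mean_accuracy(scores) for dim, (scores, _) in results.items()}
for dim, (_, reported) in results.items():
    print(dim, computed[dim], reported)
```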

License

This implementation is provided for your evaluation of the BYOL-A papers; see LICENSE for details.

Acknowledgements

BYOL-A is built on top of byol-pytorch, a BYOL implementation by Phil Wang (@lucidrains). We thank Phil for open-sourcing such sophisticated code.

@misc{wang2020byol-pytorch,
  author =       {Phil Wang},
  title =        {Bootstrap Your Own Latent (BYOL), in Pytorch},
  howpublished = {\url{https://github.com/lucidrains/byol-pytorch}},
  year =         {2020}
}
