Metadata, scripts and baselines for the MTG-Jamendo dataset

The MTG-Jamendo Dataset

We present the MTG-Jamendo Dataset, a new open dataset for music auto-tagging. It is built using music available at Jamendo under Creative Commons licenses and tags provided by content uploaders. The dataset contains over 55,000 full audio tracks with 195 tags from genre, instrument, and mood/theme categories. We provide elaborated data splits for researchers and report the performance of a simple baseline approach on five different sets of tags: genre, instrument, mood/theme, top-50, and overall.

This repository contains metadata, scripts, instructions on how to download and use the dataset and reproduce baseline results.

A subset of the dataset was used in the Emotion and Theme Recognition in Music Task within MediaEval 2019-2021.

Structure

Metadata files in data

Pre-processing

  • raw.tsv (56,639) - raw file without postprocessing
  • raw_30s.tsv (55,701) - tracks with duration more than 30s
  • raw_30s_cleantags.tsv (55,701) - with tags merged according to tag_map.json
  • raw_30s_cleantags_50artists.tsv (55,609) - with tags that have at least 50 unique artists
  • tag_map.json - map of tags that we merged
  • tags_top50.txt - list of top 50 tags
  • autotagging.tsv = raw_30s_cleantags_50artists.tsv - base file for autotagging (after all postprocessing, 195 tags)

Subsets

  • autotagging_top50tags.tsv (54,380) - only top 50 tags according to tag frequency in terms of tracks
  • autotagging_genre.tsv (55,215) - only tracks with genre tags (95 tags), and only those tags
  • autotagging_instrument.tsv (25,135) - instrument tags (41 tags)
  • autotagging_moodtheme.tsv (18,486) - mood/theme tags (59 tags)

Splits

  • splits folder contains training/validation/testing sets for autotagging.tsv and subsets

Note: A few tags are discarded in the splits to guarantee the same list of tags across all splits. For autotagging.tsv, this results in 55,525 tracks annotated by 87 genre tags, 40 instrument tags, and 56 mood/theme tags available in the splits.

Splits are generated from autotagging.tsv, containing all tags. For each split, the related subsets (top50, genre, instrument, mood/theme) are built filtering out unrelated tags and tracks without any tags.

Some additional metadata from Jamendo (artist, album name, track title, release date, track URL) is available in raw.meta.tsv (56,693).

Statistics in stats

Top 20 tags per category

Statistics of the number of tracks, albums and artists per tag, sorted by number of artists. Each directory has statistics for the metadata file with the same name. Here are the statistics for the autotagging set. Statistics for the category-based subsets are not kept separately because they are already included in autotagging.

Using the dataset

Requirements

  • Python 3.7+
  • Create virtual environment and install requirements
python -m venv venv
source venv/bin/activate
pip install -r scripts/requirements.txt

The original requirements are kept in requirements-orig.txt

Downloading the data

All audio is distributed in 320kbps MP3 format. We recommend using this version of audio by default. For smaller download sizes, we also provide a lower-bitrate mono version of the same audio (converted from the full-quality version to mono with LAME VBR 2, i.e. lame -V 2). In addition, we provide precomputed mel-spectrograms, which are distributed as NumPy arrays in NPY format. We also provide precomputed statistical features from Essentia (used in the AcousticBrainz music database) in JSON format. The audio files and the NPY/JSON files are split into folders packed into TAR archives. The dataset is hosted online at MTG UPF.

We provide the following data subsets:

  • raw_30s/audio - all available audio for raw_30s.tsv in full quality (508 GB)
  • raw_30s/audio-low - all available audio for raw_30s.tsv in low quality (156 GB)
  • raw_30s/melspecs - mel-spectrograms for raw_30s.tsv (229 GB)
  • autotagging-moodtheme/audio - audio for the mood/theme subset autotagging_moodtheme.tsv in full quality (152 GB)
  • autotagging-moodtheme/audio-low - audio for the mood/theme subset autotagging_moodtheme.tsv in low quality (46 GB)
  • autotagging-moodtheme/melspecs - mel-spectrograms for the autotagging_moodtheme.tsv subset (68 GB)

We provide a script to download and validate all files in the dataset. See its help message for more information:

python scripts/download/download.py -h
usage: download.py [-h] [--dataset {raw_30s,autotagging_moodtheme}]
                   [--type {audio,audio-low,melspecs,acousticbrainz}]
                   [--from {mtg,mtg-fast}] [--unpack] [--remove]
                   outputdir

Download the MTG-Jamendo dataset

positional arguments:
  outputdir             directory to store the dataset

options:
  -h, --help            show this help message and exit
  --dataset {raw_30s,autotagging_moodtheme}
                        dataset to download (default: raw_30s)
  --type {audio,audio-low,melspecs,acousticbrainz}
                        type of data to download (audio, audio in low quality,
                        mel-spectrograms, AcousticBrainz features) (default: audio)
  --from {mtg,mtg-fast}
                        download from MTG (server in Spain, slow),
                        or fast MTG mirror (Finland) (default: mtg-fast)
  --unpack              unpack tar archives (default: False)
  --remove              remove tar archives while unpacking one by one (use to
                        save disk space) (default: False)

For example, to download audio for the autotagging_moodtheme.tsv subset, unpack and validate all tar archives:

mkdir /path/to/download
python3 scripts/download/download.py --dataset autotagging_moodtheme --type audio /path/to/download --unpack --remove

The unpacking process runs after the tar archive downloads are complete and validated. In case of download errors, re-run the script to download the missing files.

Due to the large size of the dataset, it can be useful to include the --remove flag to save disk space: in this case, tar archives are unpacked and immediately removed one by one.

Loading data in Python

Assuming you are working in the scripts folder:

import commons

input_file = '../data/autotagging.tsv'
tracks, tags, extra = commons.read_file(input_file)

tracks is a dictionary with track_id as key and track data as value:

{
    1376256: {
        'artist_id': 490499,
        'album_id': 161779,
        'path': '56/1376256.mp3',
        'duration': 166.0,
        'tags': [
            'genre---easylistening',
            'genre---downtempo',
            'genre---chillout',
            'mood/theme---commercial',
            'mood/theme---corporate',
            'instrument---piano'
        ],
        'genre': {'chillout', 'downtempo', 'easylistening'},
        'mood/theme': {'commercial', 'corporate'},
        'instrument': {'piano'}
    }
    ...
}
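
As a rough illustration of working with this structure, the sketch below selects tracks by tag and minimum duration. The sample dictionary mirrors the layout shown above. The helper name tracks_with_tag and the second track entry are hypothetical and are not part of commons.

```python
# Sample data in the same shape as the `tracks` dictionary shown above.
# The second entry is invented for illustration only.
tracks = {
    1376256: {
        'artist_id': 490499,
        'album_id': 161779,
        'path': '56/1376256.mp3',
        'duration': 166.0,
        'tags': [
            'genre---easylistening',
            'genre---downtempo',
            'genre---chillout',
            'mood/theme---commercial',
            'mood/theme---corporate',
            'instrument---piano',
        ],
        'genre': {'chillout', 'downtempo', 'easylistening'},
        'mood/theme': {'commercial', 'corporate'},
        'instrument': {'piano'},
    },
    1: {
        'artist_id': 2,
        'album_id': 3,
        'path': '01/1.mp3',
        'duration': 45.0,
        'tags': ['genre---rock'],
        'genre': {'rock'},
    },
}


def tracks_with_tag(tracks, category, tag, min_duration=0.0):
    """Return sorted ids of tracks carrying `tag` in `category` (hypothetical helper).

    `.get` is used so the function also works if a track has no entry
    for the requested category.
    """
    return sorted(
        track_id
        for track_id, track in tracks.items()
        if tag in track.get(category, set()) and track['duration'] >= min_duration
    )


piano_ids = tracks_with_tag(tracks, 'instrument', 'piano', min_duration=60.0)
print(piano_ids)
```

The same pattern extends to any combination of the per-category sets, for example requiring both a genre and a mood/theme tag.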

tags maps each category and tag to the set of track_ids that carry it:

{
    'genre': {
        'easylistening': {1376256, 1376257, ...},
        'downtempo': {1376256, 1376257, ...},
        ...
    },
    'mood/theme': {...},
    'instrument': {...}
}
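
Because each tag maps to a set of track ids, ranking tags by track count and intersecting tags are both short operations. The sketch below assumes only the layout shown above. The sample ids and the helper name top_tags are illustrative and do not come from the repository scripts.

```python
# Sample data shaped like the `tags` mapping above; the ids are illustrative.
tags = {
    'genre': {
        'easylistening': {1376256, 1376257},
        'downtempo': {1376256, 1376257, 1376258},
    },
    'mood/theme': {'corporate': {1376256}},
    'instrument': {'piano': {1376256, 1376258}},
}


def top_tags(tags, n):
    """Rank 'category---tag' names by number of tracks (hypothetical helper)."""
    counts = {
        f'{category}---{tag}': len(track_ids)
        for category, by_tag in tags.items()
        for tag, track_ids in by_tag.items()
    }
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


ranking = top_tags(tags, 2)
print(ranking)

# Tracks tagged with both a given genre and a given instrument:
# a plain set intersection.
both = tags['genre']['downtempo'] & tags['instrument']['piano']
print(sorted(both))
```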

extra holds information needed to format the output file. Pass it to write_file if you use that function; otherwise you can ignore it.

Reproduce postprocessing & statistics

  • Recompute statistics for raw and raw_30s
python scripts/get_statistics.py data/raw.tsv stats/raw
python scripts/get_statistics.py data/raw_30s.tsv stats/raw_30s
  • Clean tags and recompute statistics (raw_30s_cleantags)
python scripts/clean_tags.py data/raw_30s.tsv data/tag_map.json data/raw_30s_cleantags.tsv
python scripts/get_statistics.py data/raw_30s_cleantags.tsv stats/raw_30s_cleantags
  • Filter out tags with low number of unique artists and recompute statistics (raw_30s_cleantags_50artists)
python scripts/filter_fewartists.py data/raw_30s_cleantags.tsv 50 data/raw_30s_cleantags_50artists.tsv --stats-directory stats/raw_30s_cleantags_50artists
  • The autotagging file in data and the autotagging folder in stats are symbolic links to raw_30s_cleantags_50artists

  • Visualize top 20 tags per category

python scripts/visualize_tags.py stats/autotagging 20  # generates top20.pdf figure
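
The artist-count filter used in the steps above can be sketched as follows. This is a minimal illustration of the idea (drop tags used by too few unique artists, then drop tracks left without tags), not the actual code of scripts/filter_fewartists.py. The function name, the toy data, and the threshold of 2 are assumptions made for the example.

```python
from collections import defaultdict


def filter_few_artists(tracks, min_artists):
    """Drop tags used by fewer than `min_artists` unique artists,
    then drop tracks left with no tags (illustrative sketch)."""
    # Collect the set of distinct artists behind every tag.
    artists_per_tag = defaultdict(set)
    for track in tracks.values():
        for tag in track['tags']:
            artists_per_tag[tag].add(track['artist_id'])

    kept_tags = {
        tag for tag, artists in artists_per_tag.items()
        if len(artists) >= min_artists
    }

    # Keep only surviving tags; tracks with no remaining tags are removed.
    filtered = {}
    for track_id, track in tracks.items():
        remaining = [tag for tag in track['tags'] if tag in kept_tags]
        if remaining:
            filtered[track_id] = {**track, 'tags': remaining}
    return filtered, kept_tags


# Toy data: three artists; the threshold of 2 stands in for the real 50.
toy = {
    1: {'artist_id': 10, 'tags': ['genre---rock', 'instrument---piano']},
    2: {'artist_id': 11, 'tags': ['genre---rock']},
    3: {'artist_id': 12, 'tags': ['instrument---harp']},
}
filtered, kept = filter_few_artists(toy, min_artists=2)
print(sorted(filtered), sorted(kept))
```

Counting artists rather than tracks keeps a single prolific artist from carrying a tag into the dataset on their own.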

Recreate subsets

  • Create subset with only top50 tags by number of tracks
python scripts/filter_toptags.py data/autotagging.tsv 50 data/autotagging_top50tags.tsv --stats-directory stats/autotagging_top50tags --tag-list data/tags/tags_top50.txt
python scripts/split_filter_subset.py data/splits autotagging autotagging_top50tags --subset-file data/tags/top50.txt
  • Create subset with only mood/theme tags (or other category: genre, instrument)
python scripts/filter_category.py data/autotagging.tsv mood/theme data/autotagging_moodtheme.tsv --tag-list data/tags/moodtheme.txt
python scripts/split_filter_subset.py data/splits autotagging autotagging_moodtheme --category mood/theme 
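
Conceptually, a category subset keeps only the tags of one category and drops tracks that end up with no tags. The sketch below illustrates that idea on toy data, relying on the category---tag naming shown earlier. It is not the implementation of scripts/filter_category.py, and the function name and sample tracks are hypothetical.

```python
def filter_category(tracks, category):
    """Keep only tags of `category` and the tracks that still have tags
    (illustrative sketch)."""
    prefix = category + '---'
    subset = {}
    for track_id, track in tracks.items():
        kept = [tag for tag in track['tags'] if tag.startswith(prefix)]
        if kept:
            # Copy the track so the input dictionary is left untouched.
            subset[track_id] = {**track, 'tags': kept}
    return subset


# Toy tracks; ids and tags are invented for the example.
toy = {
    1: {'tags': ['genre---rock', 'mood/theme---happy']},
    2: {'tags': ['genre---pop']},
    3: {'tags': ['mood/theme---sad', 'instrument---piano']},
}
moodtheme = filter_category(toy, 'mood/theme')
print({track_id: track['tags'] for track_id, track in moodtheme.items()})
```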

Reproduce experiments

  • Preprocessing
python scripts/baseline/get_npy.py run 'your_path_to_spectrogram_npy'
  • Train
python scripts/baseline/main.py --mode 'TRAIN' 
  • Test
python scripts/baseline/main.py --mode 'TEST' 
optional arguments:
  --batch_size                batch size (default: 32)
  --mode {'TRAIN', 'TEST'}    train or test (default: 'TRAIN')
  --model_save_path           path to save trained models (default: './models')
  --audio_path                path of the dataset (default='/home')
  --split {0, 1, 2, 3, 4}     split of data to use (default=0)
  --subset {'all', 'genre', 'instrument', 'moodtheme', 'top50tags'}
                              subset to use (default='all')

Results

Related Datasets

The MTG-Jamendo Dataset can be linked to related datasets tailored to specific applications.

Music Classification Annotations

The Music Classification Annotations contains annotations for the split-0 test set according to the taxonomies of 15 existing music classification datasets including genres, moods, danceability, voice/instrumental, gender, and tonal/atonal. These labels are suitable for training individual classifiers or learning everything in a multi-label setup (auto-tagging). Most of the taxonomies were annotated by three different annotators. We provide the subset of annotations with perfect inter-annotator agreement ranging from 411 to 8756 tracks depending on the taxonomy.

Song Describer

Song Describer is a platform for crowdsourcing music captions (audio-text pairs) for audio tracks in MTG-Jamendo.

Research challenges using the dataset

Citing the dataset

Please consider citing the following publication when using the dataset:

Bogdanov, D., Won M., Tovstogan P., Porter A., & Serra X. (2019). The MTG-Jamendo Dataset for Automatic Music Tagging. Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019).

@conference {bogdanov2019mtg,
    author = "Bogdanov, Dmitry and Won, Minz and Tovstogan, Philip and Porter, Alastair and Serra, Xavier",
    title = "The MTG-Jamendo Dataset for Automatic Music Tagging",
    booktitle = "Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019)",
    year = "2019",
    address = "Long Beach, CA, United States",
    url = "http://hdl.handle.net/10230/42015"
}

An expanded version of the paper describing the dataset and the baselines will be announced later.

License

  • The code in this repository is licensed under Apache 2.0
  • The metadata is licensed under CC BY-NC-SA 4.0.
  • The audio files are licensed under Creative Commons licenses, see individual licenses for details in audio_licenses.txt.

Copyright 2019-2022 Music Technology Group

Acknowledgments

This work was funded by the predoctoral grant MDM-2015-0502-17-2 from the Spanish Ministry of Economy and Competitiveness linked to the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).

This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688382 "AudioCommons".
