MIDI Dataset

The goal of this project is to match and align a very large collection of MIDI files to a very large collection of audio files, so that the MIDI data can be used to infer ground-truth information about the audio. This repository also contains code for reproducing most of the results in [1], which describes the goals, ideas, and research behind this project in much greater detail.

Notes

  • If you're looking for a high-level overview of the techniques used in this project and the results, take a look at chapter 1 of my thesis [1].

  • This repository contains code for performing the matching; if you're looking for the "Lakh MIDI Dataset" itself (the result of using this code to match a collection of 176,581 MIDI files to the Million Song Dataset), you can find that here.

  • If you just want a tutorial on potential uses of the Lakh MIDI Dataset, take a look at the Tutorial.ipynb notebook.

  • Over time, this project has undergone some restructuring; if you're looking for the version of this repository used in the experiments in [2], check this tag.

Prerequisites

Before using the code in this repository, you will need to gather some data and software.

Data

Create a folder called data in the root of this repository. In it, you need the following subdirectories:

  • clean_midi, which should contain the "clean MIDI subset", as described in section 5.2.1 of [1]. These MIDI files should live in data/clean_midi/mid. You can obtain this collection here.
  • unique_midi, which should contain LMD-full, the 176,581 MIDI files of the Lakh MIDI Dataset. These MIDI files should live in data/unique_midi/mid. You can obtain this collection here.
  • uspop2002, cal10k, cal500, and msd, which should each contain audio files from each respective dataset (msd being the 7digital preview clips corresponding to the Million Song Dataset). The MP3 files should live in, e.g., data/uspop2002/mp3. Unfortunately, obtaining these MP3 files is non-trivial. If you need help tracking them down, please contact me directly.
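
As a quick sanity check before running anything, a minimal sketch like the following can verify that the expected layout is in place (the data/msd/mp3 path is an assumption, mirroring the layout of the other audio datasets):

    import os

    # Expected layout, as described above. The msd audio directory is
    # assumed to mirror the mp3 layout of the other audio datasets.
    expected = ['data/clean_midi/mid', 'data/unique_midi/mid',
                'data/uspop2002/mp3', 'data/cal10k/mp3',
                'data/cal500/mp3', 'data/msd/mp3']
    for path in expected:
        if not os.path.isdir(path):
            print('Missing directory: {}'.format(path))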

File lists

All of the datasets in the data subdirectory (except for unique_midi) should have a corresponding file list in the file_lists subdirectory. The only one not included in this repository is msd.txt; you can obtain it from the MSD directly (it's distributed with the MSD as unique_tracks.txt), or you can download it here and rename it to msd.txt.
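
For reference, unique_tracks.txt delimits its fields (track ID, song ID, artist, title) with the literal string <SEP>; a minimal sketch for reading the renamed file, assuming that standard MSD format:

    # Read file_lists/msd.txt, i.e. the MSD's unique_tracks.txt; each
    # line holds track ID, song ID, artist, and title joined by '<SEP>'.
    with open('file_lists/msd.txt') as f:
        for line in f:
            track_id, song_id, artist, title = line.rstrip('\n').split('<SEP>')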

Software

All of the code in this repository is written for Python 2.7; it will likely need modification to work with Python 3.x. Here is a potentially incomplete list of the Python libraries used in this project:

  • numpy
  • scipy
  • librosa
  • pretty_midi
  • whoosh
  • joblib
  • deepdish
  • dhs
  • pse
  • msgpack
  • msgpack_numpy
  • lasagne
  • theano
  • sklearn
  • djitw
  • simple_spearmint
  • spearmint

Hardware

All of this code was designed to be run on a server with 64 GB of RAM, 12 CPU cores, an NVIDIA GTX 980 Ti GPU, and plenty of hard drive space. If your own setup has fewer resources, you may need to modify some of the scripts in various places so that they use an appropriate amount of RAM, number of parallel processes, etc. In any case, please note that running all of the experiments and steps from beginning to end will take at least a few weeks of compute time.
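
For example, scripts which parallelize work with joblib can typically be scaled down by lowering n_jobs; a hedged sketch (the process function and file list here are placeholders, not the project's actual code):

    from joblib import Parallel, delayed

    def process(midi_file):
        # Placeholder for per-file work, e.g. feature extraction.
        return midi_file

    midi_files = ['a.mid', 'b.mid']  # placeholder file list
    # Lower n_jobs on machines with fewer cores or less RAM; each
    # worker process holds its own copy of whatever it loads.
    results = Parallel(n_jobs=4, verbose=1)(
        delayed(process)(f) for f in midi_files)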

Process

The general structure of this repository is as follows: collections of shared utilities (experiment_utils.py, feature_extraction.py, whoosh_search.py) live at the base level; one-time-use scripts for assembling data and performing the actual MIDI-to-audio matching live in the scripts directory; and experiments for evaluating the effectiveness of different matching techniques live in experiments. Any data or results generated by running these files are written out to a results directory. To re-run all of the experiments, matching, etc., proceed as described below.

  1. Run create_whoosh_indices.py. This uses the file lists to create Whoosh indices, which allow for fuzzy text matching of metadata; we use this fuzzy matching to create training data for the different matching algorithms. The indices are written out to, e.g., data/msd/index/. (A minimal Whoosh sketch appears after this list.)
  2. Run text_match_datasets.py. This uses the Whoosh indices to match MIDI files from clean_midi (whose metadata is ostensibly reliable) to entries in the different audio datasets. It also takes care to group audio files which are recordings of the same song. The results are written to results/text_matches.js.
  3. Run create_msd_cqts.py. This pre-computes a constant-Q spectrogram for every entry in the Million Song Dataset, which saves time later on, as these are needed at various steps throughout the process. They are written to data/msd/h5. (See the librosa sketch after this list.)
  4. Run align_text_matches.py. This uses dynamic time warping (specifically, the approach proposed in [3]) to align each MIDI-audio pair found by metadata matching. The results are written to results/clean_midi_aligned, and include both the aligned MIDI files in results/clean_midi_aligned/mid and "diagnostics files" in results/clean_midi_aligned/h5. The diagnostics files contain information about whether each match is truly a match (an incorrect match can be caused, e.g., by incorrect metadata or a bad transcription). (A simplified alignment sketch appears after this list.)
  5. Run split_training_data.py. This splits the matches into train, validation, development, and test collections which are used for evaluating each of the different matching approaches implemented in experiments.
  6. Run create_training_data.py. This inspects the results of align_text_matches.py to find good matches and generates training data for different matching approaches in a convenient format. It essentially produces saved constant-Q spectrograms of audio files, aligned MIDI files, unaligned MIDI files, and aligned MIDI piano rolls, in various folders in results.
  7. Run the experiments! Each subdirectory of the experiments directory corresponds to a different MIDI-audio matching technique. Each experiment contains at least a script called match_msd.py, which uses the matching technique to match each MIDI file in either the development or test set to the MSD and writes out the results. Most experiments also have a script called precompute.py, which precomputes any necessary features/representations of entries in the development and test sets. Finally, the experiments based on machine learning techniques additionally have a script called parameter_search.py, which trains any models necessary for performing the matching. In short, to run each experiment: run parameter_search.py if it exists, then precompute.py, and finally match_msd.py. The results can be used to measure the effectiveness of each approach. There isn't a script which performs this analysis automatically, but there is a great deal of analysis in my thesis [1].
  8. To actually match the unique_midi collection to the Million Song Dataset, use the match.py script. For flexibility, this script takes two command-line arguments: first, a glob matching the MIDI files you want to match, and second, a path where the results should be written. To match the entire unique_midi dataset to the MSD, call it like so (quoting the glob so the shell doesn't expand it): python match.py "../data/unique_midi/mid/*/*.mid" output_path. This will produce (in output_path) one file per MIDI file processed, listing potential matches in the MSD and the corresponding confidence scores.
  9. To assemble a collection of matched-and-aligned MIDI files, use the script assemble_aligned_matches.py. This will find all MIDI-audio matches produced by match.py which have a sufficiently high confidence score, re-align them, and write out the aligned MIDI file, along with the unaligned MIDI, MP3 file, and MSD H5 file, for convenience. In essence, this is how, at long last, each component of the Lakh MIDI dataset is produced.
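
To make steps 1 and 2 concrete, here is a minimal, self-contained Whoosh sketch; the schema and field names are illustrative, not necessarily those used by create_whoosh_indices.py:

    from __future__ import print_function
    import os
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.index import create_in
    from whoosh.qparser import MultifieldParser

    # Illustrative schema; the project's actual schema may differ.
    schema = Schema(track_id=ID(stored=True),
                    artist=TEXT(stored=True), title=TEXT(stored=True))
    if not os.path.exists('toy_index'):
        os.mkdir('toy_index')
    ix = create_in('toy_index', schema)
    writer = ix.writer()
    writer.add_document(track_id=u'TR123', artist=u'The Beatles',
                        title=u'Let It Be')
    writer.commit()

    # Match a noisy metadata string against the artist and title fields.
    with ix.searcher() as searcher:
        parser = MultifieldParser(['artist', 'title'], ix.schema)
        for hit in searcher.search(parser.parse(u'beatles let it be'), limit=5):
            print(hit['track_id'], hit['artist'], hit['title'])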
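Step 3's features can be computed with librosa; a minimal sketch using default parameters, which are not necessarily the hop size, bin count, etc. used by create_msd_cqts.py:

    import numpy as np
    import librosa

    # Compute a magnitude constant-Q spectrogram of a preview clip.
    audio, fs = librosa.load('example.mp3')
    cqt = np.abs(librosa.cqt(audio, sr=fs))
    print(cqt.shape)  # (n_bins, n_frames)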
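Finally, a simplified sketch of step 4's alignment idea: synthesize the MIDI, compute constant-Q spectrograms of both signals, find a DTW path, and warp the MIDI's event times accordingly. This uses librosa's generic DTW as a stand-in for the tuned ("gully"/penalty) formulation of [3], and assumes pretty_midi's fluidsynth-based synthesis is available:

    import librosa
    import numpy as np
    import pretty_midi

    pm = pretty_midi.PrettyMIDI('song.mid')
    audio, fs = librosa.load('song.mp3')

    # Constant-Q spectrograms of the audio and the synthesized MIDI.
    hop = 512
    audio_cqt = np.abs(librosa.cqt(audio, sr=fs, hop_length=hop))
    midi_cqt = np.abs(librosa.cqt(pm.fluidsynth(fs=fs), sr=fs,
                                  hop_length=hop))

    # Generic DTW over cosine distances; [3] instead uses a variant
    # with a "gully" and an additive non-diagonal move penalty.
    _, path = librosa.sequence.dtw(midi_cqt, audio_cqt, metric='cosine')
    path = path[::-1]  # librosa returns the warping path end-to-start

    # Convert frame indices to seconds and warp the MIDI's timing.
    midi_times = librosa.frames_to_time(path[:, 0], sr=fs, hop_length=hop)
    audio_times = librosa.frames_to_time(path[:, 1], sr=fs, hop_length=hop)
    pm.adjust_times(midi_times, audio_times)
    pm.write('song_aligned.mid')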

References

  1. Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.
  2. Colin Raffel and Daniel P. W. Ellis. "Large-Scale Content-Based Matching of MIDI and Audio Files". Proceedings of the 16th International Society for Music Information Retrieval Conference, 2015.
  3. Colin Raffel and Daniel P. W. Ellis. "Optimizing DTW-Based Audio-to-MIDI Alignment and Matching". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
  4. Colin Raffel and Daniel P. W. Ellis. "Pruning Subsequence Search with Attention-Based Embedding". Proceedings of the 41st IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
