Corpus ID: 208334929

mirdata: Software for Reproducible Usage of Datasets

Rachel M. Bittner, Magdalena Fuentes, David Rubinstein, Andreas Jansson, Keunwoo Choi, Thor Kell
There are a number of efforts in the MIR community towards increased reproducibility, such as creating more open datasets, publishing code, and using common software libraries, e.g. for evaluation. However, when it comes to datasets, there is usually little guarantee that researchers are using the exact same data in the same way, which, among other issues, makes comparisons of different methods on the “same” datasets problematic. In this paper, we first show how (often unknown) differences…
MusPy: A Toolkit for Symbolic Music Generation
This paper presents statistical analysis of the eleven datasets currently supported by MusPy and conducts a cross-dataset generalizability experiment, providing a map of domain overlap between various commonly used datasets and showing that some datasets contain more representative cross-genre samples than others.
Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes
The DALI dataset is presented, the tools developed to work with the data are explained, and the approach used to build it is detailed, establishing a loop whereby dataset creation and model learning interact, benefiting each other.
Dagstuhl ChoirSet: A Multitrack Dataset for MIR Research on Choral Singing
Detailed insights are given into all stages of creating Dagstuhl ChoirSet (DCS), a multitrack dataset of a cappella choral music designed to support MIR research on choral singing.
OrchideaSOL: a dataset of extended instrumental techniques for computer-aided orchestration
OrchideaSOL is a reduced and modified subset of Studio On Line, or SOL for short, a dataset developed at Ircam between 1996 and 1998 and designed to be used as the default dataset for the OrchideA framework for target-based computer-aided orchestration.
Audio-Based Music Structure Analysis: Current Trends, Open Challenges, and Applications
It could be beneficial for audio-based music structural analysis systems to be application-dependent in order to increase their usability and highlight the subjectivity, ambiguity, and hierarchical nature of musical structure as essential factors to address in future work.


JAMS: A JSON Annotated Music Specification for Reproducible MIR Research
JAMS, a JSON-based music annotation format capable of addressing the evolving research requirements of the community, is proposed, designed to support existing data while encouraging the transition to more consistent, comprehensive, well-documented annotations.
MIR_EVAL: A Transparent Implementation of Common MIR Metrics
Central to the field of MIR research is the evaluation of algorithms used to extract information from music data. We present mir_eval, an open source software library which provides a transparent and…
DataDeps.jl: Repeatable Data Setup for Replicable Data Science
DataDeps.jl simplifies extending research software by automatically managing the dependencies and makes it easier to run another author's code, thus enhancing the reproducibility of data science research.
MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research
MedleyDB, a dataset of annotated, royalty-free multitrack recordings, is shown to be considerably more challenging than the current test sets used in the MIREX evaluation campaign, thus opening new research avenues in melody extraction research.
Unbiased look at dataset bias
A comparison study using a set of popular datasets, evaluated based on a number of criteria including: relative data bias, cross-dataset generalization, effects of closed-world assumption, and sample value is presented.
Evaluating Hierarchical Structure in Music Annotations
An evaluation metric is derived which can compare hierarchical annotations holistically across multiple levels, and it is used to investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement.
TensorFlow: A system for large-scale machine learning
The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
The Million Song Dataset
The Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks, is introduced and positive results on year prediction are shown, and the future development of the dataset is discussed.
Audio Set: An ontology and human-labeled dataset for audio events
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research
Because of an increased abundance of methods, the proliferation of software toolkits, the explosion of machine learning, and a focus shift toward more realistic problem settings, modern research systems are substantially more complex than their predecessors.