Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes

  • Gabriel Meseguer-Brocal, Alice Cohen-Hadria, Geoffroy Peeters
  • Transactions of the International Society for Music Information Retrieval (TISMIR)
The DALI dataset is a large collection of time-aligned symbolic vocal melody notations (notes) and lyrics at four levels of granularity. DALI contains 5358 songs in its first version and 7756 in the second. In this article, we present the dataset, describe the tools developed for working with the data, and detail the approach used to build it. Our method is motivated by active learning and the teacher-student paradigm: we establish a loop whereby dataset creation and model learning interact, benefiting one another.
Filosax: A Dataset of Annotated Jazz Saxophone Recordings
The criteria used for choosing and sourcing the repertoire, the recording process and the semi-automatic transcription pipeline are outlined, and the use of the dataset to analyse musical phenomena such as swing timing and dynamics of typical musical figures is demonstrated.
vocadito: A dataset of solo vocals with f0, note, and lyric annotations
This work presents a small dataset entitled vocadito, consisting of 40 short excerpts of monophonic singing, sung in 7 different languages by singers with varying levels of training and recorded on a variety of devices.
Improving Lyrics Alignment through Joint Pitch Detection
This paper proposes a multi-task learning approach for lyrics alignment that incorporates pitch detection and can thus exploit a new source of highly accurate temporal information; the alignment accuracy is shown to improve with this approach.
The Words Remain the Same: Cover Detection with Lyrics Transcription
This work proposes a novel approach that leverages lyrics without requiring access to full texts, through the use of lyrics recognition on audio; it relies on the fusion of a singing voice recognition framework and a more classic tonal-based cover detection method.
Content based singing voice source separation via strong conditioning using aligned phonemes
It is shown that phoneme conditioning can be successfully applied to improve singing voice source separation, and strong conditioning using the aligned phonemes is explored.
User-centered evaluation of lyrics-to-audio alignment
The perceptual robustness of the metric most commonly used to evaluate lyrics-to-audio alignment is called into question, and the perception of audio and lyrics synchrony is investigated through two realistic experimental settings inspired by karaoke.
Zero-shot Singing Technique Conversion
Modifications to the AutoVC neural network framework are proposed for the task of singing technique conversion, utilising a pretrained singing technique encoder that extracts technique information, on which a decoder is conditioned during training.
Mining in Educational Data: Review and Future Directions
This review examines how data mining has been handled by researchers in the past and the most recent trends in data mining in educational research, and evaluates the likelihood of employing machine learning in the field of education.
Data Cleansing with Contrastive Learning for Vocal Note Event Annotations
This work proposes a novel data cleansing model for time-varying, structured labels which exploits the local structure of the labels, and demonstrates its usefulness for vocal note event annotations in music.


DALI: A Large Dataset of Synchronized Audio, Lyrics and notes, Automatically Created using Teacher-student Machine Learning Paradigm
DALI is introduced, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity; it is shown that this approach progressively improves the performance of the singing voice detection (SVD) system and yields better audio matching and alignment.
FMA: A Dataset for Music Analysis
The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections, and some suitable MIR tasks are discussed.
Word level lyrics-audio synchronization using separated vocals
This paper presents an approach to lyrics-audio alignment that compares synthesized speech with a vocal track separated from the full mix using source separation, taking a hierarchical approach to the problem.
End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-character Recognition Model
  • D. Stoller, S. Durand, S. Ewert
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
A novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components, outperforms the state-of-the-art by an order of magnitude.
Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment
The new accompaniment interface Song Prompter is introduced, which uses the automatically aligned lyrics to guide musicians through a song, and demonstrates that the automatic alignment is accurate enough to be used in a musical performance.
LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals
LyricAlly is presented, a prototype that automatically aligns acoustic musical signals with their corresponding textual lyrics, in a manner similar to manually aligned karaoke, using an appropriate pairing of audio and text processing.
MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research
MedleyDB, a dataset of annotated, royalty-free multitrack recordings, is shown to be considerably more challenging than the current test sets used in the MIREX evaluation campaign, thus opening new research avenues in melody extraction.
Automatic lyrics alignment for Cantonese popular music
The goal is to automate the process of lyrics alignment, a procedure which, to date, is still handled manually in the Cantonese popular song (Cantopop) industry; a dynamic time warping algorithm is used to align the lyrics.
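Several of the alignment papers listed here rely on dynamic time warping (DTW); the Cantopop entry above uses it directly. As a rough illustration only, and not any paper's actual implementation, a minimal DTW with a placeholder absolute-difference cost (real systems compare acoustic feature sequences, not scalars) can be sketched as:

```python
# Minimal dynamic time warping (DTW) sketch, illustrating the kind of
# alignment used in lyrics-to-audio synchronization. The cost function
# and scalar sequences are placeholders for real acoustic features.

def dtw(seq_a, seq_b, cost=lambda a, b: abs(a - b)):
    """Return the minimal alignment cost and the warping path."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # D[i][j] = minimal cost to align seq_a[:i] with seq_b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(seq_a[i - 1], seq_b[j - 1])
            # Allowed steps: diagonal match, insertion, deletion
            D[i][j] = c + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Backtrack from (n, m) to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1))
    return D[n][m], path[::-1]
```

The warping path pairs each element of one sequence with one or more elements of the other, which is what lets a fixed lyric line stretch or compress to match the timing of the sung audio.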
The NES Music Database: A multi-instrumental dataset with expressive performance attributes
The Nintendo Entertainment System Music Database is introduced, a large corpus allowing for separate examination of the tasks of composition and performance and a tool that renders generated compositions as NES-style audio by emulating the device's audio processor.
Acoustic Modeling for Automatic Lyrics-to-Audio Alignment
This work proposes using additional speech- and music-informed features and adapting acoustic models trained on a large amount of solo singing vocals towards polyphonic music using a small amount of in-domain data, reducing the domain mismatch between training and testing data.