DALI: A Large Dataset of Synchronized Audio, Lyrics and notes, Automatically Created using Teacher-student Machine Learning Paradigm

@article{MeseguerBrocal2018DALIAL,
  title={DALI: A Large Dataset of Synchronized Audio, Lyrics and notes, Automatically Created using Teacher-student Machine Learning Paradigm},
  author={Gabriel Meseguer-Brocal and Alice Cohen-Hadria and Geoffroy Peeters},
  journal={ArXiv},
  year={2018},
  volume={abs/1906.10606}
}
The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. Second, we explain our methodology, in which dataset creation and learning models interact through a teacher-student machine learning paradigm so that each benefits the other. We start with a set of manual annotations of draft time-aligned lyrics and notes made by non-expert users of…
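As a minimal, hypothetical sketch of the teacher-student loop the abstract describes (not the authors' actual DALI pipeline), the example below has a "teacher" singing-voice detector score how well each draft annotation agrees with the audio, keeps the best-aligned tracks to train a "student", and then promotes the student to teacher; all function names and the toy data are placeholders.

```python
# Hypothetical illustration of a teacher-student annotation/selection loop.
# The models, scoring function, and data below are placeholders, not DALI code.
import numpy as np

rng = np.random.default_rng(0)

def teacher_predict(model, features):
    """Frame-wise vocal-activity probabilities from a toy linear model."""
    return 1.0 / (1.0 + np.exp(-features @ model))

def agreement(voice_prob, annotation_mask):
    """Score how well annotated vocal frames agree with the teacher output."""
    return np.mean(voice_prob[annotation_mask == 1]) - np.mean(voice_prob[annotation_mask == 0])

def train_student(features_list, labels_list):
    """Fit a new linear model on the frames of the selected tracks (least squares)."""
    X = np.vstack(features_list)
    y = np.concatenate(labels_list)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Toy corpus: per-track frame features and (possibly noisy) draft annotations.
tracks = [(rng.normal(size=(200, 8)), rng.integers(0, 2, 200)) for _ in range(50)]
teacher = rng.normal(size=8)

for iteration in range(3):
    # 1. Teacher scores every track's annotation/audio agreement.
    scores = [agreement(teacher_predict(teacher, X), y) for X, y in tracks]
    # 2. Keep only the best-aligned tracks (here: the top half).
    keep = np.argsort(scores)[len(scores) // 2:]
    # 3. Train the student on the selected tracks and promote it to teacher.
    teacher = train_student([tracks[i][0] for i in keep],
                            [tracks[i][1] for i in keep])
    print(f"iteration {iteration}: kept {len(keep)} tracks")
```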


Creating DALI, a Large Dataset of Synchronized Audio, Lyrics, and Notes
TLDR
The DALI dataset is presented, the tools developed to work with the data are explained, and the approach used to build it is detailed, establishing a loop whereby dataset creation and model learning interact, benefiting each other.
MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription
TLDR
A novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net, is proposed; it processes temporal information in multiple parallel streams with varying resolutions, keeping the network compact and yielding faster inference and a higher recognition rate than identical TDNN streams.
Multilingual lyrics-to-audio alignment
TLDR
This investigation presents the first (to the best of the authors' knowledge) attempt to create a language-independent lyrics-to-audio alignment system, based on a Recurrent Neural Network model trained with a Connectionist Temporal Classification algorithm.
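As a rough, self-contained illustration of training an acoustic model with the Connectionist Temporal Classification objective mentioned above (not the cited system itself), the sketch below uses PyTorch's built-in nn.CTCLoss; the shapes, character vocabulary, and random inputs are placeholders.

```python
# Hypothetical CTC training step for a small recurrent acoustic model.
import torch
import torch.nn as nn

T, N, C = 100, 4, 30          # frames, batch size, characters (blank at index 0)
model = nn.LSTM(input_size=40, hidden_size=64, batch_first=False)
head = nn.Linear(64, C)
ctc = nn.CTCLoss(blank=0)

features = torch.randn(T, N, 40)              # stand-in for spectrogram frames
targets = torch.randint(1, C, (N, 12))        # integer-encoded lyric characters
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

hidden, _ = model(features)
log_probs = head(hidden).log_softmax(dim=-1)  # (T, N, C), as CTCLoss expects
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```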
PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription
TLDR
PDAugment is proposed, a data augmentation method that adjusts the pitch and duration of speech at the syllable level under the guidance of music scores to aid ALT training; the ALT system equipped with this method outperforms previous state-of-the-art systems by 5.9% and 18.1% WER, respectively.
On-Line Audio-to-Lyrics Alignment Based on a Reference Performance
TLDR
This work describes the first real-time-capable audio-to-lyrics alignment pipeline that is able to robustly track the lyrics of different languages without additional language information.
Lyrics segmentation via bimodal text–audio representation
TLDR
A convolutional neural network (CNN)-based model is proposed that learns to segment lyrics based on their repetitive text structure, together with novel features that reveal different kinds of repetition in the lyrics, for instance based on phonetic and syntactic properties.
Low Resource Audio-To-Lyrics Alignment from Polyphonic Music Recordings
TLDR
This study presents a novel method that performs audio-to-lyrics alignment with a low memory footprint regardless of the duration of the music recording, and utilizes the lyrics alignment system to segment the music recordings into sentence-level chunks.
Acoustic Modeling for Automatic Lyrics-to-Audio Alignment
TLDR
This work proposes using additional speech- and music-informed features and adapting acoustic models trained on a large amount of solo singing vocals to polyphonic music using a small amount of in-domain data, reducing the domain mismatch between training and testing data.
VOCANO: A note transcription framework for singing voice in polyphonic music
TLDR
VOCANO is presented, an open-source VOCAl NOte transcription framework built upon robust neural networks with multi-task and semi-supervised learning that outperforms the state of the art on public benchmarks across a wide variety of evaluation metrics.
Semi-supervised learning using teacher-student models for vocal melody extraction
TLDR
The results show that the SSL method significantly improves performance over supervised learning alone, and that the improvement depends on the teacher-student models, the amount of unlabeled data, the number of self-training iterations, and other training details.

References

Showing 1-10 of 29 references
FMA: A Dataset for Music Analysis
TLDR
The Free Music Archive is introduced, an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections; several suitable MIR tasks are also discussed.
Learning to Pinpoint Singing Voice from Weakly Labeled Examples
TLDR
This work investigates how well a singing voice detection system can be trained merely from song-wise annotations of vocal presence, and shows that it can not only detect singing voice in a test signal with a temporal accuracy close to the state of the art, but also localize the spectral bins with precision and recall close to a recent source separation method.
A low-latency, real-time-capable singing voice detection method with LSTM recurrent neural networks
TLDR
This work introduces a method based on Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) that outperforms the state-of-the-art baselines in terms of accuracy, while at the same time drastically reducing latency and increasing the temporal resolution of the detector.
Automatic Drum Transcription Using the Student-Teacher Learning Paradigm with Unlabeled Music Data
TLDR
This work addresses the challenge of insufficient labeled data by utilizing unlabeled music data from online resources: a student neural network is trained using labels generated by multiple teacher systems.
Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks
TLDR
A range of label-preserving audio transformations is applied, and pitch shifting is found to be the most helpful augmentation method for music data, reaching the state of the art on two public datasets.
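A minimal sketch of the kind of pitch-shifting augmentation referred to above, using librosa's pitch_shift on a synthetic tone; the semitone range is an arbitrary assumption rather than the paper's exact setting.

```python
# Hypothetical label-preserving pitch-shift augmentation with librosa.
import numpy as np
import librosa

# Toy signal standing in for a vocal track: a 2-second 440 Hz tone.
sr = 22050
t = np.arange(2 * sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# Variants shifted by a few semitones; the range is a design choice.
augmented = [librosa.effects.pitch_shift(y, sr=sr, n_steps=n) for n in (-2, -1, 1, 2)]
```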
Singing voice identification and lyrics transcription for music information retrieval (invited paper)
  • A. Mesaros
  • Computer Science
    2013 7th Conference on Speech Technology and Human - Computer Dialogue (SpeD)
  • 2013
TLDR
The results show that classification of singing voices can be done robustly in polyphonic music when using source separation, and a system for automatic alignment of lyrics and audio is presented, with sufficient performance for facilitating applications such as automatic karaoke annotation or song browsing.
MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research
TLDR
MedleyDB, a dataset of annotated, royalty-free multitrack recordings, is shown to be considerably more challenging than the current test sets used in the MIREX evaluation campaign, thus opening new research avenues in melody extraction.
Using Voice Segments to Improve Artist Classification of Music
TLDR
It is shown that, for a small set of pop and rock songs, automatically located singing segments form a more reliable basis for classification than using the entire album, suggesting that the singer's voice is more stable across performances, compositions, and audio-engineering transformations than the instrumental background.
Timbre and Melody Features for the Recognition of Vocal Activity and Instrumental Solos in Polyphonic Music
TLDR
This paper proposes the task of detecting instrumental solos in polyphonic music recordings and the use of a set of four audio features for vocal and instrumental activity detection with a support vector machine hidden Markov model.
Singing voice detection with deep recurrent neural networks
TLDR
A new method for singing voice detection is presented, based on a Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN) that takes past and future temporal context into account to decide on the presence or absence of singing voice.