Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System

  • Gerardo Roa Dabike, J. Barker
Automatic sung speech recognition is a relatively understudied topic that has been held back by a lack of large and freely available datasets. This has recently changed thanks to the release of the DAMP Sing! dataset, a 1100-hour karaoke dataset originating from the social music-making company Smule. This paper presents work undertaken to define an easily replicable automatic speech recognition benchmark for this data. In particular, we describe how transcripts and alignments have been…
MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription
This paper proposes MSTRE-Net, a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture that processes temporal information using multiple parallel streams with varying resolutions. This keeps the network more compact, giving faster inference and a better recognition rate than identical TDNN streams.
Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention
This paper trains convolutional time-delay neural networks with self-attention on monophonic karaoke recordings, using a sequence classification objective to build the acoustic model. It achieves a notable improvement over the state of the art in ALT and provides a new baseline for the task.
PDAugment: Data Augmentation by Pitch and Duration Adjustments for Automatic Lyrics Transcription
  • Chen Zhang, Jiaxing Yu, +4 authors Kejun Zhang
  • Computer Science, Engineering
  • ArXiv
  • 2021
Automatic lyrics transcription (ALT), which can be regarded as automatic speech recognition (ASR) on singing voice, is an interesting and practical topic in academia and industry. ALT has not been…
Computational Pronunciation Analysis in Sung Utterances
A novel computational analysis of the pronunciation variances in sung utterances is applied, and a new pronunciation model adapted for singing is proposed, which performs better than the standard speech dictionary in all settings.
The use of Voice Source Features for Sung Speech Recognition
  • Gerardo Roa Dabike, J. Barker
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to improved discrimination between voiced/unvoiced phoneme pairs.
Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help?
This work compares several automatic speech recognition pipelines for the application of lyrics transcription, and presents the lyrics alignment and transcription performance of music-informed acoustic models for the best-performing pipeline.
Lyrics Information Processing: Analysis, Generation, and Applications
In this paper we propose lyrics information processing (LIP) as a research field for technologies focusing on lyrics text, which has both linguistic and musical characteristics. This field could…


Bootstrapping a System for Phoneme Recognition and Keyword Spotting in Unaccompanied Singing
This paper uses the DAMP data set, which contains a large number of good-quality recordings of amateur singing, to solve the problem of speech recognition in singing using an acoustic model trained on speech.
Transcribing Lyrics from Commercial Song Audio: the First Step Towards Singing Content Processing
This paper collects music-removed versions of English songs directly from commercial singing content and reports an initial attempt at recognizing lyrics from song audio.
The CAPIO 2017 Conversational Speech Recognition System
This paper shows how state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set is achieved, and proposes an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version.
Speech analysis of sung-speech and lyric recognition in monophonic singing
Japanese lyric recognition in monophonic singing that contains no musical instruments is considered, and a remarkable improvement in lyric recognition is obtained in comparison with the baseline system for spontaneous speech recognition.
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
TED-LIUM: an Automatic Speech Recognition dedicated corpus
This paper describes the content of the corpus, how the data was collected and processed, how it will be made publicly available, and how an ASR system was built using this data, leading to a WER score of 17.4%.
The Kaldi Speech Recognition Toolkit
This paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.
The Design for the Wall Street Journal-based CSR Corpus
This paper presents the motivating goals, acoustic data design, text processing steps, lexicons, and testing paradigms incorporated into the multi-faceted WSJ CSR Corpus, a corpus containing significant quantities of both speech data and text data.
Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI
A method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training is described, using the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI.
Recognition of phonemes and words in singing
This paper studies the influence of n-gram language models on the recognition of sung phonemes and words, and the use of the recognition results in a query-by-singing application.