Corpus ID: 236924386

MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription

@inproceedings{Demirel2021MSTRENetMA,
  title={MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription},
  author={Emir Demirel and Sven Ahlb{\"a}ck and Simon Dixon},
  booktitle={International Society for Music Information Retrieval Conference},
  year={2021}
}
This paper makes several contributions to automatic lyrics transcription (ALT) research. Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net, which processes temporal information in multiple parallel streams with varying resolutions. This keeps the network more compact, yielding faster inference and a better recognition rate than using identical TDNN streams. In addition, two novel preprocessing steps…
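To make the multistreaming idea concrete, below is a minimal sketch in PyTorch of parallel TDNN streams operating at different temporal resolutions (realized here as different dilation rates), whose outputs are concatenated and projected to per-frame logits. The stream count, dilation rates, layer depth, and channel widths are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TDNNStream(nn.Module):
    """One TDNN stream: stacked 1-D convolutions over time at a fixed dilation."""

    def __init__(self, in_dim: int, hidden_dim: int, dilation: int, depth: int = 3):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [
                nn.Conv1d(dim, hidden_dim, kernel_size=3,
                          dilation=dilation, padding=dilation),
                nn.ReLU(),
                nn.BatchNorm1d(hidden_dim),
            ]
            dim = hidden_dim
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # (batch, hidden_dim, time)


class MultistreamTDNN(nn.Module):
    """Parallel TDNN streams at different temporal resolutions, concatenated
    and projected to per-frame output units (e.g., characters or senones)."""

    def __init__(self, in_dim=80, hidden_dim=256, dilations=(1, 2, 3), num_outputs=512):
        super().__init__()
        self.streams = nn.ModuleList(
            [TDNNStream(in_dim, hidden_dim, d) for d in dilations]
        )
        self.project = nn.Conv1d(hidden_dim * len(dilations), num_outputs, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time), e.g., log-mel filterbank frames
        merged = torch.cat([s(feats) for s in self.streams], dim=1)
        return self.project(merged)


# Example: per-frame logits for 4 clips of 200 frames of 80-dim features.
model = MultistreamTDNN()
logits = model(torch.randn(4, 80, 200))  # -> shape (4, 512, 200)
```

Because each stream can be narrower than a single monolithic TDNN covering all resolutions, the parallel design can reduce the parameter count while still widening the effective temporal context, which is consistent with the compactness and inference-speed claims above.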

MM-ALT: A Multimodal Automatic Lyric Transcription System

Experiments show the effectiveness of the proposed MM-ALT system, especially in terms of noise robustness, and of the Residual Cross Attention (RCA) mechanism used to fuse data from three modalities (audio, video, and IMU).

Towards Transfer Learning of wav2vec 2.0 for Automatic Lyric Transcription

This work proposes a transfer-learning-based ALT solution that takes advantage of the similarities between speech and singing by adapting wav2vec 2.0, an SSL ASR model, to the singing domain and enhances the performance by extending the original CTC model to a hybrid CTC/attention model.
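As a rough illustration of this transfer-learning recipe, the sketch below fine-tunes a pretrained wav2vec 2.0 checkpoint with its CTC head on (singing audio, lyrics) pairs, assuming the HuggingFace `transformers` library. The checkpoint name is an illustrative choice, and the paper's hybrid CTC/attention extension is omitted here.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative checkpoint; the paper adapts a pretrained wav2vec 2.0 ASR model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)


def ctc_finetune_step(waveform: torch.Tensor, lyrics: str) -> float:
    """One CTC fine-tuning step on a (singing audio, lyrics) pair at 16 kHz."""
    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    # The base-960h vocabulary uses uppercase characters.
    labels = processor.tokenizer(lyrics.upper(), return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss  # CTC loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```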

PoLyScribers: Joint Training of Vocal Extractor and Lyrics Transcriber for Polyphonic Music

A novel end-to-end joint-training framework, called PoLyScribers, jointly optimizes the vocal extractor front-end and lyrics transcriber back-end for lyrics transcription in polyphonic music, achieving substantial improvements over existing approaches on publicly available test datasets.

PoLyScriber: Integrated Training of Extractor and Lyrics Transcriber for Polyphonic Music

This work proposes a novel end-to-end integrated training framework, called PoLyScriber, to globally optimize the vocal extractor front-end and lyrics transcriber back-end for lyrics transcription in polyphonic music.

Transcription of Mandarin singing based on human-computer interaction

A dataset for Mandarin lyrics transcription is provided, and a transcription model built on it addresses some deficiencies of existing models and achieves promising results on the dataset.

Self-Transriber: Few-shot Lyrics Transcription with Self-training

This work proposes the first semi-supervised lyrics transcription paradigm, Self-Transcriber, which leverages unlabeled data through self-training with noisy-student augmentation; it closes the gap between supervised and semi-supervised learning and opens the door to few-shot learning of lyrics transcription.
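The self-training recipe described above can be summarized schematically as follows; `teacher`, `student`, `augment`, `transcribe`, and `loss` are hypothetical placeholders standing in for the paper's actual components, and the confidence threshold is an illustrative choice.

```python
import torch


def self_training_round(teacher, student, labeled_pairs, unlabeled_audio,
                        augment, optimizer, confidence_threshold=0.9):
    """One round of self-training: the teacher pseudo-labels unlabeled singing,
    low-confidence hypotheses are discarded, and the 'noisy student' is trained
    on labeled plus pseudo-labeled data with input augmentation."""
    # 1) Pseudo-label the unlabeled pool with the frozen teacher.
    pseudo_pairs = []
    teacher.eval()
    with torch.no_grad():
        for audio in unlabeled_audio:
            text, confidence = teacher.transcribe(audio)  # placeholder API
            if confidence >= confidence_threshold:
                pseudo_pairs.append((audio, text))

    # 2) Train the student on the union, applying augmentation (the "noise").
    student.train()
    for audio, text in list(labeled_pairs) + pseudo_pairs:
        loss = student.loss(augment(audio), text)  # placeholder API
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return student
```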

VocalFlows: A co-creative AI to suggest vocal flows (2022)

References

Automatic Lyric Transcription from Karaoke Vocal Tracks: Resources and a Baseline System

Describes how transcripts and alignments were recovered from karaoke prompts and timings; how suitable training, development, and test sets were defined with varying degrees of accent variability; and how language models were developed using lyric data from the LyricWikia website.

DALI: A Large Dataset of Synchronized Audio, Lyrics and notes, Automatically Created using Teacher-student Machine Learning Paradigm

DALI is introduced, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity, and it is shown that this allows progressive improvement of singing voice detection (SVD) performance and better audio matching and alignment.

Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention

This paper trains convolutional time-delay neural networks with self-attention on monophonic karaoke recordings, using a sequence-classification objective to build the acoustic model; the approach achieves a notable improvement over the state of the art in ALT and provides a new baseline for the task.

Multilingual lyrics-to-audio alignment

This investigation presents the first (to the best of the authors' knowledge) attempt to create a language-independent lyrics-to-audio alignment system, based on a Recurrent Neural Network model trained with a Connectionist Temporal Classification algorithm.

Low Resource Audio-To-Lyrics Alignment from Polyphonic Music Recordings

This study presents a novel method that performs audio-to-lyrics alignment with a low memory footprint regardless of the duration of the music recording, and uses the lyrics alignment system to segment the recordings into sentence-level chunks.

Automatic Lyrics Transcription in Polyphonic Music: Does Background Music Help?

This work proposes to learn music genre-specific characteristics to train polyphonic acoustic models, explicitly modeling the characteristics of the music instead of trying to remove the background music as noise.

Semi-supervised Lyrics and Solo-singing Alignment

A large-scale corpus of solo singing aligned with lyrics can be derived with the proposed method, which will be beneficial for music and singing-voice-related research.

ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition

In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures: a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling.

Integrating Additional Chord Information Into HMM-Based Lyrics-to-Audio Alignment

The new accompaniment interface Song Prompter is introduced, which uses the automatically aligned lyrics to guide musicians through a song and demonstrates that the automatic alignment is accurate enough to be used in a musical performance.

Computational Pronunciation Analysis in Sung Utterances

A novel computational analysis of pronunciation variance in sung utterances is applied, and a new pronunciation model adapted for singing is proposed, which performs better than the standard speech dictionary in all settings.