Modeling Beats and Downbeats with a Time-Frequency Transformer

@article{Hung2022ModelingBA,
  title={Modeling Beats and Downbeats with a Time-Frequency Transformer},
  author={Yun-Ning Hung and Ju-Chiang Wang and Xuchen Song and Weiyi Lu and Minz Won},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022},
  pages={401-405}
}
  • Yun-Ning Hung, Ju-Chiang Wang, Xuchen Song, Weiyi Lu, Minz Won
  • Published 23 May 2022
  • Computer Science
Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model…
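
The truncated abstract names the core mechanism: per-frame attention across frequency, then attention across time over per-frame summaries. A minimal PyTorch sketch of that idea (illustrative assumptions throughout; this is not the authors' released model) could look like:

```python
import torch
import torch.nn as nn

class SpecTNTBlock(nn.Module):
    """One spectral-temporal block, a minimal sketch of the SpecTNT idea
    (module sizes and details are assumptions, not the paper's model)."""

    def __init__(self, d_model=96, n_heads=4):
        super().__init__()
        # learnable "frequency class token" (FCT) summarizing each frame's spectrum
        self.fct = nn.Parameter(torch.zeros(1, 1, 1, d_model))
        self.spectral = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 1)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 1)

    def forward(self, x):                      # x: (B, T, F, D) time-frequency embedding
        B, T, F, D = x.shape
        tokens = torch.cat([self.fct.expand(B, T, 1, D), x], dim=2)
        # spectral Transformer: attention across the frequency axis of each frame
        tokens = self.spectral(tokens.reshape(B * T, F + 1, D)).reshape(B, T, F + 1, D)
        # temporal Transformer: attention across time, on the FCT summaries only
        summary = self.temporal(tokens[:, :, 0])          # (B, T, D)
        return tokens[:, :, 1:], summary

# per-frame logits for e.g. non-beat / beat / downbeat (a common output choice)
block, head = SpecTNTBlock(), nn.Linear(96, 3)
emb, summary = block(torch.randn(2, 128, 64, 96))        # batch 2, 128 frames, 64 bins
logits = head(summary)                                    # (2, 128, 3)
```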

Citations

Beat Transformer: Demixed Beat and Downbeat Tracking with Dilated Self-Attention

This work proposes Beat Transformer, a novel Transformer encoder architecture for joint beat and downbeat tracking that adopts a dilated self-attention mechanism, achieving powerful hierarchical modelling with only linear complexity.
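
The dilation mechanism is only named in this summary; one common way to realize dilated self-attention with reduced cost is to attend within interleaved, strided subsequences. A hedged sketch (the grouping scheme here is an assumption, not necessarily Beat Transformer's exact formulation):

```python
import torch
import torch.nn as nn

class DilatedSelfAttention(nn.Module):
    """Self-attention restricted to strided groups of positions (illustrative).
    Each position attends to T/d others, so cost is O(T * T/d) instead of
    O(T^2); holding the subsequence length fixed (d growing with T) makes it
    linear in T. Stacking layers with different dilations builds a hierarchy
    of receptive fields."""

    def __init__(self, d_model=128, n_heads=4, dilation=4):
        super().__init__()
        self.dilation = dilation
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                          # x: (B, T, D)
        B, T, D = x.shape
        d = self.dilation
        assert T % d == 0
        # (B, T, D) -> (B*d, T//d, D): d interleaved subsequences {k, k+d, k+2d, ...}
        xs = x.reshape(B, T // d, d, D).transpose(1, 2).reshape(B * d, T // d, D)
        ys, _ = self.attn(xs, xs, xs)              # attention within each subsequence
        # undo the grouping
        return ys.reshape(B, d, T // d, D).transpose(1, 2).reshape(B, T, D)
```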

To catch a chorus, verse, intro, or anything else: Analyzing a song with structural functions

This paper introduces a multi-task deep learning framework that models structural semantic labels directly from audio by estimating "verseness," "chorusness," and so forth as functions of time.
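
As a rough illustration of frame-wise "structural functions" (the label set and layer sizes below are assumptions, not the paper's model), a multi-label head can emit one activation curve per section type:

```python
import torch
import torch.nn as nn

SECTION_LABELS = ["intro", "verse", "chorus", "bridge", "outro"]  # assumed label set

class StructureFunctionHead(nn.Module):
    """Frame-wise multi-label head: one independent sigmoid curve per
    structural function (sketch, not the paper's architecture)."""

    def __init__(self, d_in=128, n_labels=len(SECTION_LABELS)):
        super().__init__()
        self.rnn = nn.GRU(d_in, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_labels)

    def forward(self, feats):                 # feats: (B, T, d_in) frame embeddings
        h, _ = self.rnn(feats)
        return torch.sigmoid(self.out(h))     # (B, T, n_labels): e.g. "chorusness" over time
```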

An Analysis Method for Metric-Level Switching in Beat Tracking

This letter proposes a new performance analysis method, called annotation coverage ratio (ACR), that accounts for a variety of possible metric-level switching behaviors of beat trackers; it shows the usefulness of ACR when used alongside existing metrics and discusses the new insights that can be gained.
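
The summary does not spell out how ACR is computed, so the following is only a loose illustration of the underlying issue: scoring an estimate against several metrical re-interpretations of the annotation exposes a tracker that "switches levels". It uses mir_eval's standard beat F-measure:

```python
import numpy as np
import mir_eval

def metrical_variants(beats):
    """Common metrical re-interpretations of a beat annotation (illustrative;
    this is NOT the ACR definition, which the summary above does not give)."""
    half = beats[::2]                          # half tempo, on-beats
    half_off = beats[1::2]                     # half tempo, off-beats
    double = np.sort(np.concatenate([beats, (beats[:-1] + beats[1:]) / 2]))
    return {"original": beats, "half": half, "half_off": half_off, "double": double}

def best_level_f_measure(reference, estimate):
    """Score an estimate against each metrical variant of the reference and
    report the best-matching level, a crude probe for level switching."""
    scores = {name: mir_eval.beat.f_measure(ref, estimate)
              for name, ref in metrical_variants(reference).items()}
    best = max(scores, key=scores.get)
    return best, scores
```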

Jointist: Joint Learning for Multi-instrument Transcription and Its Applications

Jointist is an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip; the symbolic representation provided by the transcription model proves helpful, alongside spectrograms, for downbeat detection, chord recognition, and key estimation.
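
The fusion claim can be made concrete with a small sketch (a hypothetical module, not Jointist's code): frame-aligned symbolic features from a transcription model are concatenated with the spectrogram before a downbeat head:

```python
import torch
import torch.nn as nn

class SymbolicSpectralFusion(nn.Module):
    """Concatenate a frame-aligned piano-roll with spectrogram frames and feed
    the fused features to a downbeat head (illustrative sketch)."""

    def __init__(self, n_bins=128, n_pitches=128, d=64):
        super().__init__()
        self.proj = nn.Linear(n_bins + n_pitches, d)
        self.rnn = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d, 1)

    def forward(self, spec, pianoroll):        # both (B, T, .), frame-aligned
        x = self.proj(torch.cat([spec, pianoroll], dim=-1))
        h, _ = self.rnn(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (B, T) downbeat activation
```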

Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

Jointist is introduced: an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip, which achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3.

References

Showing 1-10 of 38 references

SpecTNT: a Time-Frequency Transformer for Music Audio

A novel variant of the Transformer-in-Transformer (TNT) architecture is proposed to model both spectral and temporal sequences of an input time-frequency representation; it demonstrates state-of-the-art performance in music tagging and vocal melody extraction, and competitive performance in chord recognition.

Analysis of Common Design Choices in Deep Learning Systems for Downbeat Tracking

A systematic investigation of the impact of widely adopted variants of convolutional-recurrent networks on downbeat tracking; it finds that temporal granularity has a significant impact on performance.

Robust Downbeat Tracking Using an Ensemble of Convolutional Networks

A novel state-of-the-art system for automatic downbeat tracking from music signals that takes advantage of the assumed metrical continuity of a song, with a significant increase in performance compared to the second-best system.

Joint Estimation of Chords and Downbeats From an Audio Signal

The results show that the downbeat positions of a music piece can be estimated from its harmonic structure and that, conversely, chord progression estimation benefits from considering the interaction between the metric and the harmonic structures.

A Bi-Directional Transformer for Musical Chord Recognition

The proposed bi-directional Transformer for chord recognition is able to divide chord segments by utilizing the adaptive receptive field of the attention mechanism, and the model is observed to effectively capture long-term dependencies, making use of essential information regardless of distance.

A Music Structure Informed Downbeat Tracking System Using Skip-chain Conditional Random Fields and Deep Learning

This work introduces a skip-chain conditional random field language model for downbeat tracking designed to include section information in a unified and flexible framework, and shows that incorporating structure information in the language model leads to more consistent and more robust downbeat estimations.

Data-Driven Harmonic Filters for Audio Representation Learning

Experimental results show that a simple convolutional neural network back-end with the proposed front-end outperforms state-of-the-art baseline methods in automatic music tagging, keyword spotting, and sound event tagging tasks.
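
A hedged sketch of a harmonic filterbank front-end (the filter shapes, initialization, and learnable bandwidth below are assumptions, not the paper's exact design): triangular band-pass filters are evaluated at integer multiples of each center frequency and the harmonics are stacked as channels for a downstream CNN:

```python
import math
import torch
import torch.nn as nn

class HarmonicFrontEnd(nn.Module):
    """Harmonic filterbank front-end (illustrative sketch)."""

    def __init__(self, n_fft=1024, sr=16000, n_filters=64, n_harmonics=6,
                 fmin=32.7, fmax=8000.0):
        super().__init__()
        self.register_buffer("fft_freqs", torch.linspace(0, sr / 2, n_fft // 2 + 1))
        # log-spaced center frequencies; bandwidth scale alpha is learned from data
        self.register_buffer("centers", torch.logspace(
            math.log10(fmin), math.log10(fmax), n_filters))
        self.alpha = nn.Parameter(torch.tensor(0.1))
        self.n_harmonics = n_harmonics

    def forward(self, spec):                   # spec: (B, n_bins, T) magnitude STFT
        banks = []
        for h in range(1, self.n_harmonics + 1):
            fc = self.centers * h              # (n_filters,) h-th harmonic centers
            bw = self.alpha * fc + 1e-3        # bandwidth grows with frequency
            # triangular response of each filter over the FFT bins
            resp = (1 - (self.fft_freqs[None, :] - fc[:, None]).abs()
                    / bw[:, None]).clamp(min=0)            # (n_filters, n_bins)
            banks.append(torch.einsum("fb,nbt->nft", resp, spec))
        return torch.stack(banks, dim=1)       # (B, n_harmonics, n_filters, T)
```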

Joint Beat and Downbeat Tracking with Recurrent Neural Networks

A recurrent neural network operating directly on magnitude spectrograms is used to model the metrical structure of the audio signals at multiple levels and provides an output feature that clearly distinguishes between beats and downbeats.
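
A compact sketch of this joint formulation (layer sizes are assumptions): a bidirectional LSTM over spectrogram frames with three mutually exclusive classes, so a single output separates downbeats from other beats:

```python
import torch
import torch.nn as nn

class JointBeatRNN(nn.Module):
    """Bidirectional LSTM frame classifier (sketch; sizes are assumptions).
    Three mutually exclusive classes let one output distinguish downbeats
    from other beats: 0 = no beat, 1 = beat, 2 = downbeat."""

    def __init__(self, n_bins=81, hidden=25):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 3)

    def forward(self, spec):                   # spec: (B, T, n_bins) log-magnitude frames
        h, _ = self.lstm(spec)
        return self.out(h)                     # (B, T, 3) frame logits; argmax or a
                                               # probabilistic post-processor picks times
```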

Harmony Transformer: Incorporating Chord Segmentation into Harmony Recognition

The Harmony Transformer is proposed, a multi-task music harmony analysis model aiming to improve chord recognition through incorporating chord segmentation into the recognition process using end-to-end sequence learning.

Deconstruct, Analyse, Reconstruct: How to improve Tempo, Beat, and Downbeat Estimation

A novel multi-task approach for the simultaneous estimation of tempo, beat, and downbeat is devised, seeking to embed more explicit musical knowledge into the design of the network.
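
The multi-task idea can be sketched as a shared trunk with one head per target (the trunk and head shapes below are assumptions, not the paper's network):

```python
import torch
import torch.nn as nn

class MultiTaskTempoBeatNet(nn.Module):
    """Shared convolutional trunk with three task heads (illustrative sketch).
    Beat and downbeat heads emit frame-wise activations; the tempo head pools
    over time and classifies into discrete BPM bins."""

    def __init__(self, n_bins=81, n_tempo_bins=300):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv1d(n_bins, 64, kernel_size=5, padding=2), nn.ELU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ELU())
        self.beat = nn.Conv1d(64, 1, kernel_size=1)
        self.downbeat = nn.Conv1d(64, 1, kernel_size=1)
        self.tempo = nn.Linear(64, n_tempo_bins)

    def forward(self, spec):                   # spec: (B, n_bins, T)
        h = self.trunk(spec)                   # (B, 64, T) shared representation
        beat = torch.sigmoid(self.beat(h)).squeeze(1)        # (B, T)
        downbeat = torch.sigmoid(self.downbeat(h)).squeeze(1)
        tempo = self.tempo(h.mean(dim=2))      # (B, n_tempo_bins) global estimate
        return beat, downbeat, tempo
```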