Corpus ID: 250699296

Differentiable Time-Frequency Scattering on GPU

Authors: John Muradeli, Cyrus Vahidi, Changhong Wang, Han Han, Vincent Lostanlen, Mathieu Lagrange, Georgy Fazekas
Joint time–frequency scattering (JTFS) is a convolutional operator in the time–frequency domain which extracts spectrotemporal modulations at various rates and scales. It offers an idealized model of spectrotemporal receptive fields (STRF) in the primary auditory cortex, and thus may serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet, prior implementations of JTFS and STRF have remained outside of the standard toolkit of…

Learnable Front Ends Based on Temporal Modulation for Music Tagging

Experimental results show that the proposed front ends surpass state-of-the-art (SOTA) methods on the MagnaTagATune dataset in automatic music tagging, and they are also helpful for keyword spotting on speech commands.



Learning metrics on spectrotemporal modulations reveals the perception of musical instrument timbre.

A broad overview of prior studies on musical timbre is provided to identify its relevant acoustic substrates according to biologically inspired models; the analysis finds that timbre has both generic and experiment-specific acoustic correlates.

Multiresolution spectrotemporal analysis of complex sounds.

A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in early and central stages of the auditory system. The model provides a

Kymatio: Scattering Transforms in Python

The Kymatio software package is presented, an easy-to-use, high-performance Python implementation of the scattering transform in 1D, 2D, and 3D that is compatible with modern deep learning frameworks.

Extended playing techniques: the next milestone in musical instrument recognition

This work identifies and discusses three necessary conditions for significantly outperforming the traditional mel-frequency cepstral coefficient (MFCC) baseline: the addition of second-order scattering coefficients to account for amplitude modulation, the incorporation of long-range temporal dependencies, and metric learning using large-margin nearest neighbors (LMNN) to reduce intra-class variability.

Joint Time–Frequency Scattering

The joint time–frequency scattering transform is introduced, a time-shift invariant representation that characterizes the multiscale energy distribution of a signal in time and frequency that may be implemented as a deep convolutional neural network whose filters are not learned but calculated from wavelets.

Joint Scattering for Automatic Chick Call Recognition

An automatic system for chick call recognition using the joint time–frequency scattering (JTFS) transform improves the frame- and event-based macro F-measures by 10.2% and 11.7%, respectively, over a mel-frequency cepstral coefficient (MFCC) baseline.

Parametric Scattering Networks

Focusing on Morlet wavelets, it is proposed to learn the scales, orientations, and aspect ratios of the filters to produce problem-specific parameterizations of the scattering transform, and it is shown that learned versions of this scattering transform yield significant performance gains in small-sample classification settings over the standard scattering transform.

Time–frequency scattering accurately models auditory similarities between instrumental playing techniques

A machine listening model is presented that relies on joint time–frequency scattering to extract spectrotemporal modulations as acoustic features and minimizes triplet loss in the cluster graph by means of the large-margin nearest neighbor (LMNN) metric learning algorithm.

Playing Technique Recognition by Joint Time–Frequency Scattering

A recognition system based on the joint time–frequency scattering transform (jTFST) for pitch evolution-based playing techniques (PETs), a group of playing techniques with monotonic pitch changes over time, is proposed.

nnAudio: An on-the-Fly GPU Audio to Spectrogram Conversion Toolbox Using 1D Convolutional Neural Networks

A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time domain to frequency domain conversion, which allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on the disk.
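
The idea of performing time-domain to frequency-domain conversion with 1D convolutions can be illustrated with a minimal NumPy sketch. This is not nnAudio's API: the function names `fourier_kernels` and `conv_spectrogram`, the Hann window, and the parameter values are assumptions chosen for illustration. Each frequency bin is treated as a fixed convolution kernel (a windowed cosine or sine), and the strided correlation of the signal with those kernels gives the real and imaginary parts of a short-time Fourier transform.

```python
import numpy as np

def fourier_kernels(n_fft):
    """Hann-windowed cosine and sine kernels for bins 0..n_fft//2 (hypothetical helper)."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = np.hanning(n_fft)
    phase = 2.0 * np.pi * k * n / n_fft
    # Each row is one fixed 1D "convolution filter" of length n_fft.
    return np.cos(phase) * window, -np.sin(phase) * window

def conv_spectrogram(x, n_fft=64, hop=16):
    """Magnitude spectrogram via strided correlation of x with the Fourier kernels."""
    cos_k, sin_k = fourier_kernels(n_fft)
    # Sliding windows with stride `hop` play the role of a strided 1D convolution.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    real = frames @ cos_k.T
    imag = frames @ sin_k.T
    return np.sqrt(real**2 + imag**2).T  # shape: (n_bins, n_frames)

# Usage: a 128 Hz sine sampled at 1024 Hz for one second.
sr = 1024
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 128 * t)
spec = conv_spectrogram(x)
peak_bin = int(spec.mean(axis=1).argmax())
```

With a 1024 Hz sampling rate and `n_fft=64`, the bins are 16 Hz apart, so the 128 Hz tone peaks at bin 8. In a deep learning framework the same kernels could be loaded as the weights of a 1D convolution layer, which is what makes on-the-fly GPU extraction possible.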