A Multi-Head Relevance Weighting Framework for Learning Raw Waveform Audio Representations

Debottam Dutta, Purvi Agrawal, Sriram Ganapathy
2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

In this work, we propose a multi-head relevance weighting framework to learn audio representations from raw waveforms. The audio waveform, split into windows of short duration, is processed with a 1-D convolutional layer of cosine-modulated Gaussian filters acting as a learnable filterbank. The key novelty of the proposed framework is the introduction of multi-head relevance on the learnt filterbank representations. Each head of the relevance network is modelled as a separate sub-network…
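The filterbank step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the kernel form (a cosine carrier multiplied by a Gaussian envelope), the function names `cos_gauss_filterbank` and `apply_filterbank`, and every parameter value are assumptions chosen for illustration. In a trained model the centre frequencies would be learnable parameters rather than fixed inputs.

```python
import numpy as np

def cos_gauss_filterbank(center_hz, sigma_samples, kernel_len, sample_rate):
    """Build cosine-modulated Gaussian kernels, one row per filter.

    Assumed kernel form: cos(2*pi*f_c*n/sr) * exp(-0.5 * (n/sigma)^2).
    """
    n = np.arange(kernel_len) - (kernel_len - 1) / 2.0  # centred time index
    center = np.asarray(center_hz, dtype=float)[:, None]
    sigma = np.asarray(sigma_samples, dtype=float)[:, None]
    carrier = np.cos(2.0 * np.pi * center * n / sample_rate)
    envelope = np.exp(-0.5 * (n / sigma) ** 2)
    return carrier * envelope

def apply_filterbank(window, kernels):
    """Convolve one short waveform window with every kernel (same length out)."""
    return np.stack([np.convolve(window, k, mode="same") for k in kernels])

# Hypothetical setup: two filters centred at 500 Hz and 2000 Hz.
sample_rate = 16000
kernels = cos_gauss_filterbank([500.0, 2000.0], [20.0, 20.0], 129, sample_rate)

# A 25 ms window containing a 500 Hz tone.
t = np.arange(400) / sample_rate
tone = np.sin(2.0 * np.pi * 500.0 * t)

out = apply_filterbank(tone, kernels)
energy = (out ** 2).sum(axis=1)  # per-filter output energy
```

With this toy input, the filter centred at 500 Hz carries most of the output energy, which is the band-pass behaviour a filterbank of this kind is meant to provide.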


Svadhyaya system for the Second Diagnosing COVID-19 using Acoustics Challenge 2021

This report describes the system used for detecting COVID-19 positives using three different acoustic modalities, namely speech, breathing, and cough in the second DiCOVA challenge. The proposed



Interpretable Representation Learning for Speech and Audio Signals Based on Relevance Weighting

A relevance weighting scheme is proposed that allows the interpretation of the speech representations during the forward propagation of the model itself and improves the performance significantly in speech recognition and sound classification experiments.

Unsupervised Raw Waveform Representation Learning for ASR

The learned representations (second-layer outputs) from the proposed framework are employed in a speech recognition task, where they provide significant improvements in ASR results over the baseline filterbank features and other robust front-ends.

Speech acoustic modeling from raw multichannel waveforms

A combined convolutional neural network and deep neural network (CNN-DNN) acoustic model that takes raw multichannel waveforms as input, learns a similar feature representation through supervised training, and outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.

Acoustic modeling with deep neural networks using raw time signal for LVCSR

Inspired by the multi-resolutional analysis layer learned automatically from raw time signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP and Gammatone features.

Speaker Recognition from Raw Waveform with SincNet

This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
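As a rough illustration of a band-pass filter parametrized by sinc functions, the sketch below builds one kernel from two cutoff frequencies as the difference of two windowed low-pass sinc filters. This is a standard textbook construction, not code from the SincNet paper; the function name `sinc_bandpass`, the Hamming window, and all numeric values are assumptions made for this example. In a learnable front-end the two cutoffs would be the trained parameters.

```python
import numpy as np

def sinc_bandpass(low_hz, high_hz, kernel_len, sample_rate):
    """Band-pass kernel as the difference of two low-pass sinc filters."""
    n = np.arange(kernel_len) - (kernel_len - 1) / 2.0  # centred time index
    f1 = low_hz / sample_rate   # normalised low cutoff (cycles per sample)
    f2 = high_hz / sample_rate  # normalised high cutoff (cycles per sample)
    # np.sinc(x) is sin(pi*x) / (pi*x), so each term is an ideal low-pass filter.
    ideal = 2.0 * f2 * np.sinc(2.0 * f2 * n) - 2.0 * f1 * np.sinc(2.0 * f1 * n)
    # A Hamming window smooths the truncation of the infinite ideal response.
    return ideal * np.hamming(kernel_len)

# Hypothetical filter passing roughly 300 Hz to 1000 Hz at 16 kHz sampling.
sample_rate = 16000
kernel = sinc_bandpass(300.0, 1000.0, 257, sample_rate)

# Inspect the magnitude response inside and outside the pass band.
spectrum = np.abs(np.fft.rfft(kernel, 4096))
freqs = np.fft.rfftfreq(4096, d=1.0 / sample_rate)
passband_gain = spectrum[np.argmin(np.abs(freqs - 650.0))]
stopband_gain = spectrum[np.argmin(np.abs(freqs - 4000.0))]
```

Only the two cutoff frequencies define each filter here, which is why a layer built this way needs far fewer parameters than a convolution whose every tap is free.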

Acoustic Scene Classification Using Deep Residual Networks with Late Fusion of Separated High and Low Frequency Paths

  • M. McDonnell, Wei Gao
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
The performance of the models is significantly enhanced by the use of log-mel deltas, and overall the approach is capable of training strong single models, without use of any supplementary data from outside the official challenge dataset, with excellent generalization to unknown devices.

LEAF: A Learnable Frontend for Audio Classification

This work introduces a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks, and outperforms the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.

Modulation Filter Learning Using Deep Variational Networks for Robust Speech Recognition

The proposed modulation filter learning framework shows significant improvements over the baseline features as well as various other noise robust front-ends and is shown to be of considerable benefit for semi-supervised automatic speech recognition applications.

wav2vec: Unsupervised Pre-training for Speech Recognition

wav2vec is trained on large amounts of unlabeled audio data, and the resulting representations are then used to improve acoustic model training. The approach outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.