Exploring Filterbank Learning for Keyword Spotting

@inproceedings{LopezEspejo2020Filterbank,
  title={Exploring Filterbank Learning for Keyword Spotting},
  author={Iv{\'a}n L{\'o}pez-Espejo and Z. Tan and Jesper H{\o}jvang Jensen},
  booktitle={2020 28th European Signal Processing Conference (EUSIPCO)},
  year={2020}
}
Despite their great performance over the years, handcrafted speech features are not necessarily optimal for any particular speech application. Consequently, with greater or lesser success, optimal filterbank learning has been studied for different speech processing tasks. In this paper, we fill in a gap by exploring filterbank learning for keyword spotting (KWS). Two approaches are examined: filterbank matrix learning in the power spectral domain and parameter learning of a psychoacoustically… 
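One of the two approaches mentioned, filterbank matrix learning in the power spectral domain, amounts to replacing the fixed mel filterbank matrix with a trainable weight matrix applied to power spectra. A minimal sketch of such a matrix and its mel initialization is shown below; the function names and default parameters are illustrative, not taken from the paper.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular mel filterbank matrix of shape (n_filters, n_fft//2 + 1).

    In filterbank learning, a matrix like this typically serves as the
    *initialization* of a trainable layer applied to power spectra.
    """
    n_bins = n_fft // 2 + 1
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising slope of the triangle
            fb[i - 1, k] = (k - l) / (c - l)
        for k in range(c, r):          # falling slope of the triangle
            fb[i - 1, k] = (r - k) / (r - c)
    return fb

# Apply the (initially mel) matrix to one power-spectrum frame;
# the frame here is random, purely for illustration.
power_frame = np.abs(np.random.randn(257)) ** 2
feats = np.log(mel_filterbank() @ power_frame + 1e-8)  # log filterbank energies
```

In a learned-filterbank setup, the matrix returned above would become a weight tensor of the network and be updated by backpropagation together with the KWS classifier.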
Deep Spoken Keyword Spotting: An Overview
The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.


A deep neural network integrated with filterbank learning for speech recognition
A trainable filterbank layer is incorporated at the bottom of a DNN; experimental results show that its frame-level transformation constrains flexibility and promotes learning efficiency in acoustic modeling.
Exploring spectro-temporal features in end-to-end convolutional neural networks
By rearranging the order of operations in computing filter bank features, features can be integrated over smaller time scales while simultaneously providing better frequency resolution; all feature implementations are made available online through open-source repositories.
Keyword Spotting for Hearing Assistive Devices Robust to External Speakers
A state-of-the-art deep residual network for small-footprint KWS is regarded as a basis to build upon and extended to jointly perform KWS and users' own-voice/external speaker detection with a negligible increase in the number of parameters.
End-to-End Speech Recognition From the Raw Waveform
End-to-end models trained from the raw signal significantly outperform mel filterbanks on a large-vocabulary task under clean recording conditions, and the trainable filterbanks show a consistent improvement in word error rate relative to comparable mel filterbanks.
Learning the speech front-end with raw waveform CLDNNs
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
Discriminative frequency filter banks learning with neural networks
  • Teng Zhang, Ji Wu
  • Computer Science
    EURASIP J. Audio Speech Music. Process.
  • 2019
Experiments on audio source separation and audio scene classification tasks show performance improvements of the proposed filter banks compared with traditional fixed-parameter triangular or Gaussian filters on the mel scale.
Interpretable Convolutional Filters with SincNet
This paper proposes SincNet, a novel Convolutional Neural Network that encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions, and shows that the proposed architecture converges faster, performs better, and is more interpretable than standard CNNs.
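SincNet's first layer constrains each convolutional kernel to an ideal band-pass impulse response, so only two cutoff frequencies per filter are learned instead of every tap. A hedged sketch of that impulse response follows; the function name, window choice, and defaults are illustrative assumptions, not SincNet's exact implementation.

```python
import numpy as np

def sinc_bandpass(f1, f2, kernel_size=251, sr=16000):
    """Band-pass impulse response in the style of SincNet's first layer.

    g[n] = 2*f2*sinc(2*f2*n) - 2*f1*sinc(2*f1*n), then windowed.
    Only the cutoffs f1 < f2 (in Hz) would be learned per filter;
    the kernel length stays fixed. np.sinc is the normalized sinc,
    sin(pi*x)/(pi*x), matching this formulation.
    """
    # Time axis in seconds, centered on zero for a symmetric kernel.
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sr
    low = 2.0 * f1 * np.sinc(2.0 * f1 * t)
    high = 2.0 * f2 * np.sinc(2.0 * f2 * t)
    # Hamming window smooths the band edges of the truncated ideal filter.
    return (high - low) * np.hamming(kernel_size)

h = sinc_bandpass(300.0, 3400.0)  # e.g., a telephone-band filter
```

In the trainable setting, `f1` and `f2` are the layer's only free parameters per filter, which is what makes the learned front-end compact and interpretable.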
Small-Footprint Keyword Spotting on Raw Audio Data with Sinc-Convolutions
The end-to-end architecture extracts spectral features using parametrized sinc-convolutions and achieves a competitive accuracy of 96.4% on Google’s Speech Commands test set with only 62k parameters.
DNN Filter Bank Cepstral Coefficients for Spoofing Detection
This work proposes a new filter bank-based cepstral feature, deep neural network (DNN) filter bank cepstral coefficients, to distinguish between natural and spoofed speech; experimental results show that a Gaussian mixture model maximum-likelihood classifier trained on the new feature outperforms the state-of-the-art classifier based on linear frequency triangular filter bank cepstral coefficients, especially on detecting unknown attacks.
Deep Residual Learning for Small-Footprint Keyword Spotting
  • Raphael Tang, Jimmy J. Lin
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work explores the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently released Google Speech Commands Dataset as a benchmark, and establishes an open-source state-of-the-art reference to support the development of future speech-based interfaces.