CNN Based Query by Example Spoken Term Detection
@inproceedings{Ram2018CNNBQ,
  title={CNN Based Query by Example Spoken Term Detection},
  author={Dhananjay Ram and Lesly Miculicich and Herv{\'e} Bourlard},
  booktitle={Interspeech},
  year={2018}
}
In this work, we address the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art solutions usually rely on dynamic time warping (DTW) based template matching. In contrast, we propose to tackle the problem as binary classification of images. Similar to the DTW approach, we rely on deep neural network (DNN) based posterior probabilities as feature vectors. The posteriors from a spoken query and a test utterance are used to compute frame…
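The DTW based template matching that the abstract contrasts against can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the negative-log dot-product frame distance, and the normalization by query length are all assumptions chosen for illustration. The sketch uses the subsequence variant of DTW, where the query may start and end at any frame of the test utterance.

```python
import math

def frame_distance(p, q, eps=1e-10):
    """Assumed local distance: negative log of the dot product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))

def subsequence_dtw_score(query, utterance):
    """Lowest length-normalized cost of aligning the whole query to any
    contiguous region of the utterance (lower means a better match)."""
    n, m = len(query), len(utterance)
    inf = float("inf")
    # acc[i][j]: best accumulated cost of aligning query[:i] ending at utterance frame j.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    # The query may start at any utterance frame, so the first row costs nothing.
    acc[0] = [0.0] * (m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], utterance[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # The query may end at any utterance frame; normalize by query length (an assumption).
    return min(acc[n][1:]) / n
```

A detection decision would then compare this score against a threshold tuned on held-out data; the threshold itself is outside the scope of this sketch.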
15 Citations
Neural Network Based End-to-End Query by Example Spoken Term Detection
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2020
This article shows that CNN based matching also outperforms DTW based matching when bottleneck features are used, and proposes integrating the feature extraction and matching stages in a fully neural network based end-to-end learning framework so that both stages can be optimized jointly.
Language Independent Query by Example Spoken Term Detection
- Computer Science
- 2019
This thesis exploits the low-dimensional subspace structure of the speech signal, which results from the constrained human speech production process, to improve over the state of the art by generating better phone or phonological posterior features and by improving the matching algorithm.
Phonetic subspace features for improved query by example spoken term detection
- Computer Science
- Speech Commun.
- 2018
Multilingual Bottleneck Features for Query by Example Spoken Term Detection
- Computer Science
- 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
This work uses multitask learning to train multilingual networks, which perform significantly better than concatenated monolingual features, and proposes residual networks (ResNet) for estimating the bottleneck features, showing significant improvements over the corresponding feed-forward network based features.
Generalized Keyword Spotting using ASR embeddings
- Computer Science, Education
- INTERSPEECH
- 2022
This work proposes to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training, which achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset.
Kernel based Matching and a Novel training approach for CNN-based QbE-STD
- Computer Science
- 2020 International Conference on Signal Processing and Communications (SPCOM)
- 2020
Kernel based matching is proposed, using the histogram intersection kernel (HIK) as the matching metric for QbE-STD, together with a training approach in which the CNN-based classifier is trained on size-normalized images instead of splitting them into subimages as in [6].
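For readers unfamiliar with the histogram intersection kernel, the following is a minimal sketch of the standard definition, the sum of element-wise minima of two histograms. How the cited paper applies it to query and utterance features is not stated in this summary, so treating posterior vectors as the histograms and building a frame-by-frame similarity matrix are assumptions made for illustration; the function names are hypothetical.

```python
def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima.
    For two probability vectors the value lies in [0, 1]."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same length")
    return sum(min(a, b) for a, b in zip(x, y))

def similarity_matrix(query, utterance):
    """Assumed usage: HIK similarity between every query frame and every utterance frame."""
    return [[histogram_intersection(q, u) for u in utterance] for q in query]
```

Under this assumption, identical posterior vectors score 1 and vectors with disjoint support score 0, so the matrix can be read as a similarity image.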
CNN-Based Spoken Term Detection and Localization without Dynamic Programming
- Computer Science
- ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
The proposed algorithm infers whether a term was uttered within a given speech signal or not by predicting the word embeddings of various parts of the speech signal and comparing them to the word embedding of the desired term.
Two-stage spoken term detection system for under-resourced languages
- Computer Science
- IET Signal Process.
- 2020
A two-stage STD system is proposed that combines ASR-based phoneme sequence matching in the first stage with feature sequence template matching of selected locations in the second stage, which helps to reduce false positives in the case of longer query words.
Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection
- Computer Science
- ArXiv
- 2018
Inspired by Siamese networks, this paper proposes approaches to better cluster audio embeddings so that those corresponding to the same linguistic unit are more compactly distributed.
DyConvMixer: Dynamic Convolution Mixer Architecture for Open-Vocabulary Keyword Spotting
- Computer Science
- INTERSPEECH
- 2022
The DyConvMixer model is proposed, an efficient model that has fewer than 200K parameters, uses fewer than 11M MACs, and shows competitive results on the publicly available Hey-Snips and Hey-Snapdragon datasets.
References
High-performance Query-by-Example Spoken Term Detection on the SWS 2013 evaluation
- Computer Science
- 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2014
Experimental results show that remarkable performance improvements can be achieved by using multiple examples per query and through the late (score-level) fusion of different subsystems, each based on a different set of phone posteriors.
Sparse Subspace Modeling for Query by Example Spoken Term Detection
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2018
Three different QbE-STD systems based on sparse model recovery are investigated, one of which proposes to regularize the template matching local distances using sparse reconstruction errors, and the other two aim at merging template matching and sparsity-based approaches to further improve the performance.
Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection
- Computer Science
- INTERSPEECH
- 2016
A framework is presented which ports Dirichlet process Gaussian mixture model (DPGMM) based labels to a deep neural network (DNN) and performs comparably with the cross-lingual bottleneck features (BNFs) extracted from a DNN trained using 171 hours of transcribed telephone speech in another language (Mandarin Chinese).
Model-Based Unsupervised Spoken Term Detection with Spoken Queries
- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2013
A set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries is presented that requires neither speech recognition nor annotated data. The usefulness of acoustic segment models (ASMs) for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing are demonstrated.
Query by Example Search on Speech at Mediaeval 2015
- Computer Science
- MediaEval
- 2014
The task has been designed to get as close as possible to a practical use case scenario, in which a user would like to retrieve, using speech, utterances containing a given word or short sentence, including those with limited inflectional variations of words, some filler content and/or word re-orderings.
Subspace Regularized Dynamic Time Warping for Spoken Query Detection
- Computer Science
- 2017
Local DTW scores are integrated with the sparse reconstruction scores to obtain a subspace regularized distance matrix for DTW and the proposed method yields a substantial performance gain over the baseline system.
Query-by-example spoken term detection using phonetic posteriorgram templates
- Computer Science
- 2009 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2009
A query-by-example approach to spoken term detection in audio files designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition capability is unavailable.
Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams
- Computer Science, Economics
- 2009 IEEE Workshop on Automatic Speech Recognition & Understanding
- 2009
An unsupervised learning framework is presented to address the problem of detecting spoken keywords by using segmental dynamic time warping to compare the Gaussian posteriorgrams between keyword samples and test utterances and obtaining the keyword detection result.
MediaEval 2013 Spoken Web Search Task: System Performance Measures
- Computer Science
- 2013
How to measure system performance in the Spoken Web Search (SWS) task at MediaEval 2013 is discussed, based on different sources, including the NIST 2006 Spoken Term Detection (STD) Evaluation Plan.
Resource configurable spoken query detection using Deep Boltzmann Machines
- Computer Science
- 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2012
A spoken query detection method based on posteriorgrams generated from Deep Boltzmann Machines (DBMs) that can be deployed in both semi-supervised and unsupervised training scenarios.