CNN Based Query by Example Spoken Term Detection

Dhananjay Ram, Lesly Miculicich, Hervé Bourlard
In this work, we address the problem of query-by-example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art solutions usually rely on dynamic time warping (DTW) based template matching. In contrast, we propose to tackle the problem as binary classification of images. As in the DTW approach, we rely on deep neural network (DNN) based posterior probabilities as feature vectors. The posteriors from a spoken query and a test utterance are used to compute frame…
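The core idea in the abstract, pairing every query frame with every utterance frame to form a similarity matrix that is then classified as an image, can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the function name, the log dot-product similarity, and the toy posteriors are assumptions for the sketch.

```python
import numpy as np

def similarity_image(query_post, utt_post, eps=1e-8):
    """Build a frame-level similarity 'image' from two posteriorgrams.

    query_post: (m, d) posterior vectors for the spoken query
    utt_post:   (n, d) posterior vectors for the test utterance
    Returns an (m, n) matrix of log dot-product similarities; a CNN
    would then classify this matrix as match / no-match.
    """
    s = query_post @ utt_post.T      # (m, n) pairwise dot products
    return np.log(s + eps)           # log compresses the dynamic range

# toy example: 4 query frames, 6 utterance frames, 3-dim posteriors
rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(3), size=4)
u = rng.dirichlet(np.ones(3), size=6)
img = similarity_image(q, u)
print(img.shape)  # (4, 6)
```

A true match produces a quasi-diagonal dark stripe in this image, which is the visual pattern the CNN classifier learns to detect.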


Neural Network Based End-to-End Query by Example Spoken Term Detection

This article shows that CNN-based matching outperforms DTW-based matching when using bottleneck features as well, and proposes to integrate the two stages into a fully neural, end-to-end learning framework that enables their joint optimization.

Language Independent Query by Example Spoken Term Detection

This thesis exploits the low-dimensional subspace structure of the speech signal, which results from the constrained human speech production process, to generate better phone or phonological posterior features and to improve the matching algorithm beyond the state of the art.

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

This work uses multitask learning to train multilingual networks, which perform significantly better than concatenated monolingual features, and proposes residual networks (ResNets) to estimate the bottleneck features, showing significant improvements over the corresponding feed-forward network-based features.

Generalized Keyword Spotting using ASR embeddings

This work proposes to use text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training, achieving an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset.

Kernel based Matching and a Novel training approach for CNN-based QbE-STD

Kernel-based matching is proposed, using the histogram intersection kernel (HIK) as the matching metric for QbE-STD, and a CNN-based classifier is trained on size-normalized images instead of splitting them into subimages as in [6].
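The histogram intersection kernel itself reduces to a sum of element-wise minima between two posterior vectors. A minimal sketch (the function name and the toy vectors are illustrative, not from the paper):

```python
import numpy as np

def hik(p, q):
    """Histogram intersection kernel between two posterior vectors:
    the sum of element-wise minima. For two probability distributions
    it equals 1 exactly when they are identical."""
    return np.minimum(p, q).sum()

p = np.array([0.7, 0.2, 0.1])
print(hik(p, p))                          # 1.0 for identical distributions
print(hik(p, np.array([0.1, 0.2, 0.7])))  # ~0.4 for mismatched peaks
```

Compared with a raw dot product, HIK saturates at 1 and is less dominated by a single sharply peaked posterior dimension, which is one motivation for using it as the matching metric.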

CNN-Based Spoken Term Detection and Localization without Dynamic Programming

  • T. Fuchs, Yael Segal, Joseph Keshet
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The proposed algorithm infers whether a term was uttered within a given speech signal by predicting the word embeddings of various parts of the speech signal and comparing them to the word embedding of the desired term.

Two-stage spoken term detection system for under-resourced languages

A two-stage STD system is proposed that combines ASR-based phoneme sequence matching in the first stage with feature-sequence template matching at selected locations in the second stage; this helps to reduce false positives for longer query words.

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

This paper proposes approaches, inspired by Siamese networks, to better cluster audio embeddings so that those corresponding to the same linguistic unit are more compactly distributed.

DyConvMixer: Dynamic Convolution Mixer Architecture for Open-Vocabulary Keyword Spotting

The DyConvMixer model is proposed: an efficient and effective model with fewer than 200K parameters and fewer than 11M MACs that shows competitive results on the publicly available Hey-Snips and Hey-Snapdragon datasets.

High-performance Query-by-Example Spoken Term Detection on the SWS 2013 evaluation

Experimental results show that remarkable performance improvements can be achieved by using multiple examples per query and through the late (score-level) fusion of different subsystems, each based on a different set of phone posteriors.
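The late (score-level) fusion mentioned above can be illustrated with a short sketch: each subsystem's detection scores are normalized to a common scale before being averaged, since subsystems built on different phone posteriors score on different ranges. The function name, z-normalization choice, and toy scores are assumptions for illustration, not details from the paper.

```python
import numpy as np

def late_fusion(score_lists, weights=None):
    """Score-level fusion: z-normalize each subsystem's detection
    scores, then combine with a (possibly weighted) average."""
    n_sys = len(score_lists)
    weights = weights or [1.0 / n_sys] * n_sys
    fused = np.zeros(len(score_lists[0]))
    for scores, w in zip(score_lists, weights):
        s = np.asarray(scores, dtype=float)
        z = (s - s.mean()) / (s.std() + 1e-12)  # per-system normalization
        fused += w * z
    return fused

sys_a = [0.9, 0.1, 0.4]      # one subsystem's scores for 3 detections
sys_b = [120.0, 30.0, 80.0]  # another subsystem, on a different scale
print(late_fusion([sys_a, sys_b]))
```

Because both subsystems rank the first detection highest, the fused score does too; the normalization step is what makes systems on incompatible scales combinable at all.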

Sparse Subspace Modeling for Query by Example Spoken Term Detection

Three different QbE-STD systems based on sparse model recovery are investigated: one regularizes the template-matching local distances using sparse reconstruction errors, and the other two merge template matching and sparsity-based approaches to further improve performance.

Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection

A framework is proposed that ports Dirichlet process Gaussian mixture model (DPGMM) based labels to a deep neural network (DNN); it performs comparably with cross-lingual bottleneck features (BNFs) extracted from a DNN trained on 171 hours of transcribed telephone speech in another language (Mandarin Chinese).

Model-Based Unsupervised Spoken Term Detection with Spoken Queries

A set of model-based approaches for unsupervised spoken term detection (STD) with spoken queries is presented that requires neither speech recognition nor annotated data; the usefulness of acoustic segment models (ASMs) for STD in zero-resource settings and the potential of an instantly responding STD system using ASM indexing are demonstrated.

Query by Example Search on Speech at Mediaeval 2015

The task was designed to be as close as possible to a practical use case in which a user retrieves, by speaking, utterances containing a given word or short sentence, including utterances with limited inflectional variations of words, some filler content, and/or word re-orderings.

Subspace Regularized Dynamic Time Warping for Spoken Query Detection

Local DTW scores are integrated with sparse reconstruction scores to obtain a subspace-regularized distance matrix for DTW; the proposed method yields a substantial performance gain over the baseline system.

Query-by-example spoken term detection using phonetic posteriorgram templates

A query-by-example approach to spoken term detection in audio files is presented, designed for low-resource situations in which limited or no in-domain training material is available and accurate word-based speech recognition is unavailable.

Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams

An unsupervised learning framework is presented that detects spoken keywords by using segmental dynamic time warping to compare Gaussian posteriorgrams between keyword samples and test utterances.
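The DTW comparison of posteriorgrams underlying this approach (and the DTW baselines elsewhere in this list) can be sketched compactly. This is a plain DTW over one segment, not the full segmental search; the function name and the negative-log inner-product frame distance, a common choice for Gaussian posteriorgrams, are assumptions for the sketch.

```python
import numpy as np

def dtw_cost(query_post, utt_post, eps=1e-8):
    """Align a query posteriorgram against an utterance segment with DTW.
    Frame distance is -log of the posterior inner product; a lower
    length-normalized accumulated cost means a better match."""
    dist = -np.log(query_post @ utt_post.T + eps)
    m, n = dist.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[m, n] / (m + n)  # length-normalized alignment cost

# peaked toy posteriors: aligning a sequence against itself is cheaper
# than aligning it against the same frames in reversed order
q = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9]])
print(dtw_cost(q, q) < dtw_cost(q, q[::-1]))  # True
```

The segmental variant slides this alignment over candidate regions of the utterance and keeps the lowest-cost segment as the detection hypothesis.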

MediaEval 2013 Spoken Web Search Task: System Performance Measures

This paper discusses how to measure system performance in the Spoken Web Search (SWS) task at MediaEval 2013, drawing on several sources, including the NIST 2006 Spoken Term Detection (STD) Evaluation Plan.

Resource configurable spoken query detection using Deep Boltzmann Machines

A spoken query detection method is presented, based on posteriorgrams generated from Deep Boltzmann Machines (DBMs), which can be deployed in both semi-supervised and unsupervised training scenarios.