Exemplar-Based Processing for Speech Recognition: An Overview
TLDR
The goal of modeling is to establish a generalization from the set of observed data such that accurate inferences can be made about data yet to be observed, referred to as unseen data.
Automatic acoustic synthesis of human-like laughter.
TLDR
A technique to synthesize laughter based on time-domain behavior of real instances of human laughter is presented, and results of subjective tests to assess the acceptability and naturalness of the synthetic laughter relative to real human laughter samples are presented.
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
TLDR
New acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly are developed and incorporated into the acoustic model.
Multi-geometry Spatial Acoustic Modeling for Distant Speech Recognition
TLDR
This work proposes to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input and demonstrates the effectiveness of such MC neural networks through ASR experiments on the real-world far-field data.
Audio retrieval by latent perceptual indexing
TLDR
A query-by-example audio retrieval framework by indexing audio clips in a generic database as points in a latent perceptual space, which reveals that the system performance is comparable to other proposed methods.
Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition
TLDR
This work learns multi-modal representations using a transformer trained on the masked language modeling task with audio, visual and text features that can improve the emotion recognition performance by up to 3% compared to the baseline.
Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-student Learning
TLDR
This work adopts the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition performance under multimedia noise and applies a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher.
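The logits selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and it assumes the teacher's soft targets are formed by keeping only the k largest logits, masking the rest to negative infinity so they receive zero probability after the softmax.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def topk_soft_targets(teacher_logits, k):
    """Keep only the k highest teacher logits; mask the rest to -inf
    so they get zero probability after renormalization."""
    logits = np.asarray(teacher_logits, dtype=float)
    masked = np.full_like(logits, -np.inf)
    top_idx = np.argsort(logits)[-k:]  # indices of the k largest logits
    masked[top_idx] = logits[top_idx]
    return softmax(masked)             # renormalized soft targets

# Example: keep the 2 strongest teacher posteriors.
probs = topk_soft_targets([2.0, 0.5, -1.0, 3.0, 0.1], k=2)
```

The masking ensures the student is not trained to match the teacher's low-confidence tail, which is the "wrong emphasis" the selection method is meant to prevent.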
Multimodal and Multiresolution Speech Recognition with Transformers
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture, with a particular focus on the scene context provided by the visual information.
Saliency-driven unstructured acoustic scene classification using latent perceptual indexing
TLDR
Results on the BBC sound effects library indicate that using the saliency-driven attention selection approach presented in this paper, a 17.5% relative improvement can be obtained in frame-based classification and a 25% relative improvement can be obtained using the latent audio indexing approach.
Emotion classification in children's speech using fusion of acoustic and linguistic features
TLDR
A system to detect angry vs. non-angry utterances of children engaged in dialog with an Aibo robot dog is presented, as submitted to the Interspeech 2009 Emotion Challenge evaluation.
...