Corpus ID: 16107228

Developing a Speech Activity Detection System for the DARPA RATS Program

@inproceedings{Ng2012DevelopingAS,
  title={Developing a Speech Activity Detection System for the DARPA RATS Program},
  author={Tim Ng and Bing Zhang and Long Nguyen and Spyridon Matsoukas and Xinhui Zhou and Nima Mesgarani and Karel Vesel{\'y} and Pavel Matejka},
  booktitle={INTERSPEECH},
  year={2012}
}
This paper describes the speech activity detection (SAD) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present two approaches to SAD, one based on Gaussian mixture models and one based on multi-layer perceptrons. We show that significant gains in SAD accuracy can be obtained by careful design of… 
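The two modeling routes named in the abstract (GMMs and MLPs) follow a fairly standard recipe. The sketch below illustrates only the generic GMM route, with assumed log band-energy features, assumed model sizes, and a simple median smoothing of the frame-level log-likelihood ratio; it is not the Patrol team's actual front end or configuration.

```python
# Minimal sketch of a GMM-based SAD pipeline (generic illustration only).
import numpy as np
from sklearn.mixture import GaussianMixture

def frame_features(signal, sr=8000, frame_len=0.025, hop=0.010, n_bands=8):
    """Log band-energy features per frame (a stand-in for PLP/MFCC front ends)."""
    win = int(frame_len * sr)
    step = int(hop * sr)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win, step)]
    feats = []
    for f in frames:
        spec = np.abs(np.fft.rfft(f * np.hamming(win))) ** 2
        bands = np.array_split(spec, n_bands)
        feats.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(feats)

# Two diagonal-covariance GMMs: one trained on speech frames, one on non-speech.
speech_gmm = GaussianMixture(n_components=32, covariance_type="diag")
nonspeech_gmm = GaussianMixture(n_components=32, covariance_type="diag")
# speech_feats / nonspeech_feats are assumed to come from labeled training audio:
# speech_gmm.fit(speech_feats); nonspeech_gmm.fit(nonspeech_feats)

def detect_speech(feats, threshold=0.0, smooth=11):
    """Frame-level log-likelihood ratio followed by median smoothing."""
    llr = speech_gmm.score_samples(feats) - nonspeech_gmm.score_samples(feats)
    pad = smooth // 2
    padded = np.pad(llr, pad, mode="edge")
    # simple median filter to remove isolated frame-level errors
    smoothed = np.array([np.median(padded[i:i + smooth]) for i in range(len(llr))])
    return smoothed > threshold
```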

Citations

Improving the speech activity detection for the DARPA RATS phase-3 evaluation
This paper presents the work conducted to build the speech activity detection (SAD) systems for the phase 3 evaluation of the RATS program, and shows that bottleneck features significantly improve SAD performance on new channels.
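As a rough illustration of the bottleneck idea mentioned above, the sketch below trains an MLP with a deliberately narrow hidden layer and reuses that layer's activations as features. Layer sizes, targets, and activations are illustrative assumptions, not the configuration of the phase-3 systems.

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """MLP frame classifier with a narrow 'bottleneck' layer whose activations
    are later reused as features for a downstream SAD back end."""
    def __init__(self, n_in=440, n_hidden=1024, n_bottleneck=40, n_targets=2):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck),       # the bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_targets),          # frame targets (e.g. speech / non-speech)
        )

    def forward(self, x):
        return self.back(self.front(x))              # used during training

    def bottleneck_features(self, x):
        with torch.no_grad():
            return self.front(x)                     # reused as input features downstream
```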
Improvements in language identification on the RATS noisy speech corpus
This paper presents a set of techniques that we used to develop the language identification (LID) system for the second phase of the DARPA RATS (Robust Automatic Transcription of Speech) program…
Improvements to the IBM speech activity detection system for the DARPA RATS program
Improvements to the IBM speech activity detection (SAD) system for the third phase of the DARPA RATS program come from jointly training convolutional and regular deep neural networks with rich time-frequency representations of speech.
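A minimal sketch of the joint training idea described above, assuming a log-mel patch for the convolutional branch and stacked frame features for the fully connected branch; every dimension and layer count here is an illustrative guess rather than the IBM configuration.

```python
import torch
import torch.nn as nn

class JointCNNDNN(nn.Module):
    """Toy joint CNN + DNN SAD model: a convolutional branch over a
    time-frequency patch and a fully connected branch over stacked frame
    features share a common output layer, so both are trained jointly."""
    def __init__(self, n_mels=40, n_ctx=21, dnn_in=200, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 5), padding=2), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 32, kernel_size=(3, 3), padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        cnn_out = 32 * (n_mels // 2) * (n_ctx // 2)
        self.dnn = nn.Sequential(
            nn.Linear(dnn_in, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.out = nn.Linear(cnn_out + 512, n_classes)

    def forward(self, patch, frame_feats):
        # patch: (batch, 1, n_mels, n_ctx) log-mel patch; frame_feats: (batch, dnn_in)
        return self.out(torch.cat([self.cnn(patch), self.dnn(frame_feats)], dim=1))
```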
Developing a speaker identification system for the DARPA RATS project
This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to…
Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings
Experimental results show that the use of the resulting SAD module leads to a slight improvement in transcription accuracy and a significant reduction in the computation time needed for transcription.
A phonetically aware system for speech activity detection
This paper proposes a novel two-stage approach to SAD that attempts to model phonetic information in the signal more explicitly than current systems do, and tests performance on matched and mismatched channels.
Patrol Team Language Identification System for DARPA RATS P1 Evaluation
This paper describes the language identification (LID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to…
All for one: feature combination for highly channel-degraded speech activity detection
This paper presents a feature combination approach to improve SAD on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency's (DARPA) Robust Automatic Transcription of Speech (RATS) program, evaluating single, pairwise, and all-feature combinations.
Acoustic and Data-driven Features for Robust Speech Activity Detection
The proposed front end performs significantly better than standard acoustic feature extraction techniques in such noisy conditions and is used to train GMM-based SAD systems for speech from multiple languages transmitted over noisy radio communication channels under the ongoing DARPA Robust Automatic Transcription of Speech (RATS) program.
Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021
Experimental results show that features learned via unsupervised learning provide a much more robust representation, significantly reducing the mismatch observed between development and evaluation partition results.

References

Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system
The key components of the system include an HMM-based automatic segmentation module using a novel set of LDA-transformed voicing and energy features, and a multiple-pass decoding strategy that uses several speaker- and environment-normalization operations to deal with the highly variable acoustics of the evaluation.
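To make that front end concrete, here is a rough sketch of per-frame energy and voicing measures projected with LDA. The particular features, the LDA setup, and the omitted two-state HMM back end are all simplifying assumptions on my part.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def voicing_energy_feats(frames):
    """frames: (n_frames, frame_len) windowed samples, frame_len of a few hundred samples."""
    feats = []
    for f in frames:
        energy = np.log(np.sum(f ** 2) + 1e-10)
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        ac = ac / (ac[0] + 1e-10)
        voicing = ac[20:].max()   # peak normalized autocorrelation beyond a short lag
        zcr = np.mean(np.abs(np.diff(np.sign(f))) > 0)   # zero-crossing rate
        feats.append([energy, voicing, zcr])
    return np.array(feats)

# LDA projection fitted on frames with speech / non-speech labels; the projected
# features would then feed a two-state HMM segmenter (omitted here):
# lda = LinearDiscriminantAnalysis(n_components=1).fit(feats_train, labels_train)
# projected = lda.transform(voicing_energy_feats(test_frames))
```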
Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations
A content-based audio classification algorithm is described that discriminates speech from nonspeech (animal vocalizations, music, and environmental sounds) using novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing.
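A hedged sketch of what such modulation features can look like in practice: a log spectrogram is filtered with a small bank of 2D Gabor kernels tuned to different temporal rates and spectral scales, and the filter responses are pooled. The kernel parameterization and the pooling are generic assumptions, not the cortical model used in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(rate_hz, scale_cpo, frame_rate=100.0, bins_per_oct=12, size=(25, 25)):
    """2D Gabor tuned to a temporal rate (Hz) and spectral scale (cycles/octave)."""
    t = (np.arange(size[1]) - size[1] // 2) / frame_rate        # seconds
    f = (np.arange(size[0]) - size[0] // 2) / bins_per_oct      # octaves
    T, F = np.meshgrid(t, f)
    envelope = np.exp(-(T ** 2) / (2 * (0.5 / rate_hz) ** 2)
                      - (F ** 2) / (2 * (0.5 / scale_cpo) ** 2))
    carrier = np.cos(2 * np.pi * (rate_hz * T + scale_cpo * F))
    return envelope * carrier

def modulation_features(log_spec, rates=(2, 4, 8, 16), scales=(0.5, 1, 2)):
    """log_spec: (freq_bins, time_frames). Returns mean |response| per filter."""
    feats = []
    for r in rates:
        for s in scales:
            resp = fftconvolve(log_spec, gabor_kernel(r, s), mode="same")
            feats.append(np.mean(np.abs(resp)))
    return np.array(feats)
```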
The segmentation of multi-channel meeting recordings for automatic speech recognition
This paper presents a system for the automatic segmentation of multiple-channel individual headset microphone (IHM) meeting recordings for automatic speech recognition that relies on an MLP classifier trained from several meeting room corpora to identify speech/non-speech segments of the recordings.
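A minimal sketch of the MLP segmentation idea above, assuming stacked context frames as input and a single hidden layer; the context width, layer size, and 0.5 decision threshold are illustrative choices, and the minimum-duration smoothing a real segmenter would apply is only noted in a comment.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(feats, ctx=5):
    """Concatenate each frame with its +/- ctx neighbouring frames."""
    padded = np.pad(feats, ((ctx, ctx), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * ctx + 1)])

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
# clf.fit(stack_context(train_feats), train_labels)   # labels: 1 = speech, 0 = non-speech
# posteriors = clf.predict_proba(stack_context(test_feats))[:, 1]
# segments = posteriors > 0.5   # followed by minimum-duration smoothing in practice
```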
Fast speaker change detection for broadcast news transcription and indexing
A new speaker change detection algorithm is designed for fast transcription and audio indexing of spoken broadcast news; it begins with a gender-independent phone-class recognition pass and hypothesizes a speaker change boundary between every pair of phones in the labeled input.
Perceptual linear predictive (PLP) analysis of speech.
H. Hermansky, The Journal of the Acoustical Society of America, 1990
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is described; it uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.
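For orientation only, the following is a deliberately simplified PLP-style flow: it replaces Hermansky's Bark warping and equal-loudness pre-emphasis with a crude equal-width band integration and stops at the predictor coefficients, so it shows the shape of the computation rather than the published recipe.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def plp_like(frame, n_bands=20, order=12):
    """frame: a few hundred windowed samples (e.g. 25 ms at 8 kHz)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # crude critical-band integration: equal-width groups stand in for Bark bands
    edges = np.linspace(0, len(spec), n_bands + 1, dtype=int)
    bands = np.array([spec[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])
    loud = np.cbrt(bands)                 # intensity-to-loudness (cube-root) compression
    r = np.fft.irfft(loud)                # autocorrelation of the compressed spectrum
    r[0] += 1e-8                          # numerical floor
    # all-pole (LPC) model: solve the Toeplitz normal equations R a = r
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return a                              # predictor coefficients (cepstral conversion omitted)
```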
The RATS radio traffic collection system
A system is described that takes a clean source signal and transmits it over eight different radio channels, where the variation from channel to channel results in a range of degradation modes, in order to build a corpus for research on speech processing over degraded channels.
Hierarchical Structures of Neural Networks for Phoneme Recognition
This paper deals with phoneme recognition based on neural networks (NN), focusing on temporal patterns (TRAPs) and novel split temporal context (STC) phoneme recognizers, and investigates tandem NN architectures.
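A hedged sketch of the split-temporal-context idea summarized above: two MLPs process the left and right halves of a long temporal context of band energies, and a merger network combines their outputs. The band count, context length, layer sizes, and number of targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SplitContextNet(nn.Module):
    def __init__(self, n_bands=15, ctx=25, n_hidden=500, n_targets=2):
        super().__init__()
        half = n_bands * ctx                 # left or right context block, flattened
        self.left = nn.Sequential(nn.Linear(half, n_hidden), nn.Sigmoid(),
                                  nn.Linear(n_hidden, n_targets))
        self.right = nn.Sequential(nn.Linear(half, n_hidden), nn.Sigmoid(),
                                   nn.Linear(n_hidden, n_targets))
        self.merger = nn.Sequential(nn.Linear(2 * n_targets, n_hidden), nn.Sigmoid(),
                                    nn.Linear(n_hidden, n_targets))

    def forward(self, left_ctx, right_ctx):
        # left_ctx, right_ctx: (batch, n_bands * ctx) flattened band-energy trajectories
        return self.merger(torch.cat([self.left(left_ctx), self.right(right_ctx)], dim=1))
```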
Maximum likelihood discriminant feature spaces
A new approach to heteroscedastic discriminant analysis (HDA) is presented by defining an objective function which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions, and it is shown that, under diagonal-covariance Gaussian modeling constraints, applying a diagonalizing linear transformation to the HDA space increases classification accuracy even though HDA alone actually degrades recognition performance.
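For readers unfamiliar with HDA, one commonly quoted way to write a heteroscedastic objective of this kind is sketched below. This is a paraphrase of the idea in the summary (per-class covariances replacing LDA's pooled within-class scatter), not necessarily the exact formulation used in the paper.

```latex
% Hedged sketch: a heteroscedastic extension of Fisher's criterion.
% \theta is the p x n projection onto the retained dimensions, \Sigma_b the
% between-class scatter, \Sigma_j and N_j the covariance and sample count of
% class j; the rejected n - p dimensions do not enter the objective.
\hat{\theta} \;=\; \arg\max_{\theta}\; \sum_{j=1}^{J} N_j \,
  \log \frac{\left|\, \theta\, \Sigma_b\, \theta^{\top} \right|}
            {\left|\, \theta\, \Sigma_j\, \theta^{\top} \right|}
```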
A generalization of linear discriminant analysis in maximum likelihood framework
Johns Hopkins University, Tech. Rep., 1996