• Corpus ID: 16107228

Developing a Speech Activity Detection System for the DARPA RATS Program

@inproceedings{Ng2012DevelopingAS,
  title={Developing a Speech Activity Detection System for the DARPA RATS Program},
  author={Tim Ng and Bing Zhang and Long Nguyen and Spyridon Matsoukas and Xinhui Zhou and Nima Mesgarani and Karel Vesel{\'y} and Pavel Matejka},
  booktitle={INTERSPEECH},
  year={2012}
}
This paper describes the speech activity detection (SAD) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present two approaches to SAD, one based on Gaussian mixture models and one based on multi-layer perceptrons. We show that significant gains in SAD accuracy can be obtained by careful design of… 
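The abstract names two SAD approaches, one based on Gaussian mixture models and one based on multi-layer perceptrons. Purely as an illustration of the GMM variant, the Python sketch below trains one GMM per class on frame-level features and thresholds a per-frame log-likelihood ratio; the feature choice, model sizes, decision threshold, and hangover smoothing are assumptions for this sketch, not the configuration described in the paper.

```python
# Minimal sketch of a GMM-based speech activity detector (SAD).
# Assumes frame-level features (e.g. MFCC or PLP, shape: num_frames x feat_dim)
# have already been extracted and that labeled speech / non-speech frames are
# available for training. All settings below are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture


def train_sad_gmms(speech_feats, nonspeech_feats, n_components=32):
    """Fit one diagonal-covariance GMM per class on labeled feature frames."""
    speech_gmm = GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(speech_feats)
    nonspeech_gmm = GaussianMixture(n_components=n_components,
                                    covariance_type="diag",
                                    random_state=0).fit(nonspeech_feats)
    return speech_gmm, nonspeech_gmm


def detect_speech(feats, speech_gmm, nonspeech_gmm, threshold=0.0, hangover=11):
    """Return per-frame speech decisions from a smoothed log-likelihood ratio."""
    # Frame-level log-likelihood ratio: speech model vs. non-speech model.
    llr = speech_gmm.score_samples(feats) - nonspeech_gmm.score_samples(feats)
    raw = (llr > threshold).astype(float)
    # Simple "hangover" smoothing: moving average over a window of frames.
    kernel = np.ones(hangover) / hangover
    smoothed = np.convolve(raw, kernel, mode="same")
    return smoothed > 0.5
```

In a full system, the smoothed frame decisions would then be merged into time-stamped speech segments; that post-processing is omitted here.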

Citations

Improving the speech activity detection for the DARPA RATS phase-3 evaluation

TLDR
This paper presents the work conducted to build the speech activity detection (SAD) systems for the phase 3 evaluation of the RATS program, and reveals that bottleneck features significantly improved SAD performance on new channels.

Improvements in language identification on the RATS noisy speech corpus

This paper presents a set of techniques that we used to develop the language identification (LID) system for the second phase of the DARPA RATS (Robust Automatic Transcription of Speech) program…

Improvements to the IBM speech activity detection system for the DARPA RATS program

TLDR
Improvements to the IBM speech activity detection (SAD) system for the third phase of the DARPA RATS program come from jointly training convolutional and regular deep neural networks with rich time-frequency representations of speech.
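The improvement summarized above comes from jointly training convolutional and regular deep neural networks on rich time-frequency inputs. As a rough sketch of that general idea only (the actual IBM architecture is not specified here), the PyTorch model below feeds the same log-Mel context window to a convolutional branch and a fully connected branch and classifies their concatenated outputs; the input shape, layer sizes, and concatenation-based fusion are illustrative assumptions.

```python
# Minimal sketch of jointly training a convolutional branch and a regular
# fully connected (DNN) branch for frame-level speech activity detection.
# The input is assumed to be a log-Mel context window per frame; shapes,
# layer sizes, and fusion by concatenation are illustrative assumptions.
import torch
import torch.nn as nn


class JointConvDnnSad(nn.Module):
    def __init__(self, n_mel_bins=40, n_context_frames=21):
        super().__init__()
        # Convolutional branch: treats the context window as a 1-channel image.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),  # -> (batch, 32 * 4 * 4)
        )
        # DNN branch: operates on the flattened window directly.
        self.dnn = nn.Sequential(
            nn.Flatten(),  # -> (batch, n_mel_bins * n_context_frames)
            nn.Linear(n_mel_bins * n_context_frames, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Joint classifier over the concatenated branch outputs (speech / non-speech).
        self.classifier = nn.Linear(32 * 4 * 4 + 128, 2)

    def forward(self, x):
        # x: (batch, 1, n_mel_bins, n_context_frames)
        return self.classifier(torch.cat([self.conv(x), self.dnn(x)], dim=1))


# Both branches feed one classifier and one loss, so their weights are updated
# jointly during training:
# logits = JointConvDnnSad()(torch.randn(8, 1, 40, 21))
# loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
```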

Developing a speaker identification system for the DARPA RATS project

This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to…

Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings

TLDR
Experimental results show that the use of the resulting SAD module leads to a slight improvement in transcription accuracy and a significant reduction in the computation time needed for transcription.

A phonetically aware system for speech activity detection

TLDR
This paper proposes a novel two-stage approach to SAD that attempts to model phonetic information in the signal more explicitly than in current systems, and tests performance on matched and mismatched channels.

Patrol Team Language Identification System for DARPA RATS P1 Evaluation

This paper describes the language identification (LID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to…

All for one: feature combination for highly channel-degraded speech activity detection

TLDR
This paper presents a feature combination approach to improve SAD on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency's (DARPA) Robust Automatic Transcription of Speech (RATS) program, evaluating single, pairwise, and all-feature combinations.

Acoustic and Data-driven Features for Robust Speech Activity Detection

TLDR
The proposed front-end is used to train SAD systems based on Gaussian mixture models for processing speech from multiple languages transmitted over noisy radio communication channels under the ongoing DARPA Robust Automatic Transcription of Speech (RATS) program, and it performs significantly better than standard acoustic feature extraction techniques in these noisy conditions.

The IBM RATS phase II speaker recognition system: overview and analysis

TLDR
IBM’s submission for the Phase II speaker recognition evaluation of the DARPA-sponsored Robust Automatic Transcription of Speech (RATS) program is examined; the results indicate that, for the 30s-30s task, the overall system performed better than the best single system.
...

References


Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system

TLDR
The key components of the system include an HMM-based automatic segmentation module using a novel set of LDA-transformed voicing and energy features, and a multiple-pass decoding strategy that applies several speaker- and environment-normalization operations to deal with the highly variable acoustics of the evaluation.

Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations

TLDR
A content-based audio classification algorithm is described that uses novel multiscale spectro-temporal modulation features, inspired by a model of auditory cortical processing, to discriminate speech from nonspeech consisting of animal vocalizations, music, and environmental sounds.

The segmentation of multi-channel meeting recordings for automatic speech recognition

TLDR
This paper presents a system for the automatic segmentation of multiple-channel individual headset microphone (IHM) meeting recordings for automatic speech recognition; it relies on an MLP classifier trained on several meeting room corpora to identify speech/non-speech segments of the recordings.

Fast speaker change detection for broadcast news transcription and indexing

TLDR
A new speaker change detection algorithm is designed for fast transcription and audio indexing of spoken broadcast news; it begins with a gender-independent phone-class recognition pass and hypothesizes a speaker change boundary between every phone in the labeled input.

Perceptual linear predictive (PLP) analysis of speech.

  • H. Hermansky
  • The Journal of the Acoustical Society of America
  • 1990
TLDR
A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.

The RATS radio traffic collection system

TLDR
A system is described that takes a clean source signal and transmits it over eight different radio channels, where the variation from channel to channel results in a range of degradation modes, in order to build a corpus addressing this research question.

Hierarchical Structures of Neural Networks for Phoneme Recognition

TLDR
This paper deals with phoneme recognition based on neural networks (NN), focusing on temporal patterns (TRAPs) and novel split temporal context (STC) phoneme recognizers, and investigates tandem NN architectures.

A generalization of linear discriminant analysis in maximum likelihood framework

  • Johns Hopkins University, Tech. Rep., 1996.
  • 1996