Corpus ID: 966801

Learning the speech front-end with raw waveform CLDNNs

@inproceedings{Sainath2015LearningTS,
  title={Learning the speech front-end with raw waveform CLDNNs},
  author={T. Sainath and Ron J. Weiss and A. Senior and K. Wilson and Oriol Vinyals},
  booktitle={INTERSPEECH},
  year={2015}
}
Learning an acoustic model directly from the raw waveform has been an active area of research. [...] Key Method: Specifically, we will show the benefit of the CLDNN, namely the time convolution layer in reducing temporal variations, the frequency convolution layer for preserving locality and reducing frequency variations, as well as the LSTM layers for temporal modeling. In addition, by stacking raw waveform features with log-mel features, we achieve a 3% relative reduction in word error rate.
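As a rough illustration of the architecture the abstract describes, the sketch below assembles a raw-waveform CLDNN in PyTorch: a time convolution acting as a learned filterbank, a frequency convolution over each frame's filterbank outputs, LSTM layers for temporal modeling, and a fully connected output layer. This is a minimal sketch, not the authors' implementation; the class name and all layer sizes, kernel widths, and the pooling scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RawWaveformCLDNN(nn.Module):
    """Minimal sketch of a raw-waveform CLDNN acoustic model.

    Hyperparameters (filter counts, kernel sizes, LSTM widths) are
    illustrative assumptions, not the values from the paper.
    """

    def __init__(self, n_filters=40, n_freq_maps=64, n_classes=42):
        super().__init__()
        # Time convolution over the raw waveform: a learned filterbank.
        # With 16 kHz input, a 400-sample kernel and 160-sample stride
        # mimic 25 ms windows at a 10 ms hop.
        self.time_conv = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        # Frequency convolution applied along the filterbank axis of each
        # frame, preserving locality and reducing frequency variations.
        self.freq_conv = nn.Conv1d(1, n_freq_maps, kernel_size=8)
        # LSTM layers for temporal modeling.
        self.lstm = nn.LSTM(n_freq_maps, 256, num_layers=2, batch_first=True)
        # Fully connected output layer over acoustic states.
        self.fc = nn.Linear(256, n_classes)

    def forward(self, wav):                                # (B, samples)
        x = torch.relu(self.time_conv(wav.unsqueeze(1)))   # (B, F, T)
        B, F, T = x.shape
        # Treat each frame independently so the convolution runs along
        # the frequency (filterbank) axis rather than along time.
        y = x.permute(0, 2, 1).reshape(B * T, 1, F)        # (B*T, 1, F)
        y = torch.relu(self.freq_conv(y))                  # (B*T, C, F')
        y = y.max(dim=2).values                            # pool over frequency
        y = y.reshape(B, T, -1)                            # (B, T, C)
        out, _ = self.lstm(y)                              # (B, T, 256)
        return self.fc(out)                                # per-frame logits
```

The stacking experiment mentioned in the abstract would correspond here to concatenating the learned per-frame features with precomputed log-mel frames before the LSTM.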
Fully Convolutional Speech Recognition
TLDR
This paper presents an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling, trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether.
Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing
TLDR
This paper extends the waveform-based NN model with a second level of time-convolutional elements, which generalizes the envelope extraction block and allows the model to learn multi-resolution representations.
Direct modeling of raw audio with DNNS for wake word detection
TLDR
This work develops a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance, and shows the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data.
Feature Learning with Raw-Waveform CLDNNs for Voice Activity Detection
TLDR
This paper proposes a novel approach to VAD that tackles feature and model selection jointly, and shows that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments.
Acoustic Model Adaptation from Raw Waveforms with Sincnet
TLDR
It is shown that the parameterisation of the SincNet layer is well suited for adaptation in practice: it can efficiently adapt with a very small number of parameters, producing error rates comparable to techniques using orders of magnitude more parameters.
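For concreteness, here is a minimal sketch of a SincNet-style layer in the same PyTorch style: each band-pass filter is determined by only two learnable scalars (low cutoff and bandwidth), which is what makes adaptation with very few parameters possible. Initialization, windowing, and normalization choices below are assumptions, not the exact SincNet recipe.

```python
import torch
import torch.nn as nn

class SincLayer(nn.Module):
    """Sketch of a SincNet-style convolution over the raw waveform.

    Each filter is an ideal band-pass parameterised by just two learnable
    values, so the whole filterbank has 2 * n_filters parameters.
    """

    def __init__(self, n_filters=40, kernel_size=251, sample_rate=16000):
        super().__init__()
        # Learnable low cutoff and bandwidth, in Hz (assumed init).
        self.low_hz = nn.Parameter(torch.linspace(30.0, 7000.0, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        # Fixed time axis (seconds) and Hamming window; not learned.
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, wav):                       # wav: (B, 1, samples)
        f1 = torch.abs(self.low_hz)               # low cutoff, Hz
        f2 = f1 + torch.abs(self.band_hz)         # high cutoff, Hz
        t = self.t.unsqueeze(0)                   # (1, kernel)
        # Band-pass impulse response = difference of two sinc low-passes.
        low = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        high = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (high - low) * self.window      # (n_filters, kernel)
        filters = filters / (2 * torch.abs(self.band_hz).unsqueeze(1))
        return nn.functional.conv1d(wav, filters.unsqueeze(1))
```

Adapting such a layer means updating only low_hz and band_hz, e.g. 80 scalars for a 40-filter bank, which is what the paper's point about adapting with orders of magnitude fewer parameters refers to.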
Acoustic Modeling from Frequency Domain Representations of Speech
TLDR
This paper proposes a frequency-domain feature-learning layer that allows acoustic model training directly from the waveform, and a new set of analytic filters using polynomial approximation that outperforms log-mel filters significantly while being equally fast.
Harmonic feature fusion for robust neural network-based acoustic modeling
TLDR
This work proposes new features, integrated into acoustic modeling, that represent which parts of the time-frequency domain have a distinct harmonic structure, since harmonic structure is only partially observed in noisy environments.
Learning Acoustic Features from the Raw Waveform for Automatic Speech Recognition
TLDR
Waveform-based ASR modeling and training are investigated and analyzed for a publicly available medium-sized dataset, namely the CHiME-4 dataset, which supplies real multichannel noisy data for training and evaluation.
Attention-based Wav2Text with feature transfer learning
TLDR
Experimental results reveal that the proposed attention-based Wav2Text model, trained directly on the raw waveform, achieves better results than an attentional encoder-decoder model trained on standard front-end filterbank features.
Multi-Span Acoustic Modelling using Raw Waveform Signals
TLDR
A novel multi-span structure for acoustic modelling based on the raw waveform is proposed, with multiple streams of CNN input layers, each processing a different span of the raw waveform signal.

References

Showing 1-10 of 21 references
Acoustic modeling with deep neural networks using raw time signal for LVCSR
TLDR
Inspired by the multi-resolution analysis layer learned automatically from raw time-signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP, and gammatone features.
Speech acoustic modeling from raw multichannel waveforms
TLDR
A convolutional neural network - deep neural network (CNN-DNN) acoustic model is presented which takes raw multichannel waveforms as input and learns a similar feature representation through supervised training, outperforming a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
Learning a better representation of speech sound waves using restricted Boltzmann machines
TLDR
A novel approach for modeling speech sound waves using a restricted Boltzmann machine (RBM) with a new type of hidden variable is presented, and initial results demonstrate phoneme recognition performance better than the current state of the art for methods based on mel-cepstrum coefficients.
Improvements to Deep Convolutional Neural Networks for LVCSR
TLDR
A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted, and an effective strategy to use dropout during Hessian-free sequence training is introduced.
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
TLDR
This paper investigates a novel approach where the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
Gammatone Features and Feature Combination for Large Vocabulary Speech Recognition
TLDR
The gammatone features presented here lead to competitive results on the EPPS English task, and considerable improvements were obtained by subsequent combination with a number of standard acoustic features, i.e. MFCC, PLP, MF-PLP, and VTLN plus voicedness.
Long short-term memory recurrent neural network architectures for large scale acoustic modeling
TLDR
The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced, and it is shown that a two-layer deep LSTM RNN, where each LSTM layer has a linear recurrent projection layer, can exceed state-of-the-art speech recognition performance.
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
TLDR
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
Deep convolutional neural networks for LVCSR
TLDR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences
TLDR
Several parametric representations of the acoustic signal were compared with regard to word recognition performance in a syllable-oriented continuous speech recognition system, and the superior performance of the mel-frequency cepstrum coefficients may be attributed to the fact that they better represent the perceptually relevant aspects of the short-term speech spectrum.
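Since log-mel and MFCC features recur throughout this list as the baseline against which the raw-waveform models are measured, a short sketch of how they are typically computed may help; this uses torchaudio's standard transforms with assumed parameter values, not the exact setups of the papers above.

```python
import torch
import torchaudio

# Assumed parameters: 16 kHz audio, 25 ms windows, 10 ms hop, 40 mel bins.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40
)
wav = torch.randn(1, 16000)           # one second of dummy audio
log_mel = torch.log(mel(wav) + 1e-6)  # (1, 40, frames): log-mel features

# MFCCs apply a DCT on top of the log-mel energies.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)
coeffs = mfcc(wav)                    # (1, 13, frames)
```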