• Corpus ID: 27194751

Convolutional neural networks for acoustic modeling of raw time signal in LVCSR

  Pavel Golik, Zoltán Tüske, Ralf Schlüter, Hermann Ney
In this paper we continue to investigate how deep neural network (DNN) based acoustic models for automatic speech recognition can be trained without hand-crafted feature extraction. This also allows us to interpret the weights of the second convolutional layer in the same way as 2D patches learned on critical band energies by typical convolutional neural networks. The evaluation is performed on an English LVCSR task. Trained on the raw time signal, the convolutional layers allow us to reduce…
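As a rough illustration (not the paper's actual implementation), the first time-convolutional layer of such a raw-waveform front-end can be sketched in numpy: a bank of learned kernels is slid over the raw samples with a fixed frame shift, and rectification plus log compression yields a critical-band-energy-like 2D map that a second convolutional layer can treat like image patches. All sizes here (50 filters, 16 ms kernels, 10 ms shift) are hypothetical.

```python
import numpy as np

def conv1d_frontend(signal, filters, stride):
    """Strided 1D convolution of a raw waveform with a bank of filters.

    signal : (T,) raw samples
    filters: (F, K) learned time-domain kernels (hypothetical sizes)
    returns: (F, T_out) feature map, one row per filter
    """
    F, K = filters.shape
    T_out = (len(signal) - K) // stride + 1
    out = np.empty((F, T_out))
    for t in range(T_out):
        window = signal[t * stride : t * stride + K]
        out[:, t] = filters @ window  # all filters applied to one frame
    return out

# toy example: 50 filters of 16 ms at 16 kHz (K=256), 10 ms shift (stride=160)
rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)           # 1 s of synthetic "audio"
bank = rng.standard_normal((50, 256))       # stand-in for learned kernels
fmap = conv1d_frontend(wave, bank, stride=160)

# rectification + log compression gives a 2D time-frequency-like map
energies = np.log1p(np.abs(fmap))
print(energies.shape)  # (50, 99)
```

A second 2D convolutional layer applied to `energies` would then see patches analogous to those learned on critical band energies, which is the interpretation the paper draws.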

Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing

This paper extends the waveform-based NN model with a second level of time-convolutional elements, which generalizes the envelope extraction block and allows the model to learn multi-resolution representations.

Multi-timescale Feature-extraction Architecture of Deep Neural Networks for Acoustic Model Training from Raw Speech Signal

A multi-timescale feature-extraction architecture of DNNs with blocks operating at different time scales, which capture long- and short-term patterns of speech signals and reduced the word error rate by about 3% in experiments.

Acoustic Modelling from the Signal Domain Using CNNs

The resulting ‘direct-from-signal’ network is competitive with state-of-the-art networks based on conventional features with iVector adaptation and, unlike some previous work on learned feature extractors, its objective function converges as fast as that of a network based on traditional features.

Very deep convolutional neural networks for raw waveforms

This work proposes very deep convolutional neural networks that directly use time-domain waveforms as inputs that are efficient to optimize over very long sequences, necessary for processing acoustic waveforms.

Convolutional Neural Networks for Raw Speech Recognition

A CNN-based acoustic model for the raw speech signal is discussed, which establishes the relation between the raw speech signal and phones in a data-driven manner and performs better than traditional cepstral-feature-based systems.

On Architectures and Training for Raw Waveform Feature Extraction in ASR

This work investigates the usefulness of one of these front-end frameworks, namely wav2vec, in a setting without additional untranscribed data for hybrid ASR systems, and studies the benefits of using the pre-trained feature extractor and how to additionally exploit an existing acoustic model trained with different features.

Multi-Span Acoustic Modelling using Raw Waveform Signals

A novel multi-span structure for acoustic modelling based on the raw waveform, with multiple streams of CNN input layers each processing a different span of the raw waveform signal, is proposed.

Multi-scale feature based convolutional neural networks for large vocabulary speech recognition

  • Tong Fu, Xihong Wu
  • Computer Science
    2017 IEEE International Conference on Multimedia and Expo (ICME)
  • 2017
A multi-scale method to mitigate the tradeoff and a model architecture that enables analyzing speech at multiple scales are proposed; experimental results show that the proposed architecture obtains significant performance improvements.

End-to-End Acoustic Modeling Using Convolutional Neural Networks


An end-to-end acoustic modeling approach using convolutional neural networks, where the CNN takes the raw speech signal as input and estimates the HMM-state class conditional probabilities at the output, which consistently yields a better system with fewer parameters when compared to the conventional approach of cepstral feature extraction followed by ANN training.



Acoustic modeling with deep neural networks using raw time signal for LVCSR

Inspired by the multi-resolution analysis layer learned automatically from raw time signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP and Gammatone features.

Improvements to Deep Convolutional Neural Networks for LVCSR

A deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features is conducted and an effective strategy to use dropout during Hessian-free sequence training is introduced.

Convolutional Neural Networks-based continuous speech recognition using raw speech signal

The studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with as many parameters, and that the features learned from raw speech by the CNN-based approach generalize across different databases.

Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition

The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
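The frequency-domain max-pooling mentioned above can be sketched in a few lines of numpy: pooling along the frequency axis only makes the model tolerant to small spectral shifts such as speaker-dependent formant positions. Pool size and map shape below are made up for illustration.

```python
import numpy as np

def max_pool_frequency(fmap, pool):
    """Max-pool a (freq, time) feature map along the frequency axis only.

    fmap: (F, T) feature map; pool: number of adjacent bands merged.
    Bands that do not fill a full pool window are dropped for simplicity.
    """
    F, T = fmap.shape
    F_out = F // pool
    # group adjacent frequency bands, then take the max within each group
    return fmap[: F_out * pool].reshape(F_out, pool, T).max(axis=1)

spec = np.arange(24, dtype=float).reshape(6, 4)  # 6 frequency bands, 4 frames
pooled = max_pool_frequency(spec, pool=3)
print(pooled.shape)  # (2, 4)
```

Note that pooling is deliberately not applied along time here; in these hybrid NN-HMM models the frame rate is kept intact for the HMM alignment.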

Robust CNN-based speech recognition with Gabor filter kernels

A neural network architecture called a Gabor Convolutional Neural Network (GCNN) is proposed that incorporates Gabor functions into convolutional filter kernels and performs better than other noise-robust features that have been tried, namely ETSI-AFE, PNCC, Gabor features without the CNN-based approach, and the best neural network features that do not incorporate Gabor functions.

Learning filter banks within a deep neural network framework

This paper replaces the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network, and shows that on a 50-hour English Broadcast News task a 5% relative improvement in word error rate can be achieved with the filter bank learning approach.

RASR/NN: The RWTH neural network toolkit for speech recognition

The results show that RASR achieves a state-of-the-art performance on a real-world large vocabulary task, while offering a complete pipeline for building and applying large scale speech recognition systems.

Phoneme recognition using time-delay neural networks

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing…
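The core TDNN idea, in which each unit sees a short window of delayed input frames, amounts to a 1D convolution over time. A minimal numpy sketch (all dimensions hypothetical, not from the original paper):

```python
import numpy as np

def tdnn_layer(frames, weights, bias):
    """One TDNN layer: each output frame depends on a short window of
    input frames (the 'time delays'), i.e. a 1D convolution over time.

    frames : (T, D) input feature frames
    weights: (H, C, D) H hidden units, each seeing a context of C frames
    bias   : (H,)
    """
    T, D = frames.shape
    H, C, _ = weights.shape
    out = np.empty((T - C + 1, H))
    for t in range(T - C + 1):
        ctx = frames[t : t + C]  # (C, D) window of delayed frames
        # contract each unit's (C, D) weight block against the window
        out[t] = np.tanh(
            np.tensordot(weights, ctx, axes=([1, 2], [0, 1])) + bias
        )
    return out

x = np.random.default_rng(1).standard_normal((20, 16))    # 20 frames, 16 feats
w = np.random.default_rng(2).standard_normal((8, 3, 16))  # 8 units, 3-frame ctx
h = tdnn_layer(x, w, np.zeros(8))
print(h.shape)  # (18, 8)
```

Because the same weights are shifted across time, the learned features are position-invariant over the utterance, which is the property the TDNN paper exploits for phoneme recognition.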

Neural network acoustic models for the DARPA RATS program

It is found that both Multi-Layer Perceptrons and Convolutional Neural Networks (CNNs) outperform Gaussian Mixture Models (GMMs) on this task.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.