Corpus ID: 2610591

Analysis of CNN-based speech recognition system using raw speech as input

@inproceedings{Palaz2015AnalysisOC,
  title={Analysis of CNN-based speech recognition system using raw speech as input},
  author={Dimitri Palaz and Mathew Magimai.-Doss and Ronan Collobert},
  booktitle={INTERSPEECH},
  year={2015}
}
Abstract: Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the relationship between the raw speech signal and the phones can be directly modeled and ASR systems competitive to the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers… 
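The core idea the abstract describes, replacing hand-crafted feature extraction with a convolutional layer that filters the raw waveform directly, can be sketched in a few lines. The sketch below is purely illustrative and assumes nothing from the paper's actual architecture: the kernel, stride, pooling width, and toy sine-wave input are hypothetical stand-ins for learned filters and real speech.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Valid 1-D convolution (cross-correlation) of a raw signal with one filter."""
    out_len = (len(signal) - len(kernel)) // stride + 1
    return [sum(signal[i * stride + j] * kernel[j] for j in range(len(kernel)))
            for i in range(out_len)]

def max_pool(seq, width):
    """Non-overlapping max-pooling over a feature sequence."""
    return [max(seq[i:i + width]) for i in range(0, len(seq) - width + 1, width)]

# Toy raw "speech" signal: a 200 Hz sine sampled at 16 kHz (illustrative only).
sr = 16000
signal = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]

# In the paper the filters are trained jointly with the classifier; here a
# hand-picked moving-difference kernel stands in for a learned filter.
kernel = [1.0, 0.0, -1.0]

# Filter the raw samples, rectify, then pool, yielding a short feature
# sequence a subsequent classifier layer could consume.
features = max_pool([abs(v) for v in conv1d(signal, kernel, stride=10)], width=4)
print(len(features))  # number of pooled feature values
```

In a real system these filters are learned by backpropagation together with the phone classifier, which is precisely what distinguishes this approach from fixed cepstral front-ends.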

Citations

On Learning Vocal Tract System Related Speaker Discriminative Information from Raw Signal Using CNNs
This work designs a CNN-based system that models sub-segmental speech in the first convolution layer, with the hypothesis that such a system should learn vocal tract system related speaker discriminative information, and shows that the proposed system indeed focuses on formant regions, yields a competitive speaker verification system, and is complementary to the CNN-based system that models fundamental frequency information.
Convolutional Neural Networks for Raw Speech Recognition
A CNN-based acoustic model for the raw speech signal is discussed, which establishes the relation between the raw speech signal and phones in a data-driven manner and performs better than traditional cepstral feature-based systems.
Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNs
Inspired by the success of neural network-based approaches that directly model the raw speech signal for applications such as speech recognition, emotion recognition and anti-spoofing, a speaker verification approach is proposed in which speaker discriminative information is learned directly from the speech signal.
Development of Visual and Audio Speech Recognition Systems Using Deep Neural Networks
D. Ivanko, D. Ryumin. Proceedings of the 31st International Conference on Computer Graphics and Vision, Volume 2, 2021.
This paper designs end-to-end neural networks for a low-resource lip-reading task and an audio speech recognition task using 3D CNNs, pre-trained CNN weights from several state-of-the-art models, and LSTMs; the five most promising model architectures are selected and evaluated on the authors' own data.
A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks
An acoustic model based on CNN+CTC+Self-Attention and a corresponding language model are used to construct an end-to-end Chinese speech recognition system as a pre-training model; combining Levenshtein distance with a hashing method achieves an accuracy of more than 90% on specific phrases.
End-to-End Acoustic Modeling Using Convolutional Neural Networks for Automatic Speech Recognition (Idiap Research Report)
An end-to-end acoustic modeling approach using convolutional neural networks, where the CNN takes the raw speech signal as input and estimates HMM state class conditional probabilities at the output, consistently yields a better system with fewer parameters than the conventional approach of cepstral feature extraction followed by ANN training.
Hybrid attention convolution acoustic model based on variational autoencoder for speech recognition
The goal of the project is to build a new neural network-based acoustic model for ASR systems; the most significant difference is the introduction of an autoencoder, which turns the training of the acoustic module from supervised into semi-supervised learning.
Acoustic Modelling from the Signal Domain Using CNNs
The resulting 'direct-from-signal' network is competitive with state-of-the-art networks based on conventional features with iVector adaptation and, unlike some previous work on learned feature extractors, the objective function converges as fast as for a network based on traditional features.

References

Showing 1–10 of 28 references
Convolutional Neural Networks-based continuous speech recognition using raw speech signal
The studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with as many parameters, and that the features learned from raw speech by the CNN-based approach can generalize across different databases.
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
This paper investigates a novel approach in which the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, and indicates that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
Acoustic modeling with deep neural networks using raw time signal for LVCSR
Inspired by the multi-resolution analysis layer learned automatically from raw time signal input, the DNN is trained on a combination of multiple short-term features, illustrating how the DNN can learn from the small differences between MFCC, PLP and Gammatone features.
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
The proposed CNN architecture is applied to speech recognition within the framework of a hybrid NN-HMM model, using local filtering and max-pooling in the frequency domain to normalize speaker variance and achieve higher multi-speaker speech recognition performance.
Connectionist Speech Recognition: A Hybrid Approach
From the Publisher: Connectionist Speech Recognition: A Hybrid Approach describes the theory and implementation of a method to incorporate neural network approaches into state-of-the-art continuous speech recognition systems.
Deep convolutional neural networks for LVCSR
This paper determines the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks, and explores the behavior of neural network features extracted from CNNs on a variety of LVCSR tasks, comparing CNNs to DNNs and GMMs.
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output, which can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Investigating deep neural network based transforms of robust audio features for LVCSR
This work applies this novel feature extraction scheme to two very different tasks, i.e. a clean speech task (DARPA-WSJ) and a real-life, open-vocabulary mobile search task (Speak4itSM), consistently reporting improved performance.
Convolutional Neural Networks for Distant Speech Recognition
This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.