A Comparison of Transformer, Convolutional, and Recurrent Neural Networks on Phoneme Recognition
@article{Shim2022ACO,
  title={A Comparison of Transformer, Convolutional, and Recurrent Neural Networks on Phoneme Recognition},
  author={Kyuhong Shim and Wonyong Sung},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.00367}
}
Phoneme recognition is an important part of speech recognition that requires the ability to extract phonetic features from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models on phoneme recognition. For CNN, the ContextNet model is used for the experiments. First, we compare the accuracy of various architectures under different constraints, such as the receptive field length, parameter size, and layer depth. Second, we interpret the…
References
Showing 1–10 of 39 references
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
- Computer Science · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
A new end-to-end neural acoustic model for automatic speech recognition that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models.
ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context
- Computer Science · INTERSPEECH
- 2020
This paper proposes a simple scaling method that scales the widths of ContextNet to achieve a good trade-off between computation and accuracy, and demonstrates that, on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate of 2.1%/4.6%.
Neural Speech Synthesis with Transformer Network
- Computer Science · AAAI
- 2019
This paper introduces and adapts the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance and close-to-human quality.
A Comparative Study on Transformer vs RNN in Speech Applications
- Computer Science · 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
The emergent sequence-to-sequence model Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; surprisingly, Transformer also proves superior to RNN in 13 of 15 ASR benchmarks.
Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions
- Computer Science · INTERSPEECH
- 2019
We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient…
Probing Acoustic Representations for Phonetic Properties
- Computer Science · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2021
This work compares features from two conventional and four pre-trained systems on simple frame-level phonetic classification tasks, with classifiers trained on features from one version of the TIMIT dataset and tested on another, to uncover the relative strengths of various proposed acoustic representations.
Phoneme recognition: neural networks vs. hidden Markov models
- Computer Science · ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing
- 1988
A time-delay neural network for phoneme recognition that was able to invent, without human intervention, meaningful linguistic abstractions in time and frequency, such as formant tracking and segmentation, and that does not rely on precise alignment or segmentation of the input.
TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers…
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
- Computer Science, Physics · ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats the state-of-the-art accuracy on the LibriSpeech benchmarks.
Phoneme recognition using time-delay neural networks
- Computer Science · IEEE Trans. Acoust. Speech Signal Process.
- 1989
The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing…