A Comparison of Transformer, Convolutional, and Recurrent Neural Networks on Phoneme Recognition

Kyuhong Shim and Wonyong Sung

Phoneme recognition is a very important part of speech recognition that requires the ability to extract phonetic features from multiple frames. In this paper, we compare and analyze CNN, RNN, Transformer, and Conformer models using phoneme recognition. For CNN, the ContextNet model is used for the experiments. First, we compare the accuracy of various architectures under different constraints, such as the receptive field length, parameter size, and layer depth. Second, we interpret the…

Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions

A new end-to-end neural acoustic model for automatic speech recognition that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models.

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

This paper proposes a simple method for scaling the widths of ContextNet that achieves a good trade-off between computation and accuracy, and demonstrates that, on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate of 2.1%/4.6%.

Neural Speech Synthesis with Transformer Network

This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance close to human quality.

A Comparative Study on Transformer vs RNN in Speech Applications

Transformer, an emergent sequence-to-sequence model, achieves state-of-the-art performance in neural machine translation and other natural language processing applications; this study reports a surprising superiority of Transformer over RNN in 13 of 15 ASR benchmarks.

Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient

Probing Acoustic Representations for Phonetic Properties

This work compares features from two conventional and four pre-trained systems in some simple frame-level phonetic classification tasks, with classifiers trained on features from one version of the TIMIT dataset and tested from another, to uncover relative strengths of various proposed acoustic representations.

Phoneme recognition: neural networks vs. hidden Markov models

A time-delay neural network for phoneme recognition that was able to invent, without human interference, meaningful linguistic abstractions in time and frequency, such as formant tracking and segmentation, and that does not rely on precise alignment or segmentation of the input.

TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context

In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

  • Qian Zhang, Han Lu, Shankar Kumar
  • Computer Science, Physics
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
An end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system; the full-attention version of the model beats the state-of-the-art accuracy on the LibriSpeech benchmarks.

Phoneme recognition using time-delay neural networks

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing