Voice Quality and Pitch Features in Transformer-Based Speech Recognition

Guillermo Cámbara, Jordi Luque, Mireia Farrús
Jitter and shimmer measurements have been shown to carry voice quality and prosodic information that enhances performance in tasks such as speaker recognition, diarization, and automatic speech recognition (ASR). However, such features have seldom been used in neural-based ASR, where spectral features often prevail. In this work, we study the effects of incorporating voice quality and pitch features, both together and separately, into a Transformer-based ASR model, with the intuition…
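
For readers unfamiliar with the two measurements, the following is a minimal sketch of the commonly used "local" (relative) definitions: jitter as cycle-to-cycle variation of the pitch period, and shimmer as cycle-to-cycle variation of the peak amplitude. The function names and the input lists are illustrative assumptions. The abstract does not state which jitter and shimmer variants the authors extract, so this should not be read as their feature pipeline.

```python
from statistics import mean


def _relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two glottal cycles")
    diffs = [abs(b - a) for a, b in zip(values[:-1], values[1:])]
    return mean(diffs) / mean(values)


def local_jitter(periods):
    """Relative local jitter from consecutive pitch periods (in seconds)."""
    return _relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Relative local shimmer from the peak amplitude of each pitch cycle."""
    return _relative_perturbation(amplitudes)


# Hypothetical values for four consecutive glottal cycles (assumed, not from the paper).
periods = [0.010, 0.011, 0.010, 0.009]
amplitudes = [1.0, 0.9, 1.1, 1.0]
print(f"jitter:  {local_jitter(periods):.3f}")
print(f"shimmer: {local_shimmer(amplitudes):.3f}")
```

Both values are dimensionless ratios and are often reported as percentages. In practice the period and amplitude sequences would come from a pitch tracker run over voiced segments.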

Convolutional Speech Recognition with Pitch and Voice Quality Features

This work combines pitch and voice quality features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs).
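
One simple way to combine such features is frame-level concatenation, sketched below. This is an assumption made for illustration: the summary above only says the features are combined, not how. The function name and the feature dimensions are hypothetical.

```python
def append_features(spectral_frames, extra_frames):
    """Concatenate per-frame extra features (e.g. pitch, jitter, shimmer)
    onto per-frame spectral features. Both inputs hold one vector per frame."""
    if len(spectral_frames) != len(extra_frames):
        raise ValueError("both feature streams must have the same number of frames")
    return [list(s) + list(e) for s, e in zip(spectral_frames, extra_frames)]


# Two frames of hypothetical 3-dimensional spectral features,
# each paired with a (pitch, jitter, shimmer) triple.
spectral = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
prosodic = [[120.0, 0.01, 0.05], [118.0, 0.02, 0.04]]
combined = append_features(spectral, prosodic)
print(len(combined), len(combined[0]))
```

The two streams must share the same frame rate for this to work, so features computed over longer windows would need to be resampled or repeated per frame first.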

Jitter and Shimmer Measurements for Speaker Diarization

The experiments carried out on the AMI corpus show that incorporating jitter and shimmer measurements to the baseline spectral features decreases the diarization error rate in most of the recordings.

Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

This paper enhances a Siamese convolutional neural network architecture by adding a multilayer perceptron network that incorporates prosodic, jitter, and shimmer features.

A pitch extraction algorithm tuned for automatic speech recognition

An algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems. These features give large performance improvements on tonal languages and substantial improvements even on non-tonal languages.

Using pitch frequency information in speech recognition

The results show that pitch frequency can indeed be used in ASR systems to improve recognition performance, and different ways of including pitch frequency in a state-of-the-art hybrid HMM/ANN system are compared.

Automatic Recognition System for Dysarthric Speech Based on MFCC’s, PNCC’s, JITTER and SHIMMER Coefficients

This paper concatenates several variants of jitter and shimmer with speech parameterization techniques to improve an automatic recognition system for dysarthric words.

Perception of aperiodicity in pathological voice.

It is suggested that jitter and shimmer are not useful as independent indices of perceived vocal quality, apart from their acoustic contributions to the overall pattern of spectrally shaped noise in a voice.

Stress and Emotion Classification using Jitter and Shimmer Features

  • Xi Li, J. Tao, J. Newman
  • Computer Science
    2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007
The appended jitter and shimmer features result in an increase in classification accuracy for several illustrative datasets, including the SUSAS dataset for human speaking styles as well as vocalizations labeled by arousal level for African elephant and Rhesus monkey species.

Residual Convolutional CTC Networks for Automatic Speech Recognition

Experimental results show that the proposed single system RCNN-CTC can achieve the lowest word error rate (WER) on WSJ and Tencent Chat data sets, compared to several widely used neural network systems in ASR.