Non-linear frequency warping using constant-Q transformation for speech emotion recognition

Premjeet Singh, Goutam Saha, Md. Sahidullah
2021 International Conference on Computer Communication and Informatics (ICCCI)

In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). CQT-based time-frequency analysis provides variable spectro-temporal resolution, with higher frequency resolution at lower frequencies. Since the lower-frequency regions of the speech signal contain more emotion-related information than the higher-frequency regions, the increased low-frequency resolution of the CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a…
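
The resolution argument in the abstract can be illustrated with a minimal sketch. It assumes the textbook CQT definition, in which centre frequencies are geometrically spaced and each bin's bandwidth is its centre frequency divided by a fixed quality factor Q. The parameter values (`f_min`, `bins_per_octave`, `n_bins`, the 16 kHz sample rate and the 512-point FFT) are illustrative assumptions, not settings taken from the paper, and the helper names are hypothetical.

```python
import math

def cqt_bins(f_min, bins_per_octave, n_bins):
    """Centre frequencies, bandwidths and Q of a constant-Q filter bank.

    Textbook definition: f_k = f_min * 2**(k / B), Q = 1 / (2**(1 / B) - 1),
    bandwidth_k = f_k / Q. All parameter values are assumptions.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]
    bandwidths = [f / q for f in freqs]
    return freqs, bandwidths, q

def stft_bin_spacing(sample_rate, n_fft):
    """Uniform frequency spacing of STFT bins, in Hz."""
    return sample_rate / n_fft

# Hypothetical settings chosen only for illustration.
freqs, bandwidths, q = cqt_bins(f_min=32.7, bins_per_octave=12, n_bins=84)
stft_df = stft_bin_spacing(sample_rate=16000, n_fft=512)

print(f"Q factor: {q:.2f}")
print(f"CQT bandwidth at {freqs[0]:.1f} Hz: {bandwidths[0]:.2f} Hz")
print(f"CQT bandwidth at {freqs[-1]:.1f} Hz: {bandwidths[-1]:.2f} Hz")
print(f"STFT bin spacing (all frequencies): {stft_df:.2f} Hz")
```

Under these assumed settings, the lowest CQT bins are much narrower than the uniform STFT bins and the highest are much wider, which is the variable-resolution property the abstract refers to.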


Modulation spectral features for speech emotion recognition using deep neural networks

Deep scattering network for speech emotion recognition

This paper introduces the scattering transform for speech emotion recognition (SER) and investigates layer-wise scattering coefficients to analyse the importance of time-shift- and deformation-stable scalogram and modulation spectrum coefficients for SER.

Attention Based Convolutional Neural Network with Multi-frequency Resolution Feature for Environment Sound Classification

A novel multi-frequency resolution (MFR) feature is proposed in this paper to solve the problem that the existing single frequency resolution time–frequency features of sound cannot effectively express the characteristics of multiple types of sound.

Convolution-Vision Transformer for Automatic Lung Sound Classification

This work proposes a hybrid Convolution-Vision Transformer architecture that explores the use of convolution with Vision Transformers in a single system and evaluates the effectiveness of this method on the ICBHI 2017 database.



Amplitude-Frequency Analysis of Emotional Speech Using Transfer Learning and Classification of Spectrogram Images

This study used an indirect approach to provide insights into the amplitude-frequency characteristics of different emotions in order to support the development of future, more efficiently differentiating SER methods.

Speech emotion recognition with deep convolutional neural networks

Formant position based weighted spectral features for emotion recognition

Synthetic speech detection using fundamental frequency variation and spectral features

A comparative study of traditional and newly proposed features for recognition of speech under stress

The results show that unlike fast Fourier transform's (FFT) immunity to noise, the linear prediction power spectrum is more immune than FFT to stress as well as to a combination of a noisy and stressful environment.

Towards a standard set of acoustic features for the processing of emotion in speech.

Researchers concerned with the automatic recognition of human emotion in speech have proposed a considerable variety of segmental and supra-segmental acoustic descriptors. These range from prosodic

Multiscale Amplitude Feature and Significance of Enhanced Vocal Tract Information for Emotion Classification

A novel multiscale amplitude feature is proposed using multiresolution analysis (MRA), and the significance of the vocal tract is investigated for emotion classification from the speech signal; the proposed feature outperforms the other features.

The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing

A basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis, is proposed and intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters.

Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching

This paper explores how to utilize a DCNN to bridge the affective gap in speech signals, and finds that a DCNN model pretrained for image applications performs reasonably well in affective speech feature extraction.