Fusing Acoustic Feature Representations for Computational Paralinguistics Tasks

Heysem Kaya and Alexey Karpov
The field of Computational Paralinguistics is rapidly growing and is of interest in various application domains ranging from biomedical engineering to forensics. The INTERSPEECH ComParE challenge series has a field-leading role, introducing novel problems with a common benchmark protocol for comparability. In this work, we tackle all three ComParE 2016 Challenge corpora (Native Language, Sincerity and Deception) benefiting from multi-level normalization on features followed by fast and robust… 

Implementing Fusion Techniques for the Classification of Paralinguistic Information
This work tests several classification techniques and acoustic features, combines them using late fusion to classify paralinguistic information for the ComParE 2018 challenge, and proposes raw-waveform convolutional neural networks (CNNs) for three paralinguistic sub-challenges.
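The summary above does not say which late-fusion rule is used. As a minimal sketch, assuming a weighted average of per-class scores from each classifier, late fusion might look like the following. The function name, the example scores, and the equal weights are all hypothetical.

```python
def late_fusion(score_sets, weights):
    """Fuse per-class scores from several classifiers by weighted averaging.

    score_sets: one list of per-class scores per classifier (same class order).
    weights:    one non-negative weight per classifier (assumed, e.g. tuned on
                a development set); they are normalized to sum to 1 here.
    Returns the fused scores and the index of the winning class.
    """
    if len(score_sets) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(score_sets[0])
    fused = [0.0] * n_classes
    for scores, w in zip(score_sets, weights):
        if len(scores) != n_classes:
            raise ValueError("all classifiers must score the same classes")
        for c in range(n_classes):
            fused[c] += (w / total) * scores[c]
    best = max(range(n_classes), key=lambda c: fused[c])
    return fused, best


# Hypothetical scores for one utterance from two classifiers, two classes.
scores_functionals = [0.6, 0.4]
scores_cnn = [0.3, 0.7]
fused, predicted = late_fusion([scores_functionals, scores_cnn], [1.0, 1.0])
```

With equal weights the fused decision follows the more confident classifier; unequal weights let a stronger system dominate.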
Introducing Weighted Kernel Classifiers for Handling Imbalanced Paralinguistic Corpora: Snoring, Addressee and Cold
This work proposes a new instance-weighting scheme for the original dataset using the Weighted Kernel Extreme Learning Machine and, inspired by it, introduces a classifier based on Weighted Partial Least Squares Regression.
Computational Paralinguistics: Automatic Assessment of Emotions, Mood and Behavioural State from Acoustics of Speech
This paper addresses the Interspeech 2018 Computational Paralinguistics Challenge (ComParE), which aims to push the boundaries of sensitivity to non-textual information that is conveyed in the acoustics of speech, and posits that a substantial amount of paralinguistic information is contained in spectral features alone.
Combining Clustering and Functionals based Acoustic Feature Representations for Classification of Baby Sounds
This paper investigates different fusion strategies and provides insights into their effectiveness alongside standalone classifiers for the paralinguistic analysis of infant vocalizations; it outperforms the challenge baseline Unweighted Average Recall (UAR) score and achieves a result comparable to the state of the art.
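UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch of the computation follows; the class labels and predictions are made up for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: four "cry" samples, two "laugh" samples.
y_true = ["cry", "cry", "cry", "cry", "laugh", "laugh"]
y_pred = ["cry", "cry", "cry", "laugh", "laugh", "cry"]
uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the recall is 0.75 for "cry" and 0.5 for "laugh", so UAR is 0.625 while plain accuracy is 4/6; the gap grows as the class imbalance grows.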
The most efficient computer-based system for detection and classification of the corresponding acoustical paralinguistic events is developed, and the architecture of this system, its main modules and methods are described.
Automatic Detection of Speech Under Cold Using Discriminative Autoencoders and Strength Modeling with Multiple Sub-Dictionary Generation
This paper presents two frameworks, one of them based on an alternative neural network-based autoencoder using two different loss functions, and another based on strength modeling, where diverse classifiers' confidence outputs are concatenated to original feature space as input to the support vector machine.
Video-based emotion recognition in the wild
LSTM Based Cross-corpus and Cross-task Acoustic Emotion Recognition
This work investigates the suitability of Long-Short-Term-Memory models trained with time- and space-continuously annotated affective primitives for cross-corpus acoustic emotion recognition and employs an effective approach to use the frame level valence and arousal predictions of LSTM models for utterance level affect classification.
Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition
This study demonstrates that exploiting task-specific dictionaries and resources can boost the performance of linguistic models, when the amount of labeled data is small, and proposes a bi-modal framework, where these tasks are modeled using state-of-the-art acoustic and linguistic features.


Fisher vectors with cascaded normalization for paralinguistic analysis
This paper addresses the variability compensation issue by proposing a novel method composed of i) Fisher vector encoding of low level descriptors extracted from the signal, ii) speaker z-normalization applied after speaker clustering, and iii) non-linear normalization of features.
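Steps ii) and iii) of the pipeline above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: speaker labels are taken as given (in the paper they come from a speaker clustering step), Fisher vector encoding is omitted, and the non-linear normalization is assumed to be signed power normalization followed by L2 normalization, a common choice for Fisher vectors that the summary does not confirm.

```python
import math
from collections import defaultdict


def speaker_z_normalize(features, speaker_ids):
    """Standardize each feature dimension with its own speaker's mean and std."""
    by_speaker = defaultdict(list)
    for row, spk in zip(features, speaker_ids):
        by_speaker[spk].append(row)
    stats = {}
    for spk, rows in by_speaker.items():
        n, dims = len(rows), len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        stds = [math.sqrt(sum((r[d] - means[d]) ** 2 for r in rows) / n)
                for d in range(dims)]
        stats[spk] = (means, stds)
    normalized = []
    for row, spk in zip(features, speaker_ids):
        means, stds = stats[spk]
        # A constant dimension (std of 0) is mapped to 0 to avoid division by zero.
        normalized.append([(x - m) / s if s > 0 else 0.0
                           for x, m, s in zip(row, means, stds)])
    return normalized


def power_normalize(row, alpha=0.5):
    """Signed power normalization: sign(x) * |x|**alpha (alpha is an assumption)."""
    return [math.copysign(abs(x) ** alpha, x) for x in row]


def l2_normalize(row):
    """Scale a vector to unit Euclidean length; zero vectors are returned as-is."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


# Hypothetical utterance-level features for two speakers with different ranges.
features = [[1.0, 10.0], [3.0, 30.0], [100.0, 5.0], [104.0, 9.0]]
speakers = ["a", "a", "b", "b"]
z = speaker_z_normalize(features, speakers)
cascaded = [l2_normalize(power_normalize(row)) for row in z]
```

After speaker z-normalization the two speakers' features share the same scale, which is the variability compensation the summary refers to.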
Random Discriminative Projection Based Feature Selection with Application to Conflict Recognition
A recent discriminative projection based feature selection method is extended using the power of stochasticity to overcome local minima and to reduce the computational complexity of paralinguistic speech analysis.
The INTERSPEECH 2010 paralinguistic challenge
The INTERSPEECH 2010 Paralinguistic Challenge shall help overcome the usually low compatibility of results by addressing three selected sub-challenges.
The INTERSPEECH 2015 computational paralinguistics challenge: nativeness, Parkinson's & eating condition
Three sub-challenges are described: the estimation of the degree of nativeness, the neurological state of patients with Parkinson’s condition, and the eating conditions of speakers.
Combining modality-specific extreme learning machines for emotion recognition in the wild
This paper proposes extreme learning machines (ELM) for modeling audio and video features for emotion recognition under uncontrolled conditions and compares ELM with partial least squares regression based classification that is used in the top performing system of EmotiW 2014, and discusses the advantages of both approaches.
The INTERSPEECH 2016 Computational Paralinguistics Challenge: Deception, Sincerity & Native Language
The INTERSPEECH 2016 Computational Paralinguistics Challenge addresses three different problems for the first time in research competition under well-defined conditions: classification of deceptive
The INTERSPEECH 2009 emotion challenge
The challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech, and the FAU Aibo Emotion Corpus are introduced.
Fisher Kernels on Visual Vocabularies for Image Categorization
F. Perronnin, C. Dance. 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.
This work shows that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms, and proposes to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images.
Recent developments in openSMILE, the Munich open-source multimedia feature extractor
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features
Improving the Fisher Kernel for Large-Scale Image Classification
In an evaluation involving hundreds of thousands of training images, it is shown that classifiers learned on Flickr groups perform surprisingly well and that they can complement classifiers learned on more carefully annotated datasets.