Audiovisual classification of vocal outbursts in human conversation using Long-Short-Term Memory networks

Abstract

We investigate the classification of non-linguistic vocalisations with a novel audiovisual approach using Long Short-Term Memory (LSTM) Recurrent Neural Networks, which are highly successful dynamic sequence classifiers. The Audiovisual Interest Corpus of natural human-to-human conversation, featured in this year's Paralinguistic Challenge, serves as the evaluation database. For video-based analysis we compare shape-based and appearance-based features. These are fused at the feature level (early fusion) with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.
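The pipeline sketched in the abstract combines early (feature-level) fusion of per-frame audio and visual descriptors with an LSTM run over the fused sequence. The following minimal, pure-Python sketch illustrates that structure; the feature dimensions, weight initialisation, and last-frame classification rule are illustrative assumptions, not the paper's actual configuration.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

class LSTMCell:
    """Minimal LSTM cell (forward pass only); weights are random for illustration."""
    def __init__(self, n_in, n_hidden):
        self.n_hidden = n_hidden
        def mat():  # one weight row per gate unit over [x, h, bias]
            return [[random.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden + 1)]
                    for _ in range(n_hidden)]
        self.Wi, self.Wf, self.Wo, self.Wc = mat(), mat(), mat(), mat()

    def step(self, x, h, c):
        xh = x + h + [1.0]  # input, previous hidden state, bias term
        i = [sigmoid(dot(w, xh)) for w in self.Wi]    # input gate
        f = [sigmoid(dot(w, xh)) for w in self.Wf]    # forget gate
        o = [sigmoid(dot(w, xh)) for w in self.Wo]    # output gate
        g = [math.tanh(dot(w, xh)) for w in self.Wc]  # candidate cell state
        c = [f[k] * c[k] + i[k] * g[k] for k in range(self.n_hidden)]
        h = [o[k] * math.tanh(c[k]) for k in range(self.n_hidden)]
        return h, c

def classify_sequence(audio_frames, visual_frames, cell, w_out):
    """Early fusion: concatenate per-frame audio and visual descriptors,
    run the LSTM over the sequence, then classify (here: from the last
    hidden state, an assumed simplification)."""
    h = [0.0] * cell.n_hidden
    c = [0.0] * cell.n_hidden
    for a, v in zip(audio_frames, visual_frames):
        fused = a + v  # early (feature-level) fusion
        h, c = cell.step(fused, h, c)
    scores = [dot(w, h + [1.0]) for w in w_out]
    return scores.index(max(scores))  # predicted class index

# toy usage: 10 frames, 4 audio + 3 visual descriptors, 2 classes
audio = [[random.random() for _ in range(4)] for _ in range(10)]
visual = [[random.random() for _ in range(3)] for _ in range(10)]
cell = LSTMCell(n_in=7, n_hidden=8)
w_out = [[random.uniform(-0.1, 0.1) for _ in range(9)] for _ in range(2)]
label = classify_sequence(audio, visual, cell, w_out)
```

With trained weights, such a cell accumulates evidence across frames, which is what lets the recurrent model outperform a static per-segment SVM on this task.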

DOI: 10.1109/ICASSP.2011.5947690

Cite this paper

@article{Eyben2011AudiovisualCO,
  title={Audiovisual classification of vocal outbursts in human conversation using Long-Short-Term Memory networks},
  author={Florian Eyben and Stavros Petridis and Bj{\"{o}}rn W. Schuller and Georgios Tzimiropoulos and Stefanos Zafeiriou and Maja Pantic},
  journal={2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2011},
  pages={5844--5847}
}