Corpus ID: 3615351

End2You - The Imperial Toolkit for Multimodal Profiling by End-to-End Learning

Panagiotis Tzirakis, Stefanos Zafeiriou, Björn Schuller
We introduce End2You, the Imperial College London toolkit for multimodal profiling by end-to-end deep learning. End2You is an open-source toolkit implemented in Python and based on TensorFlow. It provides capabilities to train and evaluate models in an end-to-end manner, i.e., using raw input. It supports input from raw audio, visual, physiological, or other types of information, or a combination of these, and the output can be of an arbitrary representation, for either classification or…
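The "end-to-end, raw input" idea the abstract describes can be illustrated with a minimal NumPy forward pass. This is purely a sketch of the concept, not End2You's actual API; all names, shapes, and the random weights are illustrative: raw audio samples feed learnable 1-D filters, the resulting frames are pooled over time, and a linear head maps the pooled features to class scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels, stride=4):
    """Naive valid 1-D convolution: x of shape (n,) -> (frames, n_kernels)."""
    k = kernels.shape[1]
    frames = (len(x) - k) // stride + 1
    out = np.empty((frames, kernels.shape[0]))
    for i in range(frames):
        out[i] = kernels @ x[i * stride:i * stride + k]
    return out

def end_to_end_forward(waveform, n_classes=4):
    """Raw samples -> conv features -> ReLU -> temporal mean-pool -> class scores."""
    kernels = rng.standard_normal((8, 40)) * 0.1       # 8 learnable filters, 40 samples each
    feats = np.maximum(conv1d(waveform, kernels), 0.0)  # ReLU non-linearity
    pooled = feats.mean(axis=0)                         # collapse the time axis
    w = rng.standard_normal((n_classes, 8)) * 0.1       # linear classifier head
    return w @ pooled

# 1 second of fake 16 kHz "audio" straight into the model, no hand-crafted features
scores = end_to_end_forward(rng.standard_normal(16000))
```

In a real end-to-end system the filter and head weights are learned jointly by backpropagation; the point here is only that nothing between the waveform and the scores is hand-designed.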


End2You: Multimodal Profiling by End-to-End Learning and Applications

End2You, an open-source toolkit implemented in Python and based on TensorFlow, provides capabilities to train and evaluate models in an end-to-end manner, i.e., using raw input, for either classification or regression tasks.

deepSELF: An Open Source Deep Self End-to-End Learning Framework

The proposed deepSELF toolkit can be used to analyse a variety of multi-modal signals, including images, audio, and single- or multi-channel sensor data, and can be used flexibly not only as a single model but also as a fusion of several such models.

An End-to-End Deep Learning Framework for Speech Emotion Recognition of Atypical Individuals

This work presents three modeling methods under the end-to-end learning framework, namely a CNN combined with extended features, a CNN+RNN, and a ResNet, and investigates multiple data augmentation, balancing, and sampling methods to further enhance system performance.

Continuous-Time Audiovisual Fusion with Recurrence vs. Attention for In-The-Wild Affect Recognition

An extensive evaluation of the audiovisual fusion models shows that LSTM-RNNs can outperform the attention models when coupled with low-complexity CNN backbones and trained in an end-to-end fashion, implying that attention models may not necessarily be the optimal choice for continuous-time multimodal emotion recognition.

Time-Continuous Audiovisual Fusion with Recurrence vs Attention for In-The-Wild Affect Recognition

This paper presents a comprehensive evaluation of fusion models based on LSTM-RNNs, self-attention, and cross-modal attention, trained for valence and arousal estimation, and indicates that attention models may not necessarily be the optimal choice for time-continuous multimodal fusion for emotion recognition.

Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

This work applies an ensemble of E2E models for robust performance on the ComParE 2020 tasks, develops task-specific modifications for each task, and investigates the impact of multi-loss strategies on task performance.

Comparison of Artificial Neural Network Types for Infant Vocalization Classification

A unified neural network architecture scheme for audio classification is defined, from which various network types are derived; the most influential architectural hyperparameter for all types was the integration operation for reducing tensor dimensionality between network stages.

Mask Detection and Breath Monitoring from Speech: on Data Augmentation, Feature Representation and Modeling

This paper introduces the approaches for the Mask and Breathing Sub-Challenges in the INTERSPEECH ComParE Challenge 2020, and investigates different bottleneck features based on the Bi-LSTM structure.

Deep Learning in Paralinguistic Recognition Tasks: Are Hand-crafted Features Still Relevant?

This year's INTERSPEECH Computational Paralinguistics Challenge is taken as an opportunity to approach this issue by means of two corpora, Atypical Affect and Crying, training a Recurrent Neural Network to evaluate the performance of several hand-crafted feature sets of varying complexity.



End-to-End Multimodal Emotion Recognition Using Deep Neural Networks

This work proposes an emotion recognition system using auditory and visual modalities: a convolutional neural network extracts features from the speech, while for the visual modality a deep residual network of 50 layers is used.

Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

This paper proposes a solution to the problem of 'context-aware' emotionally relevant feature extraction, by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
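The residual idea summarized above can be shown in a few lines of NumPy. This is a toy sketch of a residual block, not the paper's full ResNet: the block computes y = F(x) + x, so when the learned mapping F is zero the block reduces exactly to the identity, which is what makes very deep stacks of such blocks easy to optimize.

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = F(x) + x: two linear maps with a ReLU in between, plus an identity shortcut."""
    h = np.maximum(w1 @ x, 0.0)  # first transform + ReLU
    return w2 @ h + x            # shortcut adds the input back unchanged

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
w1 = rng.standard_normal((16, 16)) * 0.01
w2 = np.zeros((16, 16))  # with zero weights, F(x) = 0 and the block is the identity
y = residual_block(x, w1, w2)
```

Because the shortcut is parameter-free, gradients can flow through it directly, easing the training of networks substantially deeper than before.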

Deep learning for robust feature generation in audiovisual emotion recognition

A suite of Deep Belief Network models is proposed and evaluated, and it is demonstrated that these models improve emotion classification performance over baselines that do not employ deep learning, suggesting that the learned high-order non-linear relationships are effective for emotion recognition.

AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge

The challenge guidelines, the common data used, and the performance of the baseline system on the two tasks are presented, to establish to what extent fusion of the approaches is possible and beneficial.

A Fast Learning Algorithm for Deep Belief Nets

A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

Advanced recurrent units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU), are evaluated on sequence modeling tasks; the GRU is found to be comparable to the LSTM.
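The gating mechanism this comparison refers to can be made concrete with one GRU step in NumPy, following the standard formulation (update gate z, reset gate r, candidate state h̃); this is a didactic sketch, not production RNN code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: h_new = (1 - z) * h + z * h_tilde."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate: how much new state to admit
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate: how much old state feeds the candidate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

# With all-zero weights the update is analytic: z = 0.5, h_tilde = 0, so h_new = 0.5 * h
dim = 4
Z = [np.zeros((dim, dim)) for _ in range(6)]
h0 = np.ones(dim)
h1 = gru_cell(np.zeros(dim), h0, *Z)
```

The gates are what let these units keep or discard information over long sequences, in contrast to a plain tanh recurrence.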

Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions

A new multimodal corpus of spontaneous collaborative and affective interactions in French, RECOLA, is presented and made available to the research community; self-report measures of users were collected during task completion.

Mel Frequency Cepstral Coefficients for Music Modeling

The results show that the use of the Mel scale for modeling music is at least not harmful for this problem, although further experimentation is needed to verify that this is the optimal scale in the general case and whether this transform is valid for music spectra.
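The mel-scale cepstral pipeline discussed above (power spectrum, triangular mel filterbank, log, DCT-II) can be sketched for a single frame in plain NumPy. This is a simplified illustration of the standard MFCC recipe, not the paper's exact configuration; the filterbank size, cepstrum count, and window handling are illustrative choices:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_mels=26, n_ceps=13):
    """MFCCs for one frame: power spectrum -> mel filterbank -> log -> DCT-II."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Triangular filters spaced evenly on the mel scale
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fbank = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rise = (freqs - lo) / (mid - lo)
        fall = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    log_e = np.log(fbank @ power + 1e-10)  # log filterbank energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return basis @ log_e

# 13 coefficients for a 32 ms frame of a 440 Hz tone
ceps = mfcc_frame(np.sin(2 * np.pi * 440 * np.arange(512) / 16000))
```

Such hand-crafted coefficients are exactly the kind of intermediate representation that end-to-end systems like End2You aim to learn directly from the raw waveform instead.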

Perceptual linear predictive (PLP) analysis of speech.

H. Hermansky · The Journal of the Acoustical Society of America · 1990

A new technique for the analysis of speech, the perceptual linear predictive (PLP) technique, is presented; it uses three concepts from the psychophysics of hearing to derive an estimate of the auditory spectrum and yields a low-dimensional representation of speech.