Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content

@inproceedings{yu2017temporally,
  title={Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content},
  author={Hongliang Yu and Liangke Gui and Michael A. Madaio and Amy E. Ogan and Justine Cassell and Louis-Philippe Morency},
  booktitle={Proceedings of the 25th ACM International Conference on Multimedia},
  year={2017}
}
The sheer amount of human-centric multimedia content has led to increased research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, named the Temporally Selective Attention Model (TSAM), designed to selectively…
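The abstract describes attending selectively to the most relevant time steps of a behavioral sequence. As a rough illustration only, soft temporal attention can be sketched as a softmax over per-time-step relevance scores followed by weighted pooling. The dot-product scoring function and all names below are illustrative assumptions, not the paper's exact TSAM formulation.

```python
import math

def temporal_attention(features, w):
    """features: list of T d-dimensional vectors; w: d-dimensional scoring vector.
    Returns (attention weights over time, attended summary vector)."""
    # Relevance score per time step (assumed dot-product scoring).
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in features]
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax over time steps
    # Weighted temporal pooling of the feature sequence.
    d = len(features[0])
    summary = [sum(a * f[j] for a, f in zip(weights, features)) for j in range(d)]
    return weights, summary

feats = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]     # 3 time steps, 2-dim features
alpha, z = temporal_attention(feats, [1.0, 1.0])
# The third time step scores highest, so it receives most of the attention mass.
```

In practice the scores would come from a learned network over recurrent hidden states rather than a fixed vector, but the normalize-then-pool pattern is the same.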


Spatio-Temporal Attention Model Based on Multi-view for Social Relation Understanding

A novel Spatio-Temporal attention model based on Multi-View (STMV) for understanding social relations from video achieves the state-of-the-art performance on the SRIV video dataset for social relation classification.

Recognizing Social Signals with Weakly Supervised Multitask Learning for Multimodal Dialogue Systems

This paper introduces weakly supervised learning (WSL) algorithms for this inaccurate-supervision setting, in which target labels are not necessarily accurate, and shows that the proposed approach suffers less accuracy degradation than an existing DNN training algorithm in a cross-corpus setting.

Multi-Attention Multimodal Sentiment Analysis

A model of Multi-Attention Recurrent Neural Network (MA-RNN) for performing sentiment analysis on multimodal data that achieves the state-of-the-art performance on the Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis dataset.

Effective Sentiment-relevant Word Selection for Multi-modal Sentiment Analysis in Spoken Language

This paper proposes a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis, focusing on both the textual and acoustic modalities, and employs a deep reinforcement learning mechanism to make the selections.

Multimodal Local-Global Ranking Fusion for Emotion Recognition

This work approaches emotion recognition from both direct person-independent and relative person-dependent perspectives and displays excellent performance on an audio-visual emotion recognition benchmark and improves over other algorithms for multimodal fusion.

Modeling the Clause-Level Structure to Multimodal Sentiment Analysis via Reinforcement Learning

A novel approach to multimodal sentiment analysis with focus on both textual and acoustic modalities is proposed, utilizing deep reinforcement learning to explore the clause-level structure in an utterance.

Unified Multi-Modal Multi-Task Joint Learning for Language-Vision Relation Inference

  • Wenjie Lu, Dong Zhang
  • Computer Science
    2022 IEEE International Conference on Multimedia and Expo (ICME)
  • 2022
This paper mainly focuses on LVRI over text-image pairs from Twitter with a unified multi-modal multi-task joint learning approach, which leverages a related multi-modal task on an external dataset as an auxiliary task to facilitate LVRI.

Multimodal Sentiment Analysis via RNN variants

This paper proposes four RNN variants, namely GRNN, LRNN, GLRNN, and UGRNN, for analyzing speaker utterances from videos, achieving better sentiment classification accuracy on individual modalities than existing approaches on the same dataset.

A survey of neural models for the automatic analysis of conversation: Towards a better integration of the social sciences

This paper surveys neural architectures for detecting emotion, dialogue acts, and sentiment polarity, and describes what the authors believe to be the most fundamental and definitional feature of conversation: its co-construction over time by two or more interlocutors.

EmoNets: Multimodal deep learning approaches for emotion recognition in video

This paper explores multiple methods for combining cues from multiple modalities into one common classifier, which achieves considerably greater accuracy than predictions from the strongest single-modality classifier.

Temporal Attention-Gated Model for Robust Sequence Classification

The Temporal Attention-Gated Model (TAGM) is presented which integrates ideas from attention models and gated recurrent networks to better deal with noisy or unsegmented sequences.

Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

This paper proposes a solution to the problem of context-aware, emotionally relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time-domain signal.

Select-additive learning: Improving generalization in multimodal sentiment analysis

This paper proposes a Select-Additive Learning (SAL) procedure that improves the generalizability of trained neural networks for multimodal sentiment analysis and shows that this approach improves prediction accuracy significantly in all three modalities (verbal, acoustic, visual), as well as in their fusion.

Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis

A Select-Additive Learning (SAL) procedure that improves the generalizability of trained discriminative neural networks and increases prediction accuracy significantly in all three modalities (text, audio, video), as well as in their fusion.

Predicting Personalized Emotion Perceptions of Social Images

Rolling multi-task hypergraph learning is presented to consistently combine these factors and a learning algorithm is designed for automatic optimization to predict the personalized emotion perceptions of images for each individual viewer.

Emotion spotting: discovering regions of evidence in audio-visual emotion expressions

A data-driven framework to explore patterns (timings and durations) of emotion evidence specific to individual emotion classes; it is demonstrated that these patterns vary as a function of which modality (lower face, upper face, or speech) is examined, and that consistent patterns emerge across different experimental folds.

LSTM for dynamic emotion and group emotion recognition in the wild

This paper extracts acoustic, LBP-TOP, Dense SIFT, and CNN-LSTM features to recognize the emotions of film characters, and uses a fusion network to combine all extracted features at the decision level for the group-level emotion recognition sub-challenge.

Aspect Level Sentiment Classification with Deep Memory Network

A deep memory network for aspect-level sentiment classification that explicitly captures the importance of each context word when inferring the sentiment polarity of an aspect, and is also fast.