Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content

@article{Yu2017TemporallySA,
  title={Temporally Selective Attention Model for Social and Affective State Recognition in Multimedia Content},
  author={Hongliang Yu and Liangke Gui and Michael A. Madaio and Amy E. Ogan and Justine Cassell and Louis-Philippe Morency},
  journal={Proceedings of the 25th ACM international conference on Multimedia},
  year={2017}
}
The sheer amount of human-centric multimedia content has led to increased research on human behavior understanding. Most existing methods model behavioral sequences without considering temporal saliency. This work is motivated by the psychological observation that temporally selective attention enables the human perceptual system to process the most relevant information. In this paper, we introduce a new approach, named the Temporally Selective Attention Model (TSAM), designed to selectively…
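
To make the core idea concrete, the sketch below shows one way a temporally selective attention layer can sit on top of a recurrent encoder: each time step of a behavioral sequence receives a saliency score, the scores are normalized with a softmax over time, and the attention-weighted summary feeds a classifier. This is a generic PyTorch illustration under assumed names (e.g. `TemporalAttentionClassifier`), not the authors' exact TSAM formulation.

```python
# Minimal sketch of temporally selective attention over a behavioral sequence.
# Illustrative only; not the exact TSAM architecture from the paper.
import torch
import torch.nn as nn


class TemporalAttentionClassifier(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attn_score = nn.Linear(hidden_dim, 1)    # one saliency score per time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) per-frame behavioral features
        h, _ = self.encoder(x)                         # (batch, time, hidden_dim)
        scores = self.attn_score(h).squeeze(-1)        # (batch, time)
        weights = torch.softmax(scores, dim=1)         # temporal attention weights
        context = (weights.unsqueeze(-1) * h).sum(1)   # attention-weighted summary
        return self.classifier(context), weights


# Toy usage: 4 clips, 20 frames each, 32-dim features, 3 affective states.
model = TemporalAttentionClassifier(feat_dim=32, hidden_dim=64, num_classes=3)
logits, attn = model(torch.randn(4, 20, 32))
print(logits.shape, attn.shape)  # torch.Size([4, 3]) torch.Size([4, 20])
```

The returned attention weights can be inspected to see which segments of the sequence the model treats as salient, which is the behavior the paper's selective-attention motivation describes.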
Spatio-Temporal Attention Model Based on Multi-view for Social Relation Understanding
TLDR
A novel Spatio-Temporal attention model based on Multi-View (STMV) for understanding social relations from video achieves state-of-the-art performance on the SRIV video dataset for social relation classification.
Multi-Attention Multimodal Sentiment Analysis
TLDR
A Multi-Attention Recurrent Neural Network (MA-RNN) model for sentiment analysis on multimodal data is presented, achieving state-of-the-art performance on the Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis dataset.
Multi-level region-based Convolutional Neural Network for image emotion classification
TLDR
A multi-level region-based Convolutional Neural Network framework is proposed to discover the sentimental response of local regions, together with a multi-task loss function that takes into account the probabilities of images belonging to different emotion classes.
Effective Sentiment-relevant Word Selection for Multi-modal Sentiment Analysis in Spoken Language
TLDR
This paper proposes a novel approach to selecting effective sentiment-relevant words for multi-modal sentiment analysis, focusing on both the textual and acoustic modalities and employing a deep reinforcement learning mechanism for the selection.
Multimodal Local-Global Ranking Fusion for Emotion Recognition
TLDR
This work approaches emotion recognition from both direct person-independent and relative person-dependent perspectives, displays excellent performance on an audio-visual emotion recognition benchmark, and improves over other algorithms for multimodal fusion.
Modeling the Clause-Level Structure to Multimodal Sentiment Analysis via Reinforcement Learning
TLDR
A novel approach to multimodal sentiment analysis with a focus on both the textual and acoustic modalities is proposed, utilizing deep reinforcement learning to explore the clause-level structure in an utterance.
Multimodal Sentiment Analysis via RNN variants
TLDR
This paper proposes four RNN variants, namely GRNN, LRNN, GLRNN, and UGRNN, for analyzing speaker utterances from videos, achieving better sentiment classification accuracy on individual modalities than existing approaches on the same dataset.
Sentiment analysis using deep learning architectures: a review
TLDR
This paper provides a detailed survey of popular deep learning models that are increasingly applied in sentiment analysis and presents a taxonomy of sentiment analysis, which highlights the power of deep learning architectures for solving sentiment analysis problems.
Inferring stance in news broadcasts from prosodic-feature configurations
TLDR
This work identifies 14 aspects of stance that occur frequently in radio news stories and that could be useful for information retrieval, including indications of subjectivity, immediacy, local relevance, and newness.
Slices of Attention in Asynchronous Video Job Interviews
TLDR
A methodology to automatically extract slices where there is a rise of attention (attention slices) is proposed, and it is shown that such slices bear significantly more information for hirability than randomly sampled slices, and that this information is related to visual cues associated with anxiety and turn-taking.

References

Showing 1-10 of 64 references
Emotion in Context: Deep Semantic Feature Fusion for Video Emotion Recognition
TLDR
The proposed framework performs recognition in real time and achieves state-of-the-art performance on two challenging emotion recognition benchmarks: 50.6 on VideoEmotion and 51.8 on Ekman.
EmoNets: Multimodal deep learning approaches for emotion recognition in video
TLDR
This paper explores multiple methods for the combination of cues from these modalities into one common classifier, which achieves a considerably greater accuracy than predictions from the strongest single-modality classifier.
Temporal Attention-Gated Model for Robust Sequence Classification
TLDR
The Temporal Attention-Gated Model (TAGM) is presented which integrates ideas from attention models and gated recurrent networks to better deal with noisy or unsegmented sequences.
Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
TLDR
This paper proposes a solution to the problem of 'context-aware', emotionally relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation.
Select-additive learning: Improving generalization in multimodal sentiment analysis
TLDR
This paper proposes a Select-Additive Learning (SAL) procedure that improves the generalizability of trained neural networks for multimodal sentiment analysis and shows that this approach improves prediction accuracy significantly in all three modalities (verbal, acoustic, visual), as well as in their fusion.
Fusing audio, visual and textual clues for sentiment analysis from multimodal content
TLDR
This paper proposes a novel methodology for multimodal sentiment analysis that harvests sentiments from Web videos, demonstrating a model that uses the audio, visual, and textual modalities as sources of information.
Select-Additive Learning: Improving Cross-individual Generalization in Multimodal Sentiment Analysis
TLDR
A Select-Additive Learning (SAL) procedure that improves the generalizability of trained discriminative neural networks and increases prediction accuracy significantly in all three modalities (text, audio, video), as well as in their fusion.
Predicting Personalized Emotion Perceptions of Social Images
TLDR
Rolling multi-task hypergraph learning is presented to consistently combine these factors, and a learning algorithm is designed for automatic optimization to predict the personalized emotion perceptions of images for each individual viewer.
Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis
TLDR
A novel method is presented that extracts features from the visual and textual modalities using deep convolutional neural networks and significantly outperforms the state of the art in multimodal emotion recognition and sentiment analysis on different datasets.
Utterance-Level Multimodal Sentiment Analysis
TLDR
It is shown that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.