Neural Reasoning, Fast and Slow, for Video Question Answering

  @article{Le2020NeuralRF,
    title={Neural Reasoning, Fast and Slow, for Video Question Answering},
    author={Thao Minh Le and Vuong Le and Svetha Venkatesh and Truyen Tran},
    journal={2020 International Joint Conference on Neural Networks (IJCNN)},
    year={2020}
  }
What does it take to design a machine that learns to answer natural questions about a video? A Video QA system must simultaneously understand language, represent visual content over space-time, iteratively transform these representations in response to the linguistic content of the query, and finally arrive at a sensible answer. While recent advances in language and visual question answering have enabled sophisticated representations and neural reasoning mechanisms, major challenges in Video QA…
Hierarchical Conditional Relation Networks for Video Question Answering
A general-purpose reusable neural unit called Conditional Relation Network (CRN) is introduced that serves as a building block to construct more sophisticated structures for representation and reasoning over video.
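The described input-output behavior of a CRN-style unit (a set of objects in, a new set of relation-encoding objects out, modulated by a conditioning feature) can be sketched as follows. This is a simplified illustration, not the paper's exact design: subset mean-pooling and elementwise gating stand in for the learned sub-networks.

```python
import itertools
import numpy as np

def crn(objects, condition, k=2):
    """Sketch of a Conditional Relation Network-style unit: for every
    size-k subset of the input objects, pool the subset into a joint
    representation and modulate it with the conditioning feature,
    producing a new set of relation-encoding objects."""
    outputs = []
    for subset in itertools.combinations(objects, k):
        pooled = np.mean(subset, axis=0)      # aggregate the k-subset
        outputs.append(pooled * condition)    # condition via elementwise gating
    return outputs

# four 5-d input objects conditioned on a 5-d query feature
rng = np.random.default_rng(0)
objs = [rng.normal(size=5) for _ in range(4)]
query = rng.normal(size=5)
new_objs = crn(objs, query, k=2)
print(len(new_objs))  # 6 size-2 subsets from 4 objects
```

Because the unit maps a set of objects to another set of objects, such blocks can be stacked hierarchically, which is what makes the design reusable.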
BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues
This work proposes Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues that achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
Object-Centric Representation Learning for Video Question Answering
This work proposes a new query-guided representation framework to turn a video into an evolving relational graph of objects, whose features and interactions are dynamically and conditionally inferred.
From Deep Learning to Deep Reasoning
  • Truyen Tran, Vuong Le, Hung Le, Thao Minh Le
  • Computer Science
  • KDD
  • 2021
This tutorial reviews recent developments to extend the capacity of neural networks to "learning-to-reason" from data, where the task is to determine if the data entails a conclusion.
Hierarchical Conditional Relation Networks for Multimodal Video Question Answering
Hierarchical Conditional Relation Networks (HCRN) are introduced, built on a general-purpose reusable neural unit that takes as input a set of tensorial objects and translates them into a new set of objects encoding relations of the inputs, easing the commonly complex model-building process of Video QA.
Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
Analysis into the model’s behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA.
Just Ask: Learning to Answer Questions from Millions of Narrated Videos
This work proposes to automatically generate question-answer pairs from transcribed video narrations leveraging a state-of-the-art text transformer pipeline and obtain a new large-scale VideoQA training dataset with reduced language biases and high quality annotations.


Explore Multi-Step Reasoning in Video Question Answering
A new VideoQA model is developed with a novel attention module that combines a spatial attention mechanism, to address the crucial and multiple logical sub-tasks embedded in questions, with a refined GRU called ta-GRU (temporal-attention GRU) to capture long-term temporal dependency and gather complete visual cues.
Focal Visual-Text Attention for Visual Question Answering
A novel neural network called Focal Visual-Text Attention network (FVTA) is described for collective reasoning in visual question answering, where both visual and text sequence information such as images and text metadata are presented.
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
This work proposes a neural-symbolic visual question answering system that first recovers a structural scene representation from the image and a program trace from the question, then executes the program on the scene representation to obtain an answer.
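The execution stage of this neural-symbolic pipeline can be illustrated with a toy symbolic executor. Here the scene representation and the program are hand-written stand-ins for what the system's neural scene parser and program generator would produce; the operation names are illustrative, not the paper's exact vocabulary.

```python
# toy structural scene representation: a list of attribute dictionaries
scene = [{"shape": "cube", "color": "red"},
         {"shape": "sphere", "color": "red"},
         {"shape": "cube", "color": "blue"}]

def execute(program, scene):
    """Run a sequence of (op, argument) steps over the scene."""
    objs = scene
    for op, arg in program:
        if op == "filter":        # keep objects matching attribute == value
            key, val = arg
            objs = [o for o in objs if o[key] == val]
        elif op == "count":       # terminal op: answer with a number
            return len(objs)
        elif op == "query":       # terminal op: read out an attribute
            return objs[0][arg]
    return objs

# "How many red objects are there?"
print(execute([("filter", ("color", "red")), ("count", None)], scene))  # 2
```

The appeal of this factorization is that the executor itself is deterministic and interpretable; only the mapping from pixels to scene and from question to program is learned.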
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces TGIF-QA, a new large-scale dataset for video VQA that extends existing VQA work with these new tasks.
Learning to Reason: End-to-End Module Networks for Visual Question Answering
End-to-End Module Networks are proposed, which learn to reason by directly predicting instance-specific network layouts without the aid of a parser, and achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches.
Learning by Abstraction: The Neural State Machine
The Neural State Machine is introduced, seeking to bridge the gap between the neural and symbolic views of AI and integrate their complementary strengths for the task of visual reasoning, by transforming both the visual and linguistic modalities into semantic concept-based representations, thereby achieving enhanced transparency and modularity.
A simple neural network module for relational reasoning
This work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
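The RN module computes RN(O) = f_φ(Σ_{i,j} g_θ(o_i, o_j)): a shared relation function over all object pairs, summed and passed through a readout. A minimal sketch, with fixed random linear maps standing in for the learned MLPs g_θ and f_φ:

```python
import numpy as np

def relation_network(objects, g, f):
    """RN(O) = f( sum over all ordered pairs (i, j) of g(o_i, o_j) )."""
    pair_sum = sum(g(oi, oj) for oi in objects for oj in objects)
    return f(pair_sum)

rng = np.random.default_rng(0)
Wg = rng.normal(size=(8, 6))   # g: concatenated pair (3+3 dims) -> 8-d relation
Wf = rng.normal(size=(2, 8))   # f: pooled relation -> 2 answer logits
g = lambda oi, oj: np.tanh(Wg @ np.concatenate([oi, oj]))
f = lambda h: Wf @ h

objects = [rng.normal(size=3) for _ in range(4)]  # 4 objects with 3-d features
logits = relation_network(objects, g, f)
print(logits.shape)  # (2,)
```

Because the pairwise sum is order-invariant, the output does not depend on how the objects are listed, which is what lets the module treat its input as a set.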
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering
In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global…
DeepStory: Video Story QA by Deep Embedded Memory Networks
A video-story learning model, Deep Embedded Memory Networks (DEMN), is proposed to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and is shown to outperform other QA models.
Motion-Appearance Co-memory Networks for Video Question Answering
The proposed motion-appearance co-memory network is built on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
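Memory-augmented models such as DMN and the co-memory network above rely on soft attention reads over a memory bank. A generic sketch of one such read (this is the standard softmax-weighted read, not the paper's exact co-memory update):

```python
import numpy as np

def attention_read(query, memory):
    """Soft read over a memory bank: score each slot against the query,
    softmax the scores, and return the weighted sum of slots."""
    scores = memory @ query                  # (num_slots,) similarity scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ memory                  # (dim,) attended summary

rng = np.random.default_rng(0)
memory = rng.normal(size=(6, 4))  # 6 slots of 4-d features (e.g. per-frame cues)
query = rng.normal(size=4)        # question/controller state vector
read = attention_read(query, memory)
print(read.shape)  # (4,)
```

A co-memory design would maintain two such banks (motion and appearance) and let the read from one guide the attention over the other across reasoning iterations.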