DeepStory: Video Story QA by Deep Embedded Memory Networks

@article{Kim2017DeepStoryVS,
  title={DeepStory: Video Story QA by Deep Embedded Memory Networks},
  author={Kyung-min Kim and Min-Oh Heo and Seongho Choi and Byoung-Tak Zhang},
  journal={ArXiv},
  year={2017},
  volume={abs/1707.00836}
}
Question-answering (QA) on video contents is a significant challenge for achieving human-level intelligence as it involves both vision and language in real-world settings. [...] Key Method: The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset of children's cartoon video series…
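The abstract only sketches the key method. Purely as illustration (not the authors' implementation), the snippet below is a minimal PyTorch sketch of an LSTM-based attention model that scores a question-answer word sequence against a long-term memory of story-sentence embeddings; all module names, dimensions, and the scoring head are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StoryAttentionScorer(nn.Module):
    """Hypothetical sketch: an LSTM reads the (question, answer) word sequence and
    attends over a long-term memory of story-sentence embeddings, producing a
    relevance score for the question-story-answer triplet."""

    def __init__(self, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.mem_proj = nn.Linear(embed_dim, hidden_dim)   # project story sentences into LSTM space
        self.score = nn.Linear(hidden_dim, 1)               # final triplet score

    def forward(self, qa_words, story_memory):
        # qa_words:      (batch, n_words, embed_dim)  question + candidate answer embeddings
        # story_memory:  (batch, n_story, embed_dim)  long-term memory of story sentences
        states, _ = self.lstm(qa_words)                      # word-level hidden states
        query = states[:, -1]                                # last state summarizes the QA pair
        keys = self.mem_proj(story_memory)
        attn = F.softmax(torch.bmm(keys, query.unsqueeze(2)).squeeze(2), dim=1)
        recalled = torch.bmm(attn.unsqueeze(1), keys).squeeze(1)  # attended story representation
        return self.score(recalled + query).squeeze(1)       # higher = better triplet
```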
Video Question Generation via Cross-Modal Self-Attention Networks Learning
TLDR
This paper introduces a novel task of automatically generating questions given a sequence of video frames and the corresponding subtitles from a video clip, in order to reduce the huge annotation cost.
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
TLDR
This work proposes a question-guided video representation module that efficiently generates a token-level video summary guided by each word in the question, which is then fused with the question to generate the answer.
Multimodal Dual Attention Memory for Video Story Question Answering
TLDR
Ablation studies confirm that the dual attention mechanism combined with late fusion gives the best performance, and MDAM achieves new state-of-the-art results with significant margins over the runner-up models.
TVQA: Localized, Compositional Video Question Answering
TLDR
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
Motion-Appearance Co-memory Networks for Video Question Answering
TLDR
The proposed motion-appearance co-memory network is built on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
TLDR
ROLL is a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
TLDR
This work puts forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content through a Static Word Memory module and a Dynamic Subtitle Memory module, achieving state-of-the-art performance on the online evaluation task of 'Video+Subtitles'.
Adversarial Multimodal Network for Movie Story Question Answering
TLDR
In AMN, a self-attention mechanism is developed to enforce the newly introduced consistency constraint in order to preserve the self-correlation between the visual cues of the original video clips in the learned multimodal representations.
BERT Representations for Video Question Answering
TLDR
This work proposes to use BERT, a Transformer-based sequential modelling technique, to capture the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language Transformer.
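As a rough illustration of this idea (not the paper's actual pipeline), the sketch below encodes subtitles together with a sequence of detected visual-concept words using a pretrained BERT from the Hugging Face transformers library; the example strings and the choice of the [CLS] vector as the scene representation are assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

subtitles = "Pororo asks Crong where the cake went."        # hypothetical subtitle text
visual_concepts = "penguin kitchen table cake"               # e.g. detector outputs, space-joined

# Feed the subtitle stream and the visual-concept stream as a sentence pair.
inputs = tokenizer(subtitles, visual_concepts, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = bert(**inputs)
scene_repr = outputs.last_hidden_state[:, 0]                 # [CLS] vector as the scene representation
```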
Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering
TLDR
A novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) is proposed to address the aforementioned defects together and obtains state-of-the-art results on both datasets.

References

SHOWING 1-10 OF 40 REFERENCES
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
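A toy illustration of what "multimodal regularities in terms of vector space arithmetic" means, using random placeholder vectors rather than a trained model: in a joint visual-semantic space, an image embedding minus one word embedding plus another should land near the corresponding image.

```python
import numpy as np

# Toy setup: pretend word and image embeddings share one 128-d space.
rng = np.random.default_rng(0)
word = {w: rng.normal(size=128) for w in ["blue", "red", "car"]}
img_blue_car = word["blue"] + word["car"] + 0.01 * rng.normal(size=128)  # stand-in for an encoder output
img_red_car  = word["red"]  + word["car"] + 0.01 * rng.normal(size=128)

# image(blue car) - "blue" + "red" should be close to image(red car).
query = img_blue_car - word["blue"] + word["red"]
cos = query @ img_red_car / (np.linalg.norm(query) * np.linalg.norm(img_red_car))
print(f"cosine(query, image of red car) = {cos:.3f}")  # close to 1.0 in this toy setup
```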
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Pororobot: A Deep Learning Robot That Plays Video Q&A Games
TLDR
This paper proposes a prototype system for a video Q&A robot, "Pororobot", which uses state-of-the-art machine learning methods such as a deep concept hierarchy model.
Multimodal Residual Learning for Visual QA
TLDR
This work presents Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of the deep residual learning.
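For readers unfamiliar with how residual learning transfers to the multimodal setting, here is a hedged sketch of a single multimodal residual block: the question vector takes an identity shortcut, and the learned residual is an element-wise product of projected question and visual features. Dimensions and layer choices are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalResidualBlock(nn.Module):
    """Hedged sketch of a multimodal residual block for visual QA."""

    def __init__(self, q_dim=1024, v_dim=2048, hidden=1024):
        super().__init__()
        self.q_proj = nn.Sequential(nn.Linear(q_dim, hidden), nn.Tanh())
        self.v_proj = nn.Sequential(nn.Linear(v_dim, hidden), nn.Tanh())
        self.out = nn.Linear(hidden, q_dim)

    def forward(self, q, v):
        joint = self.q_proj(q) * self.v_proj(v)   # element-wise multimodal interaction
        return q + self.out(joint)                # residual shortcut on the question path
```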
MovieQA: Understanding Stories in Movies through Question-Answering
TLDR
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
End-To-End Memory Networks
TLDR
A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
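Since the DEMN abstract builds on this idea of soft attention over an external memory, a minimal single-hop sketch in the spirit of end-to-end memory networks may help; the bag-of-words sentence encoding and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryHop(nn.Module):
    """Minimal single-hop sketch: soft attention over an external memory,
    trained end-to-end with ordinary backprop (no supervision of which
    memory slot to attend to)."""

    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.A = nn.Embedding(vocab_size, dim)  # input (key) memory embeddings
        self.C = nn.Embedding(vocab_size, dim)  # output (value) memory embeddings
        self.B = nn.Embedding(vocab_size, dim)  # question embeddings

    def forward(self, memory_tokens, question_tokens):
        # memory_tokens: (batch, n_sentences, n_words); question_tokens: (batch, n_words)
        m = self.A(memory_tokens).sum(dim=2)      # bag-of-words sentence keys
        c = self.C(memory_tokens).sum(dim=2)      # sentence values
        u = self.B(question_tokens).sum(dim=1)    # question state
        p = F.softmax(torch.bmm(m, u.unsqueeze(2)).squeeze(2), dim=1)  # attention over memory
        o = torch.bmm(p.unsqueeze(1), c).squeeze(1)                    # memory read-out
        return u + o                               # next controller state
```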
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
TLDR
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
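The TLDR above describes the evaluation rather than the pooling operation itself, so here is a hedged NumPy sketch of compact bilinear pooling: the outer product of two feature vectors is approximated via count sketches and an FFT-based circular convolution. The output dimensionality, seed, and input sizes are illustrative; in practice the hash functions are fixed once and reused.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x into d dimensions using fixed random hash indices h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb(x, y, d=1024, seed=0):
    """Hedged sketch of Multimodal Compact Bilinear pooling via count sketch + FFT."""
    rng = np.random.default_rng(seed)
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1, 1], x.size), rng.choice([-1, 1], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)             # joint multimodal feature

visual, textual = np.random.rand(2048), np.random.rand(300)
joint = mcb(visual, textual)                       # 1024-d fused representation
```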
Automated Construction of Visual-Linguistic Knowledge via Concept Learning from Cartoon Videos
TLDR
This work presents the deep concept hierarchy (DCH) model, which enables the progressive abstraction of concept knowledge at multiple levels, and develops a stochastic method for graph construction, i.e. a graph Monte Carlo algorithm, to efficiently search the huge compositional space of vision-language concepts.