A Read-Write Memory Network for Movie Story Understanding

@inproceedings{Na2017ARM,
  title={A Read-Write Memory Network for Movie Story Understanding},
  author={Seil Na and Sangho Lee and Jisung Kim and Gunhee Kim},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={677-685}
}
  • Seil Na, Sangho Lee, Jisung Kim, Gunhee Kim
  • Published 27 September 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
We propose a novel memory network model named Read-Write Memory Network (RWMN) to perform question answering tasks for large-scale, multimodal movie story understanding. The key focus of our RWMN model is to design the read network and the write network that consist of multiple convolutional layers, which enable memory read and write operations to have high capacity and flexibility. While existing memory-augmented network models treat each memory slot as an independent block, our use of…
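The cut-off sentence above is the heart of the method: convolutional read/write networks let the model store and recall adjacent memory slots as chunks rather than as independent blocks. Below is a minimal PyTorch sketch of a write network in that spirit; it is our own illustration, not the authors' released code, and all layer sizes, strides, and names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvWriteNetwork(nn.Module):
    """Compress n slot embeddings (n x d) into fewer memory cells, each
    summarizing a window of adjacent slots (illustrative sizes only)."""
    def __init__(self, dim=300, kernel_slots=3, stride_slots=2, channels=8):
        super().__init__()
        # The kernel spans the full feature dimension, so each output cell
        # mixes `kernel_slots` neighboring slots into one representation.
        self.conv = nn.Conv2d(1, channels,
                              kernel_size=(kernel_slots, dim),
                              stride=(stride_slots, 1))
        self.proj = nn.Linear(channels, dim)

    def forward(self, slots):                    # slots: (n, d)
        x = slots.unsqueeze(0).unsqueeze(0)      # (1, 1, n, d)
        h = F.relu(self.conv(x))                 # (1, channels, n', 1)
        h = h.squeeze(-1).squeeze(0).t()         # (n', channels)
        return self.proj(h)                      # compressed memory (n', d)

memory = ConvWriteNetwork()(torch.randn(40, 300))  # 40 scene slots -> 19 cells
print(memory.shape)                                # torch.Size([19, 300])

A read network in the same spirit would convolve the stored cells again, conditioned on the question embedding, before an attention step selects the answer.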

Citations

A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos
This work proposes a novel memory network model named Past-Future Memory Network (PFMN), which scores 81 normal-field-of-view (NFOV) region proposals cropped from the input 360° video and recovers a latent, collective summary using two external memories that store the embeddings of previously selected subshots and of future candidate subshots.
Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
A Layered Memory Network (LMN) that represents frame-level and clip-level movie content via a Static Word Memory module and a Dynamic Subtitle Memory module is put forward; it achieves state-of-the-art performance on the online 'Video+Subtitles' evaluation task.
Movie Question Answering via Textual Memory and Plot Graph
A new dataset called PlotGraphs, which contains massive graph-based information about movies, is introduced as external knowledge, and a model that can utilize movie clips, subtitles, and graph-based external knowledge is put forward.
Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data
A novel end-to-end deep network model for reading comprehension, called Episodic Memory Reader (EMR), is proposed; it sequentially reads the input contexts into an external memory while replacing memories that are less important for answering unseen questions.
Progressive Attention Memory Network for Movie Story Question Answering
Experiments on the publicly available benchmark datasets MovieQA and TVQA demonstrate that each component of the proposed movie story QA architecture, PAMN, contributes to performance and helps achieve state-of-the-art results.
Multimodal Dual Attention Memory for Video Story Question Answering
Ablation studies confirm that the dual attention mechanism combined with late fusion performs best, and MDAM achieves new state-of-the-art results by significant margins over the runner-up models.
Motion-Appearance Co-memory Networks for Video Question Answering
The proposed motion-appearance co-memory network builds on concepts from the Dynamic Memory Network (DMN), introduces new mechanisms for video QA, and significantly outperforms the state of the art on all four tasks of TGIF-QA.
Triple Attention Network architecture for MovieQA
Experiments show that including audio via the triple-attention network provides complementary information for the movie QA task that is not captured by the visual or textual components of the data.
Graph-Based Multi-Interaction Network for Video Question Answering
A graph-based, relation-aware neural network is proposed to explore a more fine-grained visual representation that captures the relationships and dependencies between objects in videos, both spatially and temporally.
Extractive Video Summarizer with Memory Augmented Neural Networks
A memory-augmented extractive video summarizer is presented that utilizes an external memory to record visual information of the whole video with high capacity; its global attention modeling is shown to have two advantages: good transfer ability across datasets and high robustness to noisy videos.
...

References

Showing 1-10 of 40 references
DeepStory: Video Story QA by Deep Embedded Memory Networks
A video-story learning model, Deep Embedded Memory Networks (DEMN), reconstructs stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and outperforms other QA models.
Attend to You: Personalized Image Captioning with Context Sequence Memory Networks
This work proposes a captioning model named Context Sequence Memory Network (CSMN), shows the effectiveness of its three novel features, and demonstrates its performance gains for personalized image captioning over state-of-the-art captioning models.
Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes
This work presents an end-to-end differentiable memory access scheme, called Sparse Access Memory (SAM), that retains the representational power of the original approaches while training efficiently with very large memories, and achieves asymptotic lower bounds in space and time complexity.
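As a rough numpy illustration of the sparse-read idea (ours, with arbitrary shapes and a simple top-k rule; SAM's actual scheme also handles sparse writes and efficient indexing):

import numpy as np

def sparse_read(memory, query, k=4):
    """Attend over only the k best-matching cells instead of all n."""
    scores = memory @ query                    # (n,) match score per cell
    top = np.argsort(scores)[-k:]              # indices of the k largest scores
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the k cells only
    return w @ memory[top]                     # (d,) sparse read-out

rng = np.random.default_rng(0)
print(sparse_read(rng.normal(size=(100, 16)), rng.normal(size=16)).shape)  # (16,)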
End-To-End Memory Networks
A neural network with a recurrent attention model over a possibly large external memory is trained end-to-end and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
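This soft attention read is the dense counterpart of the sparse read sketched above, and it is the primitive that RWMN and several of the citing works build on. A toy numpy version (our shapes and names, not the paper's notation):

import numpy as np

def memory_read(memory, query):
    """memory: (n, d) cell embeddings; query: (d,) question embedding."""
    scores = memory @ query                  # match score for every cell
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention over cells
    return weights @ memory                  # (d,) weighted read-out

In the full model this read is stacked over multiple hops, with each hop's read-out added to the query before the next hop.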
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
This paper proposes three new tasks designed specifically for video QA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale video QA dataset named TGIF-QA that extends existing VQA work with these new tasks.
End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering
A high-level concept word detector that can be integrated with any video-to-language model, together with a semantic attention mechanism that selectively focuses on the detected concept words and fuses them with the word encoding and decoding in the language model.
Memory Networks
This work describes a new class of learning models called memory networks, which reason with inference components combined with a long-term memory component; they learn how to use these jointly.
Key-Value Memory Networks for Directly Reading Documents
This work introduces a new method, Key-Value Memory Networks, that makes reading documents more viable by utilizing different encodings in the addressing and output stages of the memory read operation.
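A minimal sketch of that split, assuming paired key/value encodings of each document fragment (illustrative, not the paper's code): the query is matched against the keys in the addressing stage, while the values are what the output stage returns.

import numpy as np

def key_value_read(keys, values, query):
    """keys, values: (n, d) paired encodings; query: (d,)."""
    scores = keys @ query               # addressing uses the key encoding
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the n key-value pairs
    return weights @ values             # output uses the value encoding

For example, the keys could encode a text window around an entity mention while the values encode the entity itself, so addressing and output need not share one representation.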
Hierarchical Memory Networks
A form of hierarchical memory network is explored, which can be considered a hybrid between hard and soft attention memory networks; it is organized in a hierarchical structure such that reading from it requires less computation than soft attention over a flat memory, while being easier to train than hard attention over a flat memory.
MovieQA: Understanding Stories in Movies through Question-Answering
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced, and existing QA techniques are extended to show that question answering with such open-ended semantics is hard.
...