Cross-Modal and Hierarchical Modeling of Video and Text

@inproceedings{Zhang2018CrossModalAH,
  title={Cross-Modal and Hierarchical Modeling of Video and Text},
  author={Bowen Zhang and Hexiang Hu and Fei Sha},
  booktitle={ECCV},
  year={2018}
}
Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate the modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities…
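To make the idea of aligning video and text at multiple granularities concrete, here is a minimal illustrative sketch, not the paper's actual architecture, loss, or hyperparameters: clips and sentences are encoded by a low-level GRU, summarized by a high-level GRU into video- and paragraph-level embeddings, and the two modalities are aligned with a symmetric contrastive loss. The choice of PyTorch, GRU encoders, the InfoNCE-style loss, and all names and dimensions are assumptions made for illustration.

# Illustrative sketch only (assumed design, not the method of Zhang et al. 2018):
# encode a video as a sequence of clip features and a paragraph as a sequence of
# sentence features, pool each hierarchy into a single embedding, and train so
# that paired video/paragraph embeddings align in a shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoder(nn.Module):
    """GRU over frames/words inside each unit (clip or sentence), then a GRU
    over the resulting unit embeddings to produce one high-level embedding."""
    def __init__(self, in_dim, hid_dim, emb_dim):
        super().__init__()
        self.low = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.high = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, emb_dim)

    def forward(self, x):                       # x: (batch, units, steps, in_dim)
        b, u, t, d = x.shape
        _, h_low = self.low(x.reshape(b * u, t, d))         # last hidden per unit
        unit_emb = h_low[-1].reshape(b, u, -1)               # (batch, units, hid_dim)
        _, h_high = self.high(unit_emb)                      # summarize the units
        return F.normalize(self.proj(h_high[-1]), dim=-1)    # (batch, emb_dim)

def contrastive_loss(v, s, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings."""
    logits = v @ s.t() / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Toy usage with random features standing in for clip and sentence inputs.
video_enc = HierarchicalEncoder(in_dim=2048, hid_dim=512, emb_dim=256)
text_enc = HierarchicalEncoder(in_dim=300, hid_dim=512, emb_dim=256)
clips = torch.randn(4, 3, 16, 2048)      # 4 videos, 3 clips, 16 frames each
sents = torch.randn(4, 3, 20, 300)       # 4 paragraphs, 3 sentences, 20 words
loss = contrastive_loss(video_enc(clips), text_enc(sents))
loss.backward()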
HANet: Hierarchical Alignment Networks for Video-Text Retrieval
TLDR
The proposed Hierarchical Alignment Network (HANet), which aligns representations at different levels for video-text matching, outperforms other state-of-the-art methods, demonstrating the effectiveness of hierarchical representation and alignment.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
TLDR
This paper proposes a Cooperative Hierarchical Transformer that leverages hierarchy information and models the interactions between different levels of granularity and different modalities in real-world video-text tasks.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
TLDR
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
Video-Text Pre-training with Learned Regions
TLDR
This work proposes a simple yet effective module for video-text representation learning, namely RegionLearner, which can take into account the structure of objects during pretraining on large-scale video-text pairs and is much more computationally efficient.
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
TLDR
The HierArchical Multi-Modal EncodeR (HAMMER) is proposed, which encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely video retrieval, segment temporal localization, and masked language modeling.
A comprehensive review of the video-to-text problem
TLDR
This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description, and categorizes and describes the state-of-the-art techniques.
Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language
TLDR
A novel TALL method is proposed which builds a Hierarchical Visual-Textual Graph to model interactions between the objects and words as well as among the objects to jointly understand the video contents and the language.
Video and Text Matching with Conditioned Embeddings
TLDR
This work encodes the dataset data in a way that takes into account the query's relevant information, and achieves state-of-the-art results for both sentence-clip and video-text matching by a sizable margin across five different datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC.
Dual Encoding for Video Retrieval by Text.
TLDR
This paper proposes a dual deep encoding network that encodes videos and queries into powerful dense representations of their own and introduces hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space.
Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review
TLDR
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem and analyzes how the most reported benchmark datasets have been created, showing their drawbacks and strengths for the problem requirements.
...

References

SHOWING 1-10 OF 53 REFERENCES
Jointly Localizing and Describing Events for Dense Video Captioning
TLDR
This paper presents a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner.
Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
TLDR
This work presents a hierarchical structured recurrent neural network (RNN), namely Hierarchical Multimodal LSTM (HM-LSTM), which exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations.
Localizing Moments in Video with Natural Language
TLDR
The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
Video Summarization with Long Short-Term Memory
TLDR
Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in the task of video summarization; summarization is further improved by reducing the discrepancies in statistical properties across datasets.
Learning Robust Visual-Semantic Embeddings
TLDR
An end-to-end learning framework that is able to extract more robust multi-modal representations across domains and a novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data.
Sequence to Sequence -- Video to Text
TLDR
A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e., a language model.
Dense-Captioning Events in Videos
TLDR
This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
TLDR
An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
TLDR
This paper proposes a new approach, the Hierarchical Recurrent Neural Encoder (HRNE), which exploits temporal information of videos over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
...