Cross-Modal and Hierarchical Modeling of Video and Text
```bibtex
@inproceedings{Zhang2018CrossModalAH,
  title     = {Cross-Modal and Hierarchical Modeling of Video and Text},
  author    = {Bowen Zhang and Hexiang Hu and Fei Sha},
  booktitle = {ECCV},
  year      = {2018}
}
```
Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, each of which depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities…
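As a concrete illustration of the hierarchy the abstract describes, the sketch below encodes each modality with a two-level recurrent model: a low-level RNN summarizes each clip or sentence, and a high-level RNN summarizes the sequence of those summaries, so matching can be imposed across modalities at either level. This is a minimal sketch, not the authors' code; the GRU cells, feature sizes, and dot-product score are illustrative assumptions.

```python
# Minimal sketch of a two-level hierarchical encoder shared by both modalities.
# Assumptions (not from the paper): GRU cells, per-segment final hidden states,
# and a plain dot-product matching score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim, embed_dim):
        super().__init__()
        self.low = nn.GRU(feat_dim, hidden_dim, batch_first=True)    # within a clip/sentence
        self.high = nn.GRU(hidden_dim, embed_dim, batch_first=True)  # across clips/sentences

    def forward(self, segments):
        # segments: list of (seg_len, feat_dim) tensors, one per clip or sentence
        seg_embs = [self.low(s.unsqueeze(0))[1][-1] for s in segments]  # (1, hidden_dim) each
        seq = torch.stack(seg_embs, dim=1)            # (1, num_segments, hidden_dim)
        _, h = self.high(seq)                         # summarize the segment sequence
        return F.normalize(h[-1], dim=-1)             # (1, embed_dim) unit-norm embedding

video_enc = HierarchicalEncoder(feat_dim=2048, hidden_dim=512, embed_dim=512)
text_enc = HierarchicalEncoder(feat_dim=300, hidden_dim=512, embed_dim=512)
clips = [torch.randn(16, 2048), torch.randn(24, 2048)]  # frame features for two clips
sents = [torch.randn(12, 300), torch.randn(9, 300)]     # word vectors for two sentences
score = (video_enc(clips) * text_enc(sents)).sum()      # cross-modal cosine similarity
```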
93 Citations
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
- Computer Science · NeurIPS · 2020
This paper proposes a Cooperative Hierarchical Transformer (COOT) to leverage hierarchical structure and model the interactions between different levels of granularity and between different modalities in real-world video-text tasks.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
- Computer Science · IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · 2020
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
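As a hedged illustration of the global-to-local decomposition, the sketch below pools word features with one learned attention query per semantic level and fuses the per-level similarities; the level names, the queries, and the sum fusion are assumptions, not HGR's exact attention-based graph reasoning.

```python
# Hedged sketch of global-to-local matching in the spirit of HGR; the level
# names, learned attention queries, and sum fusion are assumptions, not the
# paper's exact attention-based graph reasoning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelwiseMatcher(nn.Module):
    def __init__(self, dim, levels=("event", "action", "entity")):
        super().__init__()
        # one learned query per semantic level, pooling word features differently
        self.queries = nn.ParameterDict({l: nn.Parameter(torch.randn(dim)) for l in levels})

    def forward(self, words, video):
        # words: (num_words, dim) text features; video: (dim,) pooled video feature
        video = F.normalize(video, dim=-1)
        sims = []
        for q in self.queries.values():
            attn = F.softmax(words @ q, dim=0)         # level-specific word attention
            text = F.normalize(attn @ words, dim=-1)   # attention-pooled level embedding
            sims.append(text @ video)                  # per-level cosine similarity
        return torch.stack(sims).sum()                 # fuse levels into one score
```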
Video-Text Pre-training with Learned Regions
- Computer Science · arXiv · 2021
This work proposes a simple yet effective module for video-text representation learning, namely RegionLearner, which takes the structure of objects into account during pre-training on large-scale video-text pairs and is much more computationally efficient.
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
- Computer Science · arXiv · 2020
The HierArchical Multi-Modal EncodeR (HAMMER) is proposed, which encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales, based on multiple subtasks: video retrieval, segment temporal localization, and masked language modeling.
Level-wise aligned dual networks for text–video retrieval
- Computer Science · EURASIP Journal on Advances in Signal Processing · 2022
Level-wise aligned dual networks (LADNs) for text–video retrieval use four common latent spaces to improve retrieval performance, and utilize a semantic concept space to increase the interpretability of the model.
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
- Computer Science · arXiv · 2021
A novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) is proposed that achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks.
A comprehensive review of the video-to-text problem
- Computer Science · Artif. Intell. Rev. · 2022
This paper reviews the video-to-text problem, in which the goal is to associate an input video with its textual description, and categorizes and describes the state-of-the-art techniques.
Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language
- Computer Science · ECCV · 2020
A novel TALL method is proposed which builds a Hierarchical Visual-Textual Graph to model interactions between the objects and words as well as among the objects to jointly understand the video contents and the language.
Video and Text Matching with Conditioned Embeddings
- Computer Science · IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) · 2022
This work encodes video and text in a way that takes the query's relevant information into account, and achieves state-of-the-art results for both sentence-clip and video-text retrieval by a sizable margin across five datasets: ActivityNet, DiDeMo, YouCook2, MSR-VTT, and LSMDC.
Dual Encoding for Video Retrieval by Text
- Computer Science · IEEE Transactions on Pattern Analysis and Machine Intelligence · 2021
This paper proposes a dual deep encoding network that encodes videos and queries into powerful dense representations of their own and introduces hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space.
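The hybrid-space idea lends itself to a short sketch: one branch yields a dense latent embedding for matching performance, while a second branch yields sigmoid scores over a concept vocabulary that read as interpretable tags. The shared head for both modalities and the equal weighting below are simplifying assumptions, not the paper's exact architecture.

```python
# Minimal sketch of hybrid space learning: a dense latent branch for matching
# performance plus a sigmoid concept branch for interpretability. The shared
# head for both modalities and the equal weighting are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSpaceHead(nn.Module):
    def __init__(self, in_dim, latent_dim, num_concepts):
        super().__init__()
        self.latent = nn.Linear(in_dim, latent_dim)     # dense latent space
        self.concept = nn.Linear(in_dim, num_concepts)  # interpretable concept space

    def forward(self, x):
        z = F.normalize(self.latent(x), dim=-1)         # unit-norm latent embedding
        c = torch.sigmoid(self.concept(x))              # per-concept probabilities
        return z, c

def hybrid_similarity(video_feat, query_feat, head, alpha=0.5):
    zv, cv = head(video_feat)
    zq, cq = head(query_feat)
    latent_sim = (zv * zq).sum(-1)                      # cosine in the latent space
    concept_sim = F.cosine_similarity(cv, cq, dim=-1)   # agreement on concept scores
    return alpha * latent_sim + (1 - alpha) * concept_sim
```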
References
Showing 1-10 of 53 references
Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2017
This work presents a hierarchical structured recurrent neural network (RNN), namely Hierarchical Multimodal LSTM (HM-LSTM), which exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations.
Localizing Moments in Video with Natural Language
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2017
The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
Video Summarization with Long Short-Term Memory
- Computer Science · ECCV · 2016
Long Short-Term Memory (LSTM), a special type of recurrent neural network, is used to model the variable-range dependencies entailed in the task of video summarization; summarization is further improved by reducing the discrepancies in statistical properties across datasets.
Learning Robust Visual-Semantic Embeddings
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2017
An end-to-end learning framework is introduced that extracts more robust multi-modal representations across domains, together with a novel unsupervised-data-adaptation inference technique that constructs more comprehensive embeddings for both labeled and unlabeled data.
Sequence to Sequence - Video to Text
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2015
A novel end-to-end sequence-to-sequence model to generate captions for videos that naturally learns the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e., a language model.
Dense-Captioning Events in Videos
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2017
This work proposes a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language, and introduces a new captioning module that uses contextual information from past and future events to jointly describe all events.
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
- Computer Science · IEEE Conference on Computer Vision and Pattern Recognition (CVPR) · 2016
An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.
Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
- Computer Science · IEEE Conference on Computer Vision and Pattern Recognition (CVPR) · 2016
This paper proposes a new approach, namely the Hierarchical Recurrent Neural Encoder (HRNE), which exploits video temporal structure over a longer range by reducing the length of the input information flow and compositing multiple consecutive inputs at a higher level.
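The length-reduction idea in this summary can be sketched directly: a lower LSTM encodes fixed-size chunks of frames, and an upper LSTM consumes the per-chunk states, so the upper sequence is shorter by the chunk factor. The chunk size and dimensions below are illustrative assumptions.

```python
# Hedged sketch of the HRNE idea: shorten the input flow by encoding fixed-size
# chunks of frames with a lower LSTM, then feeding chunk states to an upper LSTM.
# Chunk size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalTemporalEncoder(nn.Module):
    def __init__(self, feat_dim, hidden_dim, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.lower = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.upper = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, frames):  # frames: (num_frames, feat_dim)
        n = frames.size(0) // self.chunk * self.chunk
        chunks = frames[:n].view(-1, self.chunk, frames.size(1))  # (num_chunks, chunk, feat)
        _, (h, _) = self.lower(chunks)                 # encode each chunk independently
        chunk_states = h[-1].unsqueeze(0)              # (1, num_chunks, hidden_dim)
        out, _ = self.upper(chunk_states)              # a sequence 1/chunk as long
        return out.squeeze(0)                          # higher-level temporal features
```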
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- Computer Science · arXiv · 2014
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
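The vector-space-arithmetic claim can be illustrated with a toy nearest-neighbor query; the vectors below are synthetic stand-ins for what trained image and text encoders would produce in a joint embedding space.

```python
# Toy illustration (numpy, made-up vectors) of the multimodal vector arithmetic
# the summary mentions: image("blue car") - word("blue") + word("red") should
# land near image("red car") in a well-trained joint embedding space.
import numpy as np

def nearest(query, gallery, names):
    sims = gallery @ query / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(query))
    return names[int(np.argmax(sims))]

rng = np.random.default_rng(0)
# stand-in embeddings; in practice these come from the trained image/text encoders
emb = {k: rng.normal(size=64) for k in ["img_blue_car", "img_red_car", "blue", "red"]}
emb["img_red_car"] = emb["img_blue_car"] - emb["blue"] + emb["red"] + 0.05 * rng.normal(size=64)

query = emb["img_blue_car"] - emb["blue"] + emb["red"]
gallery = np.stack([emb["img_blue_car"], emb["img_red_car"]])
print(nearest(query, gallery, ["img_blue_car", "img_red_car"]))  # -> img_red_car
```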
Temporal Action Detection with Structured Segment Networks
- Computer Science · IEEE International Conference on Computer Vision (ICCV) · 2017
This paper addresses an important and challenging task, namely detecting the temporal intervals of actions in untrimmed videos. Specifically, we present a framework called structured segment network…