Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds… 
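The core idea described above — propagating object features along spatial edges (objects within a frame) and temporal edges (the same object across frames) — can be sketched with a single round of mean-aggregation message passing. This is an illustrative simplification, not the paper's actual architecture: the function name, the toy graph, and the aggregation rule are all assumptions for the sketch.

```python
import numpy as np

def message_pass(feats, adj):
    """One round of mean-aggregation message passing.

    feats: (N, D) node features; adj: (N, N) 0/1 adjacency matrix.
    A hypothetical simplification of graph convolution over a
    spatio-temporal object graph.
    """
    adj = adj + np.eye(adj.shape[0])      # self-loops keep each node's own feature
    deg = adj.sum(axis=1, keepdims=True)  # node degrees for normalization
    return (adj @ feats) / deg            # average over connected neighbors

# Toy example: 2 frames x 2 detected objects = 4 graph nodes.
# Spatial edges connect objects within a frame; temporal edges link the
# same object across adjacent frames.
feats = np.arange(8, dtype=float).reshape(4, 2)
adj = np.array([
    [0, 1, 1, 0],  # frame0-obj0: spatial -> frame0-obj1, temporal -> frame1-obj0
    [1, 0, 0, 1],  # frame0-obj1: spatial -> frame0-obj0, temporal -> frame1-obj1
    [1, 0, 0, 1],  # frame1-obj0
    [0, 1, 1, 0],  # frame1-obj1
], dtype=float)

out = message_pass(feats, adj)
print(out)  # each node's feature blended with its spatial/temporal neighbors
```

After passing, each node's representation mixes in its neighbors' features, which is how interaction information spreads across space and time before being fed to a caption decoder.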

Motion Guided Region Message Passing for Video Captioning
A Recurrent Region Attention module is proposed to better extract diverse spatial features, and by employing Motion-Guided Cross-frame Message Passing, this model is aware of the temporal structure and able to establish high-order relations among the diverse regions across frames.
GL-RG: Global-Local Representation Granularity for Video Captioning
The proposed GL-RG framework for video captioning, a Global-Local Representation Granularity, demonstrates advantages over prior efforts: it explicitly exploits extensive visual representations from different video ranges to improve linguistic expression, and it develops an incremental training strategy that organizes model learning in an incremental fashion to achieve optimal captioning behavior.
Co-Segmentation Aided Two-Stream Architecture for Video Captioning
A novel architecture that learns to attend to salient regions such as objects, persons automatically using a co-segmentation inspired attention module is proposed, and it is argued that using an external object detector could be eliminated if the model is equipped with the capability of automatically finding salient regions.
Dense Video Captioning with Early Linguistic Information Fusion
A Visual-Semantic Embedding (ViSE) Framework is proposed that models the word(s)-context distributional properties over the entire semantic space and computes weights for all the n-grams such that higher weights are assigned to the more informative n-grams.
Discriminative Latent Semantic Graph for Video Captioning
A novel Conditional Graph that fuses spatio-temporal information into latent object proposals and a novel Discriminative Language Validator that verifies generated captions are proposed, so that key semantic concepts are effectively preserved.
Cross-Modal Graph with Meta Concepts for Video Captioning
This paper investigates an open research task of generating text descriptions for the given videos, and proposes Cross-Modal Graph (CMG) with meta concepts for video captioning, which weakly learns the corresponding visual regions for text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts.
Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with…
Support-set based Multi-modal Representation Enhancement for Video Captioning
This work proposes a novel and flexible framework, the Support-set based Multi-modal Representation Enhancement (SMRE) model, to mine rich information in a semantic subspace shared between samples and obtain semantic-related visual elements for video captioning.
Visual-aware Attention Dual-stream Decoder for Video Captioning
A new Visual-aware Attention Dual-stream Decoder (VADD) model is proposed, which concatenates dynamic changes of temporal sequence frames with the words at the previous moment, as the input of attention mechanism to extract sequence features.
CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning
Dual Attribute Prediction is introduced, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes and the co-occurrence relations between attributes.


Object-Aware Aggregation With Bidirectional Temporal Graph for Video Captioning
  Junchao Zhang, Yuxin Peng. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
This paper proposes a new video captioning approach based on object-aware aggregation with bidirectional temporal graph (OA-BTG), which captures detailed temporal dynamics for salient objects in video, and learns discriminative spatio-temporal representations by performing object-aware local feature aggregation on detected object regions.
Memory-Attended Recurrent Network for Video Captioning
The Memory-Attended Recurrent Network (MARN) for video captioning is proposed, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in training data.
Attend and Interact: Higher-Order Object Interactions for Video Understanding
It is demonstrated that modeling object interactions significantly improves accuracy for both action recognition and video captioning, while saving more than 3× the computation of traditional pairwise relationship modeling.
Videos as Space-Time Region Graphs
The proposed graph representation achieves state-of-the-art results on the Charades and Something-Something datasets and obtains a substantial gain when the model is applied in complex environments.
Neural Message Passing on Hybrid Spatio-Temporal Visual and Symbolic Graphs for Video Understanding
The generality of the approach is demonstrated on a variety of tasks, such as temporal subactivity classification and object affordance classification on the CAD-120 dataset and multilabel temporal action localization on the large-scale Charades dataset, where it outperforms existing deep learning approaches using only raw RGB frames.
Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning
This work proposes a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos, and achieves substantially better performance than the state-of-the-art methods.
Spatio-Temporal Action Graph Networks
This work proposes a novel inter-object graph representation for activity recognition based on a disentangled graph embedding with direct observation of edge appearance, and offers significantly improved performance compared to baseline approaches without object-graph representations, or with previous graph-based models.
TVT: Two-View Transformer Network for Video Captioning
This paper introduces a novel video captioning framework, the Two-View Transformer (TVT), which comprises a Transformer network backbone for sequential representation and two types of fusion blocks in the decoder layers for combining different modalities effectively.
Classifying Collisions with Spatio-Temporal Action Graph Networks
It is shown that a new model for explicit representation of object interactions significantly improves deep video activity classification for driving collision detection and proposes a Spatio-Temporal Action Graph (STAG) network, which incorporates spatial and temporal relations of objects.
Grounded Video Description
A novel video description model is proposed that exploits bounding box annotations and achieves state-of-the-art performance on video description, video paragraph description, and image description, demonstrating that the generated sentences are better grounded in the video.