Object Relational Graph With Teacher-Recommended Learning for Video Captioning

@article{Zhang2020ObjectRG,
  title={Object Relational Graph With Teacher-Recommended Learning for Video Captioning},
  author={Ziqi Zhang and Yaya Shi and Chunfeng Yuan and Bing Li and Peijin Wang and Weiming Hu and Zhengjun Zha},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={13275-13285}
}
  • Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zhengjun Zha
  • Published 26 February 2020
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation because they neglect the interactions between objects, and lack sufficient training for content-related words due to the long-tailed problem. In this paper, we propose a complete video captioning system including both a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which… 
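Below is a minimal, illustrative sketch of the ORG idea named in the abstract: features of detected objects are related to one another through a learned adjacency matrix and refined by message passing. The layer sizes, the dot-product form of the adjacency, and the PyTorch framing are assumptions made for illustration, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationalGraph(nn.Module):
    # Relates detected object features through a learned adjacency matrix and
    # refines them with one round of message passing (hypothetical layer sizes).
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.query = nn.Linear(feat_dim, hidden_dim)  # "sender" projection
        self.key = nn.Linear(feat_dim, hidden_dim)    # "receiver" projection
        self.value = nn.Linear(feat_dim, hidden_dim)  # message content
        self.out = nn.Linear(hidden_dim, feat_dim)    # back to the feature space

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, feat_dim) region features from a detector
        q, k, v = self.query(obj_feats), self.key(obj_feats), self.value(obj_feats)
        # Learned adjacency: softmax-normalized pairwise affinities between objects
        adj = F.softmax(q @ k.transpose(-1, -2) / k.size(-1) ** 0.5, dim=-1)
        # Message passing plus a residual that keeps the original appearance cue
        return obj_feats + self.out(adj @ v)

if __name__ == "__main__":
    feats = torch.randn(2, 10, 2048)           # 2 clips, 10 detected objects each
    enhanced = ObjectRelationalGraph()(feats)  # relation-aware object features
    print(enhanced.shape)                      # torch.Size([2, 10, 2048])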

Figures and Tables from this paper

Citations
Multi-Level Visual Representation with Semantic-Reinforced Learning for Video Captioning
This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with…
Discriminative Latent Semantic Graph for Video Captioning
TLDR
A novel Conditional Graph that can fuse spatio-temporal information into latent object proposal and a novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts can be effectively preserved.
Support-set based Multi-modal Representation Enhancement for Video Captioning
TLDR
This work proposes a novel and flexible framework, the Support-set based Multi-modal Representation Enhancement (SMRE) model, which mines rich information in a semantic subspace shared between samples to obtain semantically related visual elements for video captioning.
CLIP4Caption: CLIP for Video Caption
TLDR
A CLIP4Caption framework is proposed that improves video captioning with a CLIP-enhanced video-text matching network (VTM) and adopts a Transformer-structured decoder network to effectively learn long-range visual and language dependencies.
Hierarchical Modular Network for Video Captioning
TLDR
A hierarchical modular network is proposed to bridge video representations and linguistic semantics at three levels before generating captions; the method performs favorably against state-of-the-art models on two widely used benchmarks.
Dense Video Captioning with Early Linguistic Information Fusion
TLDR
A Visual-Semantic Embedding (ViSE) Framework is proposed that models the word(s)-context distributional properties over the entire semantic space and computes weights for all the n-grams such that higher weights are assigned to the more informative n-grams.
Cross-Modal Graph with Meta Concepts for Video Captioning
TLDR
This paper investigates the open research task of generating text descriptions for given videos and proposes a Cross-Modal Graph (CMG) with meta concepts for video captioning, which weakly learns the visual regions corresponding to text descriptions; the associated visual regions and textual words are named cross-modal meta concepts.
CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning
TLDR
Dual Attribute Prediction is introduced, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes and the co-occurrence relations between attributes.
Co-Segmentation Aided Two-Stream Architecture for Video Captioning
TLDR
A novel architecture is proposed that learns to automatically attend to salient regions such as objects and persons using a co-segmentation-inspired attention module, and it is argued that an external object detector can be eliminated if the model is equipped with the capability of automatically finding salient regions.
Teacher-Critical Training Strategies for Image Captioning
TLDR
A teacher model is introduced that serves as a bridge between the ground-truth caption and the captioning model by generating easier-to-learn word proposals as soft targets, facilitating a better learning process for the captioning model.
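The soft-target idea shared by this teacher-critical strategy and the surveyed paper's teacher-recommended learning can be sketched as a loss that mixes cross-entropy on the ground-truth caption with a KL term toward the teacher's word distribution (e.g. from an external language model). The function below is a hedged illustration; the mixing weight alpha and the temperature are hypothetical hyperparameters, not values from either paper.

import torch
import torch.nn.functional as F

def soft_target_caption_loss(student_logits, teacher_logits, gt_tokens,
                             alpha=0.5, temperature=2.0):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size)
    # gt_tokens: (batch, seq_len) indices of the annotated caption words
    vocab = student_logits.size(-1)
    # Hard loss: standard cross-entropy against the ground-truth caption
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), gt_tokens.reshape(-1))
    # Soft loss: KL divergence toward the teacher's softened word proposals
    kl = F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1),
        F.softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Blend the two signals; the teacher term supplies content-related words
    # that are rare in the ground-truth captions (the long-tailed problem)
    return (1 - alpha) * ce + alpha * kl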

References

Showing 1-10 of 58 references
Exploring Visual Relationship for Image Captioning
TLDR
This paper introduces a new design that explores the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
Hierarchical Global-Local Temporal Modeling for Video Captioning
TLDR
A Hierarchical Temporal Model is proposed for the video captioning task, based on exploring global and local temporal structure to better recognize fine-grained objects and actions, and it generates more accurate descriptions when handling shot-switching problems.
Memory-Attended Recurrent Network for Video Captioning
TLDR
The Memory-Attended Recurrent Network (MARN) for video captioning is proposed, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in training data.
M3: Multimodal Memory Modelling for Video Captioning
TLDR
A Multimodal Memory Model (M3) is proposed to describe videos; it builds a shared visual and textual memory to model long-term visual-textual dependency and further guides visual attention to the described visual targets to solve visual-textual alignment.
Jointly Modeling Embedding and Translation to Bridge Video and Language
TLDR
A novel unified framework named Long Short-Term Memory with visual-semantic Embedding (LSTM-E) is proposed, which simultaneously explores the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
Not All Words Are Equal: Video-specific Information Loss for Video Captioning
TLDR
A framework with hierarchical visual representations and an optimized hierarchical attention mechanism is established to capture the most salient spatial-temporal visual information, which fully exploits the potential strength of the proposed learning strategy.
Context-Aware Visual Policy Network for Fine-Grained Image Captioning
TLDR
A Context-Aware Visual Policy network (CAVP) is proposed for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning, which can attend to complex visual compositions over time.
Context-Aware Visual Policy Network for Sequence-Level Image Captioning
TLDR
A Context-Aware Visual Policy network (CAVP) for sequence-level image captioning that explicitly accounts for the previous visual attentions as the context, and then decides whether the context is helpful for the current word generation given the current visual attention.
Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
TLDR
This work argues that careful design of visual features for this task is equally important, and presents a visual feature encoding technique to generate semantically rich captions using Gated Recurrent Units (GRUs).
Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning
TLDR
This work proposes a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos, and achieves substantially better performance than the state-of-the-art methods.