Syntax Customized Video Captioning by Imitating Exemplar Sentences
@article{Yuan2021SyntaxCV,
  title   = {Syntax Customized Video Captioning by Imitating Exemplar Sentences},
  author  = {Yitian Yuan and Lin Ma and Wenwu Zhu},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2112.01062}
}
Enhancing the diversity of the sentences used to describe video content is an important problem in recent video captioning research. In this paper, we explore this problem from the novel perspective of customizing video captions by imitating exemplar sentence syntaxes. Specifically, given a video and any syntax-valid exemplar sentence, we introduce the new task of Syntax Customized Video Captioning (SCVC), which aims to generate a caption that not only semantically describes the video contents but…
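As a rough illustration of the task interface (the function name and signature below are hypothetical, not taken from the paper's code), an SCVC model consumes a video plus an exemplar sentence and emits a single caption:

```python
# Hypothetical SCVC interface sketch; names and signature are illustrative
# assumptions, not the paper's implementation.
def syntax_customized_caption(video_features, exemplar_sentence: str) -> str:
    """Return a caption that matches the video's semantics while
    imitating the syntactic structure of `exemplar_sentence`."""
    raise NotImplementedError  # placeholder for an actual SCVC model
```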
References
Showing 1–10 of 42 references.
Reconstruction Network for Video Captioning
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
Proposes a reconstruction network with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning; the reconstruction objective boosts the encoding models and leads to significant gains in video caption accuracy.
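To make the encoder-decoder-reconstructor idea concrete, here is a minimal sketch (module choices and dimensions are assumptions, not the paper's architecture): a reconstructor regenerates the global video feature from the decoder's hidden states, adding a backward sentence-to-video loss on top of the usual captioning loss.

```python
# Minimal sketch of the encoder-decoder-reconstructor idea; all sizes and
# module choices are illustrative assumptions.
import torch
import torch.nn as nn

class RecNetSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab)
        self.reconstructor = nn.Linear(hidden, feat_dim)  # hidden states -> video feature

    def forward(self, frames, max_len=10):
        enc_out, h = self.encoder(frames)                 # forward flow: video -> sentence
        dec_in = enc_out.mean(1, keepdim=True).expand(-1, max_len, -1)
        dec_out, _ = self.decoder(dec_in, h)
        logits = self.classifier(dec_out)                 # word predictions
        recon = self.reconstructor(dec_out.mean(1))       # backward flow: sentence -> video
        return logits, recon

model = RecNetSketch()
frames = torch.randn(2, 20, 512)                          # 2 videos, 20 frame features each
logits, recon = model(frames)
recon_loss = nn.functional.mse_loss(recon, frames.mean(1))  # match the global video feature
```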
Unsupervised Image Captioning
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
This paper makes the first attempt to train an image captioning model in an unsupervised manner, requiring only an image set, a sentence corpus, and an existing visual concept detector.
Deep Learning for Video Captioning: A Review
- Computer Science · IJCAI
- 2019
The problem of video captioning is formulated, state-of-the-art methods categorized by their emphasis on vision or language are reviewed, followed by a summary of standard datasets and representative approaches.
Weakly Supervised Dense Event Captioning in Videos
- Computer Science · NeurIPS
- 2018
This paper formulates a new problem, weakly supervised dense event captioning, which does not require temporal segment annotations for model training, and presents a cycle system to train the model.
Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word, which not only boosts the video captioning performance but also improves the diversity of the generated captions.
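As a rough illustration of this gating idea (a minimal sketch; the layer sizes and the single-sigmoid-gate formulation are assumptions, not the paper's gated fusion network), a learned gate decides per decoding step how much of a global POS-sequence embedding flows into the word predictor:

```python
# Sketch of per-step gated fusion of syntactic (POS) information; all names
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class GatedPOSFusion(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, dec_state, pos_embedding):
        g = torch.sigmoid(self.gate(torch.cat([dec_state, pos_embedding], -1)))
        fused = g * pos_embedding + (1 - g) * dec_state   # adaptive per-step mixing
        return self.out(fused)

fusion = GatedPOSFusion()
step_state = torch.randn(4, 256)   # decoder hidden state at one time step
pos_global = torch.randn(4, 256)   # embedding of the global POS sequence
context = fusion(step_state, pos_global)
```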
Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN
- Computer Science · ArXiv
- 2019
This paper aims to caption each video with multiple descriptions and proposes a novel framework, which is shown to generate diverse descriptions and achieve superior results against other state-of-the-art methods.
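The title's latent-variable expansion can be sketched roughly as follows (the conditional-GAN discriminator is omitted, and all names and dimensions are assumptions, not the paper's model): a random code z is fused with the video feature, so different samples of z yield different captions for the same video.

```python
# Sketch of latent-variable expansion for diverse captioning; illustrative
# assumptions throughout, discriminator omitted.
import torch
import torch.nn as nn

class DiverseCaptionerSketch(nn.Module):
    def __init__(self, feat_dim=512, z_dim=64, hidden=256, vocab=1000):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + z_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, vid_feat, z, max_len=10):
        h = torch.tanh(self.fuse(torch.cat([vid_feat, z], -1)))  # video + latent code
        dec_in = h.unsqueeze(1).expand(-1, max_len, -1)
        out, _ = self.decoder(dec_in, h.unsqueeze(0))
        return self.classifier(out)

model = DiverseCaptionerSketch()
vid = torch.randn(1, 512)
for _ in range(3):                          # three latent samples -> three different captions
    logits = model(vid, torch.randn(1, 64))
```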
Learning Multimodal Attention LSTM Networks for Video Captioning
- Computer Science · ACM Multimedia
- 2017
This work presents a novel deep framework to boost video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM), and designs a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states.
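A heavily simplified sketch of the child-sum style fusion (the real unit operates inside the LSTM gates; this standalone version, its gate formulation, and all dimensions are assumptions): each modality is gated separately and the gated states are summed to initialize the decoder.

```python
# Simplified child-sum-style fusion of multiple modalities; illustrative
# assumptions, not the MA-LSTM's actual fusion unit.
import torch
import torch.nn as nn

class ChildSumFusion(nn.Module):
    def __init__(self, hidden=256, n_modalities=2):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_modalities))

    def forward(self, modality_states):
        # gate each modality independently, then sum (child-sum style)
        return sum(torch.sigmoid(g(m)) * m for g, m in zip(self.gates, modality_states))

fusion = ChildSumFusion()
visual = torch.randn(4, 256)
audio = torch.randn(4, 256)
init_state = fusion([visual, audio])   # initial hidden state for the caption decoder
```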
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A novel LSTM cell is proposed that identifies discontinuity points between frames or segments and modifies the temporal connections of the encoding layer accordingly, allowing the encoder to discover and leverage the hierarchical structure of the video.
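A rough sketch of the boundary-aware idea (the soft state reset below stands in for the paper's boundary detector; the detector form and all dimensions are assumptions): a learned module estimates a per-frame boundary probability and damps the recurrent state at detected boundaries, so temporal connections follow the video's segments.

```python
# Sketch of a boundary-aware recurrent encoder; the soft reset and all
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BoundaryAwareSketch(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.boundary = nn.Linear(feat_dim + hidden, 1)

    def forward(self, frames):
        b, t, _ = frames.shape
        h = frames.new_zeros(b, self.cell.hidden_size)
        c = frames.new_zeros(b, self.cell.hidden_size)
        outputs = []
        for i in range(t):
            x = frames[:, i]
            s = torch.sigmoid(self.boundary(torch.cat([x, h], -1)))  # boundary probability
            h, c = h * (1 - s), c * (1 - s)                          # soft reset at boundaries
            h, c = self.cell(x, (h, c))
            outputs.append(h)
        return torch.stack(outputs, 1)

enc = BoundaryAwareSketch()
out = enc(torch.randn(2, 16, 512))   # (batch, frames, hidden)
```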
StyleNet: Generating Attractive Visual Captions with Styles
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
StyleNet outperforms existing approaches for generating visual captions with different styles, as measured by both automatic and human evaluation metrics on the newly collected FlickrStyle10K image caption dataset, which contains 10K Flickr images with corresponding humorous and romantic captions.
Jointly Modeling Embedding and Translation to Bridge Video and Language
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A novel unified framework, Long Short-Term Memory with visual-semantic Embedding (LSTM-E), simultaneously explores LSTM learning and visual-semantic embedding, and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
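The joint objective can be sketched as follows (a minimal sketch; names, losses, and dimensions are illustrative assumptions, not LSTM-E's exact formulation): video and sentence features are projected into a shared space for a relevance loss, while an LSTM decodes the sentence for a coherence loss.

```python
# Sketch of jointly learning a visual-semantic embedding and an LSTM decoder;
# illustrative assumptions throughout.
import torch
import torch.nn as nn

class LSTMESketch(nn.Module):
    def __init__(self, vid_dim=512, txt_dim=300, embed=256, vocab=1000):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, embed)
        self.txt_proj = nn.Linear(txt_dim, embed)
        self.decoder = nn.LSTM(embed, embed, batch_first=True)
        self.classifier = nn.Linear(embed, vocab)

    def forward(self, vid_feat, txt_feat, word_embeds):
        v = self.vid_proj(vid_feat)                 # video -> shared embedding space
        s = self.txt_proj(txt_feat)                 # sentence -> shared embedding space
        relevance = (v - s).pow(2).sum(-1).mean()   # embedding-distance (relevance) loss
        dec_out, _ = self.decoder(word_embeds)
        logits = self.classifier(dec_out)           # word predictions (coherence loss input)
        return relevance, logits

model = LSTMESketch()
rel, logits = model(torch.randn(4, 512), torch.randn(4, 300), torch.randn(4, 8, 256))
```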