Corpus ID: 244798729

Syntax Customized Video Captioning by Imitating Exemplar Sentences

Yitian Yuan, Lin Ma, Wenwu Zhu
Enhancing the diversity of sentences to describe video contents is an important problem arising in recent video captioning research. In this paper, we explore this problem from a novel perspective of customizing video captions by imitating exemplar sentence syntaxes. Specifically, given a video and any syntax-valid exemplar sentence, we introduce a new task of Syntax Customized Video Captioning (SCVC) aiming to generate one caption which not only semantically describes the video contents but… 

Reconstruction Network for Video Captioning
A reconstruction network with a novel encoder-decoder-reconstructor architecture that leverages both the forward (video-to-sentence) and backward (sentence-to-video) flows for video captioning; this design boosts the encoder and yields significant gains in caption accuracy.
Unsupervised Image Captioning
  • Yang Feng, Lin Ma, Wei Liu, Jiebo Luo
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
This paper makes the first attempt to train an image captioning model in an unsupervised manner, requiring only an image set, a sentence corpus, and an existing visual concept detector.
Deep Learning for Video Captioning: A Review
The problem of video captioning is formulated, state-of-the-art methods are reviewed and categorized by their emphasis on vision or language, and standard datasets and representative approaches are summarized.
Weakly Supervised Dense Event Captioning in Videos
This paper formulates a new problem, weakly supervised dense event captioning, which does not require temporal segment annotations for model training, and presents a cycle system to train the model.
Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
A gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word, which not only boosts the video captioning performance but also improves the diversity of the generated captions.
Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN
This paper aims to caption each video with multiple descriptions and proposes a novel conditional-GAN-based framework that generates diverse descriptions and achieves superior results against other state-of-the-art methods.
Learning Multimodal Attention LSTM Networks for Video Captioning
This work presents a novel deep framework that boosts video captioning by learning Multimodal Attention Long-Short Term Memory networks (MA-LSTM), and designs a novel child-sum fusion unit in the MA-LSTM to effectively combine different encoded modalities into the initial decoding states.
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
A novel LSTM cell is proposed that identifies discontinuity points between frames or segments and modifies the temporal connections of the encoding layer accordingly, thereby discovering and leveraging the hierarchical structure of the video.
StyleNet: Generating Attractive Visual Captions with Styles
StyleNet outperforms existing approaches for generating visual captions with different styles, measured in both automatic and human evaluation metrics on the newly collected FlickrStyle10K image caption dataset, which contains 10K Flickr images with corresponding humorous and romantic captions.
Jointly Modeling Embedding and Translation to Bridge Video and Language
A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.