End-to-End Dense Video Captioning with Parallel Decoding

  • Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo
  • Published 17 August 2021
  • Computer Science
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which heavily relies on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), formulating dense caption generation as a set prediction task. In practice, through stacking a newly proposed event…
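The set-prediction formulation means each decoder query emits one candidate event, and predictions are matched one-to-one against ground-truth events by minimizing a pairwise cost before the loss is computed. A minimal sketch of that matching step, with a toy cost of 1 − temporal IoU and brute-force assignment (the actual PDVC cost also includes classification and caption terms and uses the Hungarian algorithm; `tiou` and `match_events` are illustrative names, not from the paper):

```python
from itertools import permutations

def tiou(a, b):
    """Temporal IoU between two segments a=(start, end), b=(start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def match_events(preds, gts):
    """One-to-one matching of predicted segments to ground-truth segments.

    Brute-force search over assignments (fine for toy sizes); cost per pair
    is 1 - tIoU, so the matching maximizes total temporal overlap.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(1.0 - tiou(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    # best[g] is the index of the prediction assigned to ground truth g
    return list(best)

preds = [(0.0, 5.0), (4.0, 9.0), (10.0, 15.0)]
gts = [(10.5, 15.0), (0.5, 4.5)]
print(match_events(preds, gts))  # -> [2, 0]
```

Unmatched queries are trained toward a "no event" class, which is what removes the need for hand-crafted proposal filtering in the localize-then-describe pipeline.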

End-to-end Dense Video Captioning as Sequence Generation

This work shows how to model the two subtasks of dense video captioning jointly as one sequence generation task, simultaneously predicting the events and the corresponding descriptions.
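A common way to cast both subtasks as one sequence is to quantize each event's start and end times into discrete time tokens and interleave them with the caption words, so a single decoder emits locations and text in one pass. A sketch of that serialization under assumed conventions (the bin count, `<t_k>` token format, and `serialize_events` name are illustrative, not the paper's):

```python
def serialize_events(events, duration, n_bins=100):
    """Serialize (start, end, caption) events into one token sequence.

    Times are quantized into n_bins bins relative to the video duration
    and rendered as <t_k> tokens; events are sorted by start time.
    """
    def t_tok(t):
        k = min(n_bins - 1, int(t / duration * n_bins))
        return f"<t_{k}>"
    tokens = []
    for start, end, caption in sorted(events):
        tokens += [t_tok(start), t_tok(end)] + caption.split()
    return tokens

events = [(12.0, 30.0, "he mixes the batter"), (0.0, 10.0, "a man cracks eggs")]
print(serialize_events(events, duration=60.0, n_bins=60))
# -> ['<t_0>', '<t_10>', 'a', 'man', 'cracks', 'eggs',
#     '<t_12>', '<t_30>', 'he', 'mixes', 'the', 'batter']
```

Decoding is the inverse: time tokens are parsed back into timestamps and the words between them become each event's caption.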

PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning

This work presents a semantic-assisted dense video captioning model based on the encoding-decoding framework that achieves significant improvements on the YouMakeup dataset and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of PIC 4th Challenge.

CapOnImage: Context-driven Dense-Captioning on Image

This work introduces a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information, and proposes a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations from easy to difficult.

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

Event detection is defined as a sequence generation task and a unified pre-training and fine-tuning framework is proposed to naturally enhance the inter-task association between event detection and captioning to detect more diverse and consistent events in the video.

Structured Stochastic Recurrent Network for Linguistic Video Prediction

This work introduces a new task of Linguistic Video Prediction (LVP), which aims to predict the forthcoming events based on past video content and generate corresponding linguistic descriptions, and proposes an end-to-end probabilistic approach named structured stochastic recurrent network (SRN) to characterize the one-to-many connections between past visual clues and possible future events.

VRDFormer: End-to-End Video Visual Relation Detection with Transformers

  • S. Zheng, Qin Jin
  • Computer Science
    2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2022
This paper proposes a transformer-based framework called VRDFormer, which achieves the state-of-the-art performance on both relation detection and relation tagging tasks and exploits a query-based approach to autoregressively generate relation instances.

VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

The experiments and ablation studies on ActivityNet Captions and YouCookII datasets show that the proposed VLCap outperforms existing SOTA methods on both accuracy and diversity metrics.

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Text-Based Retrieval

This paper introduces a new dataset called Kinetic-GEBC (Generic Event Boundary Captioning), consisting of over 170k boundaries associated with captions describing status changes in the generic events in 12K videos, which can drive developing more powerful methods to understand status changes and thus improve video level comprehension.

Exploiting Context Information for Generic Event Boundary Captioning

A model is proposed that directly takes the whole video as input and generates captions for all boundaries in parallel, learning the context information for each time boundary by modeling boundary-boundary interactions.

End-to-End Dense Video Captioning with Masked Transformer

This work proposes an end-to-end transformer model that employs a self-attention mechanism, enabling an efficient non-recurrent structure during encoding and leading to performance improvements.

Reconstruction Network for Video Captioning

A reconstruction network with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning, and can boost the encoding models and leads to significant gains in video caption accuracy.

Multi-modal Dense Video Captioning

This paper shows how audio and speech modalities may improve a dense video captioning model: an automatic speech recognition system is applied to obtain a temporally aligned textual description of the speech, which is treated as a separate input alongside the video frames and the corresponding audio track.

Streamlined Dense Video Captioning

A novel dense video captioning framework is proposed, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling.

An Efficient Framework for Dense Video Captioning

This paper proposes a deep reinforcement-based approach which enables an agent to describe multiple events in a video by watching a portion of the frames, and reduces the computational cost by processing fewer frames while maintaining accuracy.

Memory-Attended Recurrent Network for Video Captioning

The Memory-Attended Recurrent Network (MARN) for video captioning is proposed, in which a memory structure is designed to explore the full-spectrum correspondence between a word and its various similar visual contexts across videos in training data.

Jointly Localizing and Describing Events for Dense Video Captioning

This paper presents a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner.

Event-Centric Hierarchical Representation for Dense Video Captioning

A novel event-centric hierarchical representation is proposed to enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning, together with a duplicate removal method, temporal-linguistic non-maximum suppression (TL-NMS), to distinguish redundancy in both the localization and captioning stages.
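TL-NMS extends standard temporal non-maximum suppression, which on its own fits in a few lines: sort candidate segments by score, then greedily drop any segment that overlaps a kept one too much. A sketch of that baseline (the linguistic-similarity term that makes it "TL" is not shown; the `temporal_nms` name and 0.5 threshold are illustrative):

```python
def temporal_nms(segments, scores, iou_thr=0.5):
    """Greedy temporal NMS: keep highest-scoring segments first, and drop
    any segment whose temporal IoU with an already-kept one exceeds iou_thr."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        s1, e1 = segments[i]
        suppressed = False
        for j in keep:
            s2, e2 = segments[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = max(e1, e2) - min(s1, s2)
            if union > 0 and inter / union > iou_thr:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep  # indices of surviving segments, best-score first

segs = [(0.0, 10.0), (1.0, 10.5), (20.0, 30.0)]
print(temporal_nms(segs, [0.9, 0.8, 0.7]))  # -> [0, 2]
```

The paper's variant additionally compares the generated captions, so two overlapping events with near-identical descriptions can be deduplicated even when their temporal IoU alone would not trigger suppression.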

Hierarchical Context Encoding for Events Captioning in Videos

  • Dali Yang, C. Yuan
  • Computer Science
    2018 25th IEEE International Conference on Image Processing (ICIP)
  • 2018
This paper proposes a novel pipeline for captioning each event in one video (dense captioning in videos) and introduces an encoder working along the time axis, which encodes videos and outputs features from different levels of hierarchical LSTMs.

Video Captioning With Attention-Based LSTM and Semantic Consistency

A novel end-to-end framework named aLSTMs, an attention-based LSTM model with semantic consistency, transfers videos to natural sentences and achieves competitive or even better results than the state-of-the-art baselines for video captioning on both BLEU and METEOR.