Dense-Captioning Events in Videos
@article{Krishna2017DenseCaptioningEI,
  title   = {Dense-Captioning Events in Videos},
  author  = {Ranjay Krishna and Kenji Hata and Frederic Ren and Li Fei-Fei and Juan Carlos Niebles},
  journal = {2017 IEEE International Conference on Computer Vision (ICCV)},
  year    = {2017},
  pages   = {706-715}
}
Most natural videos contain numerous events. […] Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions…
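The captioning module described above conditions each event's caption on contextual information from past and future events. The minimal PyTorch-style sketch below shows one plausible way such context could be fused before a caption decoder; it is not the authors' implementation, and the class name ContextFusion, the mean-pooling of neighboring events, and the parameters feat_dim and hidden_dim are assumptions made for illustration.

import torch
import torch.nn as nn


class ContextFusion(nn.Module):
    """Fuse an event's visual feature with pooled past/future event features (illustrative only)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Project [current event, past context, future context] into one caption input.
        self.proj = nn.Linear(3 * feat_dim, hidden_dim)

    def forward(self, event_feats: torch.Tensor) -> torch.Tensor:
        # event_feats: (num_events, feat_dim), events ordered by start time.
        n, d = event_feats.shape
        fused = []
        for i in range(n):
            past = event_feats[:i].mean(dim=0) if i > 0 else torch.zeros(d)
            future = event_feats[i + 1:].mean(dim=0) if i < n - 1 else torch.zeros(d)
            fused.append(torch.cat([event_feats[i], past, future], dim=0))
        # (num_events, hidden_dim): one context-aware input per event caption.
        return torch.relu(self.proj(torch.stack(fused)))


# Usage: five events with 512-d clip features feeding a captioning decoder.
feats = torch.randn(5, 512)
caption_inputs = ContextFusion(feat_dim=512, hidden_dim=256)(feats)
print(caption_inputs.shape)  # torch.Size([5, 256])

The mean pooling here stands in for whatever weighting the model actually learns over neighboring events; the point is only that each caption input sees the current event together with separate past and future summaries.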
607 Citations
Jointly Localizing and Describing Events for Dense Video Captioning
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This paper presents a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner.
Streamlined Dense Video Captioning
- Computer Science · 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A novel dense video captioning framework is proposed, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling.
An Efficient Framework for Dense Video Captioning
- Computer Science · AAAI
- 2020
This paper proposes a deep reinforcement-based approach which enables an agent to describe multiple events in a video by watching a portion of the frames, and reduces the computational cost by processing fewer frames while maintaining accuracy.
Multi-modal Dense Video Captioning
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2020
This paper shows how audio and speech modalities may improve a dense video captioning model, applying an automatic speech recognition system to obtain a temporally aligned textual description of the speech and treating it as a separate input alongside the video frames and the corresponding audio track.
Weakly Supervised Dense Event Captioning in Videos
- Computer Science · NeurIPS
- 2018
This paper formulates a new problem, weakly supervised dense event captioning, which does not require temporal segment annotations for model training, and presents a cycle system to train the model.
Video Captioning of Future Frames
- Computer Science · 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2021
The task of captioning future events is considered to assess the performance of intelligent models on anticipation and video description generation simultaneously, and it is demonstrated that the proposed method outperforms the baseline and is comparable to the oracle method.
Critic-based Attention Network for Event-based Video Captioning
- Computer Science · ACM Multimedia
- 2019
Experimental results show that the actor-critic architecture for event-based video captioning outperforms state-of-the-art video captioning methods.
Hierarchical Context Encoding for Events Captioning in Videos
- Computer Science · 2018 25th IEEE International Conference on Image Processing (ICIP)
- 2018
This paper proposes a novel pipeline for captioning each event in a video (dense captioning in videos), with an encoder working along the time axis that encodes videos and outputs features from different levels of hierarchical LSTMs.
SODA: Story Oriented Dense Video Captioning Evaluation Framework
- Computer Science · ECCV
- 2020
A new evaluation framework, the Story Oriented Dense video cAptioning evaluation framework (SODA), is proposed for measuring the performance of video story description systems, and it is shown that SODA tends to give lower scores than the current evaluation framework when captions are in the incorrect order.
End-to-End Dense Video Captioning with Masked Transformer
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This work proposes an end-to-end transformer model that employs a self-attention mechanism, enabling an efficient non-recurrent structure during encoding and leading to performance improvements.
References
Showing 1-10 of 67 references
Sequence to Sequence -- Video to Text
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e., a language model.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.
The Long-Short Story of Movie Description
- Computer Science · GCPR
- 2015
This work shows how to learn robust visual classifiers from the weak annotations of the sentence descriptions to generate a description using an LSTM and achieves the best performance to date on the challenging MPII-MD and M-VAD datasets.
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos
- Computer Science · 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos and introduces a learning framework to represent and retrieve activity proposals.
Describing Videos by Exploiting Temporal Structure
- Computer Science · 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
Dense Captioning with Joint Inference and Visual Context
- Computer Science · 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A new model pipeline based on two novel ideas, joint inference and context fusion, is proposed, which achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.
Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research
- Computer Science · ArXiv
- 2015
An automatic DVS segmentation and alignment method for movies is described, which enables scaling up the collection of a DVS-derived dataset with minimal human intervention.
Actions in context
- Computer Science · 2009 IEEE Conference on Computer Vision and Pattern Recognition
- 2009
This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.
Automatic annotation of human actions in video
- Computer Science · 2009 IEEE 12th International Conference on Computer Vision
- 2009
This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision, with a kernel-based discriminative clustering algorithm that locates actions in the weakly labeled training data.