Dense-Captioning Events in Videos

@article{Krishna2017DenseCaptioningEI,
  title={Dense-Captioning Events in Videos},
  author={Ranjay Krishna and Kenji Hata and Frederic Ren and Li Fei-Fei and Juan Carlos Niebles},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={706-715}
}
Most natural videos contain numerous events. [...] Key Method: Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions…
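The abstract above describes a captioning module that conditions each event's description on past and future events. As a rough, hedged illustration only (not the authors' implementation; every module name, dimension, and pooling choice below is a hypothetical assumption), the PyTorch-style sketch shows one way such context conditioning could be wired up:

import torch
import torch.nn as nn

class ContextAwareCaptioner(nn.Module):
    # Hypothetical sketch: captions one event while conditioning on
    # mean-pooled features of past and future event proposals.
    def __init__(self, feat_dim=500, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fuse = nn.Linear(3 * feat_dim, hidden_dim)   # past + current + future
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, cur_feat, past_feats, future_feats, tokens):
        # cur_feat: (feat_dim,); past_feats/future_feats: (n, feat_dim); tokens: (T,) long
        past_ctx = past_feats.mean(0) if past_feats.shape[0] > 0 else torch.zeros_like(cur_feat)
        fut_ctx = future_feats.mean(0) if future_feats.shape[0] > 0 else torch.zeros_like(cur_feat)
        visual = torch.tanh(self.fuse(torch.cat([past_ctx, cur_feat, fut_ctx])))
        h = torch.zeros(1, self.hidden_dim)
        c = torch.zeros(1, self.hidden_dim)
        logits = []
        for t in range(tokens.shape[0]):
            word = self.embed(tokens[t]).unsqueeze(0)      # (1, embed_dim)
            h, c = self.lstm(torch.cat([word, visual.unsqueeze(0)], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                  # (1, T, vocab_size)

For example, calling the module with a (500,) current-event feature, a (2, 500) tensor of past proposals, an empty (0, 500) tensor of future proposals, and a (T,) token tensor returns per-step vocabulary logits. The paper's captioning module derives its past and future context from the other events' representations rather than this simple mean pooling; the sketch only illustrates the conditioning idea.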
Jointly Localizing and Describing Events for Dense Video Captioning
TLDR
This paper presents a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner.
Streamlined Dense Video Captioning
TLDR
A novel dense video captioning framework is proposed, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling.
An Efficient Framework for Dense Video Captioning
TLDR
This paper proposes a deep reinforcement-based approach which enables an agent to describe multiple events in a video by watching a portion of the frames, and reduces the computational cost by processing fewer frames while maintaining accuracy.
Multi-modal Dense Video Captioning
  • Vladimir Iashin, Esa Rahtu
  • Computer Science, Engineering
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
  • 2020
TLDR
This paper shows how audio and speech modalities may improve a dense video captioning model; it applies an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech and treats it as a separate input alongside the video frames and the corresponding audio track.
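As a hedged illustration of the idea summarized above (treating ASR-derived speech text as a third input stream next to visual and audio features), the following sketch uses hypothetical names and dimensions and is not the paper's code:

import torch
import torch.nn as nn

class MultiModalEventEncoder(nn.Module):
    # Hypothetical sketch: fuses visual, audio, and ASR-text features of one event.
    def __init__(self, visual_dim=1024, audio_dim=128, text_dim=300, hidden_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.speech_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, visual_feats, audio_feats, speech_feats):
        # Each input: (T_modality, dim). Mean-pool each stream, project it, and
        # concatenate into one joint representation a caption decoder could consume.
        fused = torch.cat([
            self.visual_proj(visual_feats.mean(0)),
            self.audio_proj(audio_feats.mean(0)),
            self.speech_proj(speech_feats.mean(0)),
        ])
        return fused  # shape: (3 * hidden_dim,)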
Weakly Supervised Dense Event Captioning in Videos
TLDR
This paper formulates a new problem, weakly supervised dense event captioning, which does not require temporal segment annotations for model training, and presents a cycle system to train the model.
Joint Event Detection and Description in Continuous Video Streams
TLDR
The Joint Event Detection and Description Network (JEDDi-Net) is proposed, which solves the dense video captioning task in an end-to-end fashion and presents the first dense captioning results on the TACoS-MultiLevel dataset.
Video Captioning of Future Frames
  • M. Hosseinzadeh, Yang Wang
  • Computer Science
  • 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2021
TLDR
The task of captioning future events is considered to assess the performance of intelligent models on anticipation and video description generation simultaneously, and it is demonstrated that the proposed method outperforms the baseline and is comparable to the oracle method.
Critic-based Attention Network for Event-based Video Captioning
TLDR
Experimental results show that the proposed actor-critic architecture for event-based video captioning outperforms state-of-the-art video captioning methods.
Hierarchical Context Encoding for Events Captioning in Videos
  • Dali Yang, C. Yuan
  • Computer Science
  • 2018 25th IEEE International Conference on Image Processing (ICIP)
  • 2018
TLDR
This paper proposes a novel pipeline for captioning each event in one video (dense captioning in videos) and introduces an encoder working along the time axis, which encodes videos and outputs features from different levels of hierarchical LSTMs.
SODA: Story Oriented Dense Video Captioning Evaluation Framework
TLDR
A new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), is proposed for measuring the performance of video story description systems, and it is shown that SODA tends to give lower scores than the current evaluation framework when evaluating captions presented in the incorrect order.

References

Showing 1-10 of 69 references
Sequence to Sequence -- Video to Text
TLDR
A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
TLDR
An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.
The Long-Short Story of Movie Description
TLDR
This work shows how to learn robust visual classifiers from the weak annotations of the sentence descriptions to generate a description using an LSTM and achieves the best performance to date on the challenging MPII-MD and M-VAD datasets.
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos
TLDR
This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos and introduces a learning framework to represent and retrieve activity proposals.
Temporal Localization of Actions with Actoms
TLDR
This work proposes a model based on a sequence of atomic action units, termed "actoms," that are semantically meaningful and characteristic of the action; it outperforms the current state of the art in temporal action localization, as well as baselines that localize actions with a sliding-window method.
Describing Videos by Exploiting Temporal Structure
TLDR
This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TLDR
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
TVSum: Summarizing web videos using titles
TLDR
A novel co-archetypal analysis technique is developed that learns canonical visual concepts shared between video and images, but not in either alone, by finding a joint-factorial representation of two data sets.
Dense Captioning with Joint Inference and Visual Context
TLDR
A new model pipeline based on two novel ideas, joint inference and context fusion, is proposed, which achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.