Dense-Captioning Events in Videos

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles
2017 IEEE International Conference on Computer Vision (ICCV)

Most natural videos contain numerous events. The model introduces a variant of an existing proposal module designed to capture both short events and long events spanning minutes. To capture dependencies between the events in a video, it introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. The paper also introduces ActivityNet Captions, a large-scale benchmark for dense-captioning events.

Jointly Localizing and Describing Events for Dense Video Captioning

This paper presents a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner.

Streamlined Dense Video Captioning

A novel dense video captioning framework is proposed, which models temporal dependency across events in a video explicitly and leverages visual and linguistic context from prior events for coherent storytelling.

An Efficient Framework for Dense Video Captioning

This paper proposes a deep reinforcement-based approach which enables an agent to describe multiple events in a video by watching a portion of the frames, and reduces the computational cost by processing fewer frames while maintaining accuracy.

Multi-modal Dense Video Captioning

This paper shows how audio and speech modalities may improve a dense video captioning model, applying an automatic speech recognition system to obtain a temporally aligned textual description of the speech and treating it as a separate input alongside the video frames and the corresponding audio track.

Weakly Supervised Dense Event Captioning in Videos

This paper formulates a new problem: weakly supervised dense event captioning, which does not require temporal segment annotations for model training and presents a cycle system to train the model.

Video Captioning of Future Frames

The task of captioning future events to assess the performance of intelligent models on anticipation and video description generation tasks simultaneously is considered and it is demonstrated that the proposed method outperforms the baseline and is comparable to the oracle method.

Critic-based Attention Network for Event-based Video Captioning

Experimental results show that the proposed actor-critic architecture for event-based video captioning outperforms state-of-the-art video captioning methods.

Hierarchical Context Encoding for Events Captioning in Videos

  • Dali Yang, C. Yuan
  • 2018 25th IEEE International Conference on Image Processing (ICIP)
This paper proposes a novel pipeline for captioning each event in a video (dense captioning in videos) and introduces an encoder that works along the time axis, encoding videos and outputting features from different levels of hierarchical LSTMs.

SODA: Story Oriented Dense Video Captioning Evaluation Framework

A new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), is proposed for measuring the performance of video story description systems, and it is shown that SODA tends to give lower scores than the current evaluation framework when evaluating captions presented in the incorrect order.

End-to-End Dense Video Captioning with Masked Transformer

This work proposes an end-to-end transformer model that employs a self-attention mechanism, enabling an efficient non-recurrent structure during encoding and leading to performance improvements.

Sequence to Sequence -- Video to Text

A novel end-to-end sequence-to-sequence model to generate captions for videos that is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model.

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.

Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks

An approach that exploits hierarchical Recurrent Neural Networks to tackle the video captioning problem, i.e., generating one or multiple sentences to describe a realistic video, significantly outperforms the current state-of-the-art methods.

The Long-Short Story of Movie Description

This work shows how to learn robust visual classifiers from the weak annotations of the sentence descriptions to generate a description using an LSTM and achieves the best performance to date on the challenging MPII-MD and M-VAD datasets.

Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos

This paper introduces a proposal method that aims to recover temporal segments containing actions in untrimmed videos, along with a learning framework to represent and retrieve activity proposals.

Describing Videos by Exploiting Temporal Structure

This work proposes an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions, and proposes a temporal attention mechanism that allows the model to go beyond local temporal modeling and learn to automatically select the most relevant temporal segments given the text-generating RNN.
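As a rough illustration of the general idea behind temporal soft attention (a minimal sketch with made-up weight names, not the paper's actual implementation), frame features can be pooled with weights conditioned on the decoder state:

```python
import numpy as np

def temporal_attention(frame_feats, hidden, W_f, W_h, w):
    """Soft attention over frame features, conditioned on the decoder
    hidden state. All parameter names here are illustrative assumptions.
    frame_feats: (T, D) per-frame features; hidden: (H,) decoder state."""
    # Relevance score per frame: w^T tanh(W_f f_t + W_h h)
    scores = np.tanh(frame_feats @ W_f + hidden @ W_h) @ w   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over time
    # Context vector: attention-weighted average of the frame features
    return weights @ frame_feats, weights

rng = np.random.default_rng(0)
T, D, H, A = 8, 16, 32, 24  # frames, feature dim, hidden dim, attention dim
ctx, w_t = temporal_attention(
    rng.normal(size=(T, D)), rng.normal(size=H),
    rng.normal(size=(D, A)), rng.normal(size=(H, A)), rng.normal(size=A),
)
```

The attention weights sum to one over the time axis, so the context vector is a convex combination of frame features; recomputing it at every decoding step is what lets the generator focus on different temporal segments for different words.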

Dense Captioning with Joint Inference and Visual Context

A new model pipeline based on two novel ideas, joint inference and context fusion, is proposed, which achieves state-of-the-art accuracy on Visual Genome for dense captioning with a relative gain of 73% compared to the previous best algorithm.

Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research

An automatic DVS segmentation and alignment method for movies is described, that enables us to scale up the collection of a DVS-derived dataset with minimal human intervention.

Actions in context

This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition, demonstrating improved recognition of both in natural video.

Automatic annotation of human actions in video

This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data.