Corpus ID: 219792504

Video Moment Localization using Object Evidence and Reverse Captioning

@article{Vidanapathirana2020VideoML,
  title={Video Moment Localization using Object Evidence and Reverse Captioning},
  author={Madhawa Vidanapathirana and Supriya Pandhre and Sonia Raychaudhuri and Anjali Khurana},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.10260}
}
We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging because language-based queries have no predefined activity classes and may contain complex descriptions. The current state-of-the-art model, MAC, addresses this by mining activity concepts from both the video and language modalities. This method encodes the semantic activity concepts from the verb/object pairs in a…
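For illustration, the snippet below is a minimal sketch of the kind of verb/object-pair mining the abstract refers to; it is not the authors' implementation. It assumes spaCy with its en_core_web_sm model, and the function name verb_object_pairs is hypothetical.

```python
# Hypothetical sketch of mining verb/object pairs from a language query.
# Assumes spaCy and the "en_core_web_sm" model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(query: str):
    """Return (verb lemma, direct-object lemma) pairs found in a query."""
    doc = nlp(query)
    pairs = []
    for token in doc:
        # A direct object whose head is a verb gives one activity concept.
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            pairs.append((token.head.lemma_, token.lemma_))
    return pairs

print(verb_object_pairs("The person opens the fridge and pours a glass of milk."))
# e.g. [('open', 'fridge'), ('pour', 'glass')]
```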

Citations

Efficient Proposal Generation with U-shaped Network for Temporal Sentence Grounding

A novel temporal sentence grounding model with a U-shaped Network for efficient proposal generation (UN-TSG), which utilizes a U-shaped structure to encode proposals of different lengths hierarchically.

References


MAC: Mining Activity Concepts for Language-Based Temporal Localization

The novel ACL encodes the semantic concepts from verb-object pairs in language queries and leverages activity classifiers' prediction scores to encode visual concepts, and it is shown that ACL significantly outperforms state-of-the-art methods under the widely used metric.

TALL: Temporal Activity Localization via Language Query

A novel Cross-modal Temporal Regression Localizer (CTRL) is proposed to jointly model the text query and video clips, outputting alignment scores and action-boundary regression results for candidate clips; experimental results show that CTRL significantly outperforms previous methods on both datasets.
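As a hedged PyTorch sketch of the CTRL-style idea summarized above: fuse a clip feature with a query feature, then predict an alignment score and start/end boundary offsets. The dimensions, fusion choice, and the class name CrossModalRegressor are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative cross-modal alignment + boundary-regression head (not CTRL's exact design).
import torch
import torch.nn as nn

class CrossModalRegressor(nn.Module):
    def __init__(self, video_dim=4096, text_dim=768, hidden=1024):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Head outputs: [alignment score, start offset, end offset]
        self.head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 3))

    def forward(self, clip_feat, query_feat):
        v = torch.relu(self.video_proj(clip_feat))
        q = torch.relu(self.text_proj(query_feat))
        fused = torch.cat([v * q, v + q], dim=-1)  # simple multiplicative/additive fusion
        out = self.head(fused)
        return out[..., 0], out[..., 1:]           # score, (start, end) offsets

# Example: score 10 candidate clips against one query.
clips = torch.randn(10, 4096)
query = torch.randn(1, 768).expand(10, -1)
score, offsets = CrossModalRegressor()(clips, query)
```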

Scene Parsing through ADE20K Dataset

The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.

TSM: Temporal Shift Module for Efficient Video Understanding

A generic and effective Temporal Shift Module (TSM) that can achieve the performance of a 3D CNN while maintaining a 2D CNN's complexity; TSM is also extended to the online setting, which enables real-time, low-latency online video recognition and video object detection.
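The shift operation TSM is built around is simple enough to sketch. The snippet below is an illustrative PyTorch version assuming the commonly cited 1/8 channel split; it is not the authors' code.

```python
# Illustrative temporal shift: move a fraction of channels forward/backward
# along the time axis so a 2D CNN can exchange information between frames
# at zero extra FLOPs. The 1/8 split is an assumption from common usage.
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (batch, time, channels, height, width)."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift: future frame -> current
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift: past frame -> current
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels untouched
    return out

frames = torch.randn(2, 8, 64, 56, 56)  # 2 clips, 8 frames each
shifted = temporal_shift(frames)
```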

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, with an improved pretraining recipe, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
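A hedged sketch of the "one additional output layer" fine-tuning pattern described above, using the Hugging Face transformers library (an assumption; any comparable BERT interface would work the same way):

```python
# Illustrative fine-tuning setup: a pretrained BERT encoder plus one linear layer.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)  # the single task-specific layer

batch = tokenizer(["a person opens the fridge"], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state[:, 0]       # [CLS] representation
logits = classifier(hidden)                             # encoder + classifier trained jointly
```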

C3D: Generic Features for Video Analysis

The Convolutional 3D (C3D) feature is proposed: a generic spatio-temporal feature obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts; the features encapsulate appearance and motion cues and perform well on different video classification tasks.
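A toy illustration of why 3D convolutions yield generic spatio-temporal features: each filter slides over time, height, and width, so it responds to both appearance and motion. The stand-in network below is an assumption for illustration only, not the C3D architecture.

```python
# Minimal stand-in for a 3D-convolutional clip-feature extractor (not C3D itself).
import torch
import torch.nn as nn

extractor = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),
    nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),   # pool space and time into one clip-level descriptor
    nn.Flatten(),
)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, 16 frames, H, W)
features = extractor(clip)              # shape: (1, 128) clip feature
```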