Video Moment Localization using Object Evidence and Reverse Captioning
@article{Vidanapathirana2020VideoML,
  title   = {Video Moment Localization using Object Evidence and Reverse Captioning},
  author  = {Madhawa Vidanapathirana and Supriya Pandhre and Sonia Raychaudhuri and Anjali Khurana},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2006.10260}
}
We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging because language-based queries have no predefined activity classes and may contain complex descriptions. The current state-of-the-art model, MAC, addresses this by mining activity concepts from both the video and language modalities. This method encodes the semantic activity concepts from the verb/object pair in a…
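As a concrete illustration of the verb/object mining step described above, here is a minimal sketch of extracting verb/direct-object pairs from a language query with spaCy. The exact parser and extraction rules used by MAC are not specified here, so this is only an illustrative assumption:

```python
import spacy

# Illustrative only: MAC's actual extraction pipeline may differ.
nlp = spacy.load("en_core_web_sm")

def verb_object_pairs(query):
    """Return (verb lemma, direct-object lemma) pairs found in the query."""
    doc = nlp(query)
    return [(tok.head.lemma_, tok.lemma_)
            for tok in doc
            if tok.dep_ == "dobj" and tok.head.pos_ == "VERB"]

print(verb_object_pairs("a woman opens the door and picks up the phone"))
# e.g. [('open', 'door'), ('pick', 'phone')]
```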
One Citation
Efficient Proposal Generation with U-shaped Network for Temporal Sentence Grounding
- Computer Science, MMAsia
- 2021
A novel temporal sentence grounding model with a U-shaped network for efficient proposal generation (UN-TSG), which uses the U-shaped structure to encode proposals of different lengths hierarchically.
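The summary above suggests a 1D U-Net over clip features; the following PyTorch sketch shows one way such a hierarchy could encode proposals of different temporal lengths. Layer sizes, fusion by addition, and the scoring head are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class UShapedProposals(nn.Module):
    """Minimal sketch: coarser levels of the U correspond to longer candidate moments."""
    def __init__(self, dim=256, levels=3):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(levels))
        self.up = nn.ModuleList(
            nn.ConvTranspose1d(dim, dim, kernel_size=4, stride=2, padding=1) for _ in range(levels))
        self.score = nn.Conv1d(dim, 1, kernel_size=1)  # per-position proposal confidence

    def forward(self, x):          # x: (batch, dim, num_clips), num_clips divisible by 2**levels
        skips = []
        for down in self.down:     # contracting path: halve temporal resolution at each level
            skips.append(x)
            x = torch.relu(down(x))
        for up, skip in zip(self.up, reversed(skips)):  # expanding path with skip connections
            x = torch.relu(up(x)) + skip
        return self.score(x)       # (batch, 1, num_clips)
```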
References
MAC: Mining Activity Concepts for Language-Based Temporal Localization
- Computer Science, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
- 2019
The novel ACL encodes semantic concepts from verb-object pairs in language queries and leverages activity classifiers' prediction scores to encode visual concepts; ACL is shown to significantly outperform state-of-the-art methods under the widely used metric.
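The language side of ACL mines verb-object pairs as in the sketch above; the visual side uses an activity classifier's prediction scores as concepts. A minimal sketch of the visual part, with the feature size and class count chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration: a 4096-d clip feature and 200 activity classes.
activity_classifier = nn.Linear(4096, 200)
clip_feature = torch.randn(1, 4096)
visual_concepts = activity_classifier(clip_feature).softmax(dim=-1)  # one probability per activity class
```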
TALL: Temporal Activity Localization via Language Query
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
A novel Cross-modal Temporal Regression Localizer (CTRL) is proposed to jointly model text queries and video clips, outputting alignment scores and action boundary regression results for candidate clips; experimental results show that CTRL significantly outperforms previous methods on both datasets.
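A CTRL-style head can be sketched as follows. The fusion (elementwise product, sum, and concatenation) follows the paper's description only at a high level, and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class CtrlHead(nn.Module):
    """Sketch: fuse a clip feature with a sentence feature, then predict an
    alignment score and start/end boundary offsets for the candidate clip."""
    def __init__(self, dim=1024):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU())
        self.align = nn.Linear(dim, 1)   # alignment score
        self.reg = nn.Linear(dim, 2)     # start/end offset regression

    def forward(self, clip, sent):       # both: (batch, dim)
        h = self.fuse(torch.cat([clip * sent, clip + sent, clip, sent], dim=-1))
        return self.align(h), self.reg(h)
```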
Scene Parsing through ADE20K Dataset
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced, and it is shown that the trained scene parsing networks can enable applications such as image content removal and scene synthesis.
TSM: Temporal Shift Module for Efficient Video Understanding
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A generic and effective Temporal Shift Module (TSM) that achieves the performance of a 3D CNN while maintaining a 2D CNN's complexity; TSM is also extended to the online setting, enabling real-time, low-latency online video recognition and video object detection.
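The shift operation itself is simple enough to show directly; a minimal sketch, where the tensor layout and the shift fraction are the commonly used defaults, taken here as assumptions:

```python
import torch

def temporal_shift(x, shift_div=8):
    """Shift a fraction of channels one step forward/backward along time.

    x: (batch, time, channels, height, width)
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unchanged
    return out
```

Because the shift moves data rather than multiplying it, it adds temporal modeling to a 2D CNN at essentially zero extra FLOPs.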
RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Computer Science, ArXiv
- 2019
It is found that BERT was significantly undertrained and, with better pretraining choices, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Computer Science, NAACL
- 2019
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
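The "one additional output layer" pattern is easy to make concrete; a minimal fine-tuning sketch using the Hugging Face transformers library, where the task, label count, and checkpoint are placeholders:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
head = nn.Linear(encoder.config.hidden_size, 2)  # one output layer, e.g. 2-way classification

inputs = tokenizer("a person opens the door", return_tensors="pt")
outputs = encoder(**inputs)
logits = head(outputs.last_hidden_state[:, 0])   # classify from the [CLS] position
```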
C3D: Generic Features for Video Analysis
- Computer Science, ArXiv
- 2014
Convolution 3D (C3D) features are proposed: generic spatio-temporal features obtained by training a deep 3-dimensional convolutional network on a large annotated video dataset comprising objects, scenes, actions, and other frequently occurring concepts; they encapsulate appearance and motion cues and perform well on different video classification tasks.
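In PyTorch terms, the core building block is a 3D convolution over (frames, height, width); a minimal sketch with the commonly cited 16-frame, 112x112 input, where the exact sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

# One C3D-style block: 3x3x3 convolution followed by spatial-only pooling.
conv = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
pool = nn.MaxPool3d(kernel_size=(1, 2, 2))       # pool space but not time in the first block

clip = torch.randn(1, 3, 16, 112, 112)           # (batch, RGB, frames, height, width)
features = pool(torch.relu(conv(clip)))          # (1, 64, 16, 56, 56)
```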