Corpus ID: 236133968

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

@inproceedings{Lei2021QVHighlightsDM,
  title={QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries},
  author={Jie Lei and Tamara L. Berg and Mohit Bansal},
  booktitle={NeurIPS},
  year={2021}
}
Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos… 
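For readers who want to work with the dataset programmatically, the sketch below shows one way to load QVHighlights-style annotations. The JSONL layout and field names (qid, query, vid, duration, relevant_windows, saliency_scores) follow the public release but should be treated as assumptions here, and the file name is illustrative.

```python
import json

def load_qvhighlights_jsonl(path):
    """Load QVHighlights-style annotations from a JSONL file.

    Assumed per-line fields (following the public release; names may differ):
      qid              - unique query id
      query            - natural language query text
      vid              - source video id
      duration         - video length in seconds
      relevant_windows - list of [start_sec, end_sec] moment spans
      saliency_scores  - per-clip highlight scores from multiple annotators
    """
    annotations = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                annotations.append(json.loads(line))
    return annotations

# Example (file name assumed): collect (query, moment spans) pairs
# for moment retrieval training.
# anns = load_qvhighlights_jsonl("highlight_train_release.jsonl")
# pairs = [(a["query"], a["relevant_windows"]) for a in anns]
```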
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
TLDR
This paper presents the first framework to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and tackles moment retrieval as a keypoint detection problem using a novel query generator and query decoder.
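As an illustration of what "moment retrieval as keypoint detection" can look like in 1D, the sketch below predicts a per-clip center heatmap and a width. This is an illustrative head only, with invented module names; it is not the query generator and query decoder described in the UMT paper.

```python
import torch
import torch.nn as nn

class KeypointMomentHead(nn.Module):
    """Illustrative 1D 'keypoint detection' head for moment retrieval.

    Given per-clip fused video-query features (B, T, D), predict:
      - a center heatmap over the T clips (where a moment is centered), and
      - a log-width regression per clip (how long that moment is).
    This mirrors the general keypoint-detection idea only; it is NOT
    the architecture described in the UMT paper.
    """
    def __init__(self, dim):
        super().__init__()
        self.center = nn.Linear(dim, 1)  # per-clip center logit
        self.width = nn.Linear(dim, 1)   # per-clip log-width

    def forward(self, feats):
        center_logits = self.center(feats).squeeze(-1)  # (B, T)
        log_width = self.width(feats).squeeze(-1)       # (B, T)
        return center_logits, log_width

def decode_top_moment(center_logits, log_width, clip_len=2.0):
    """Turn the highest-scoring center into a [start, end] span in seconds."""
    t = center_logits.argmax(dim=-1)                          # (B,)
    w = log_width.gather(1, t.unsqueeze(1)).squeeze(1).exp()  # width in clips
    center_sec = (t.float() + 0.5) * clip_len
    half = 0.5 * w * clip_len
    return torch.stack([center_sec - half, center_sec + half], dim=-1)
```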
The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions
TLDR
This survey attempts to provide a summary of fundamental concepts in TSGV and the current research status, as well as future research directions, and constructs a taxonomy of TSGV techniques, elaborating on methods in different categories with their strengths and weaknesses.
AssistSR: Affordance-centric Question-driven Video Segment Retrieval
TLDR
A straightforward yet effective model called Dual Multimodal Encoders (DME) is developed; it significantly outperforms several baseline methods while still leaving large room for future improvement.
ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022
TLDR
This report proposes a multi-scale cross-modal transformer and a video frame-level contrastive loss to fully uncover the correlation between language queries and video clips and proposes two data augmentation strategies to increase the diversity of training samples.
AssistSR: Task-oriented Question-driven Video Segment Retrieval
TLDR
A straightforward yet effective model called Dual Multimodal Encoders (DME) is developed that significantly outperforms several baseline methods while still leaving large room for future improvement.
Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
TLDR
A visual-prompt text span localizing (VPTSL) method is proposed, which introduces timestamped subtitles as a passage to perform text span localization for the input text question, and prompts the visual highlight features into a pre-trained language model (PLM) to enhance the joint semantic representations.
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
TLDR
This work proposes to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process, and achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo and Charades.
TubeDETR: Spatio-Temporal Video Grounding with Transformers
TLDR
TubeDETR is proposed, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. It includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames, and a space-time decoder that jointly performs spatio-temporal localization.
VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling
TLDR
This work presents VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs, and designs a new pretraining task, Masked Visual-token Modeling (MVM), for better video modeling.
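The masked visual-token modeling objective can be sketched as a masked-prediction loss over discrete visual tokens. This is a sketch of the general idea only; the tokenizer and masking scheme are assumptions rather than VIOLET's exact recipe.

```python
import torch
import torch.nn as nn

def masked_visual_token_loss(logits, token_ids, mask):
    """Illustrative masked visual-token modeling (MVM) objective.

    logits:    (B, N, V) model predictions over a visual-token vocabulary
               for N video patches, produced by a video-language transformer.
    token_ids: (B, N) discrete visual-token targets for each patch
               (e.g., from a pretrained image/video tokenizer).
    mask:      (B, N) boolean, True where the patch was masked in the input.
    Only masked positions contribute to the loss, as in masked language
    modeling. Sketch of the general idea, not VIOLET's exact recipe.
    """
    vocab = logits.size(-1)
    return nn.functional.cross_entropy(
        logits[mask].view(-1, vocab), token_ids[mask].view(-1)
    )
```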

References

Showing 1-10 of 54 references
Localizing Moments in Video with Natural Language
TLDR
The Moment Context Network (MCN) is proposed, which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
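A minimal sketch of the moment-ranking idea behind this line of work: embed each candidate moment from local clip features plus global video context, then rank candidates by similarity to the query embedding. Feature choices and function names here are illustrative, not MCN's exact design.

```python
import torch
import torch.nn.functional as F

def score_moments(query_emb, clip_feats, candidates):
    """Rank candidate moments against a sentence embedding.

    query_emb:  (D,) embedded query sentence
    clip_feats: (T, D) per-clip visual features (already projected to D)
    candidates: list of (start, end) clip-index pairs
    Returns candidates sorted by similarity to the query (best first).
    Illustrative moment-ranking sketch, not the exact MCN model.
    """
    global_feat = clip_feats.mean(dim=0)  # global video context
    scored = []
    for s, e in candidates:
        local_feat = clip_feats[s:e + 1].mean(dim=0)    # local moment feature
        moment_feat = 0.5 * (local_feat + global_feat)  # combine local + global
        sim = F.cosine_similarity(query_emb, moment_feat, dim=0)
        scored.append(((s, e), sim.item()))
    return sorted(scored, key=lambda x: -x[1])
```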
Find and Focus: Retrieve and Localize Video Events with Natural Language Queries
TLDR
This work proposes a new framework called Find and Focus (FIFO), which not only performs top-level matching (paragraph vs. video), but also makes part-level associations, localizing a video clip for each sentence in the query with the help of a focusing guide.
Temporal Localization of Moments in Video Collections with Natural Language
TLDR
The CAL model outperforms the recently proposed Moment Context Network on all criteria across all datasets on the proposed task, obtaining an 8%-85% and 11%-47% boost for average recall and median rank, respectively, and achieves 5x faster retrieval and 8x smaller index size with a 500K video corpus.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on this dataset.
Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
TLDR
A 2D Temporal Adjacent Network (2D-TAN) is proposed, a single-shot framework for moment localization that is capable of encoding adjacent temporal relations while learning discriminative features for matching video moments with referring expressions.
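The central 2D-map idea, indexing every candidate moment by its (start clip, end clip) pair and pooling clip features onto that grid, can be sketched as follows. This is an illustrative sketch; 2D-TAN itself adds stacked convolutions and sparse sampling over this map.

```python
import torch

def build_2d_moment_map(clip_feats):
    """Pool per-clip features into a 2D map of candidate moments.

    clip_feats: (T, D) features for T equal-length clips.
    Returns map2d of shape (T, T, D), where map2d[i, j] represents the
    candidate moment spanning clips i..j (valid only for j >= i).
    Illustrative of the 2D temporal-map idea; 2D-TAN itself stacks
    convolutions over this map and samples it sparsely.
    """
    T, D = clip_feats.shape
    # Prefix sums let us average any span i..j in O(1).
    prefix = torch.cat([torch.zeros(1, D), clip_feats.cumsum(dim=0)], dim=0)
    map2d = torch.zeros(T, T, D)
    for i in range(T):
        for j in range(i, T):
            map2d[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return map2d

# A matching head would then score map2d against the query embedding and
# pick the (i, j) cell with the highest score as the retrieved moment.
```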
Ranking Domain-Specific Highlights by Analyzing Edited Videos
TLDR
This work presents a fully automatic system for ranking domain-specific highlights in unconstrained personal videos by analyzing online edited videos and shows that impressive highlights can be retrieved without additional human supervision for domains like skating, surfing, skiing, gymnastics, parkour, and dog activity.
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
TLDR
The proposed XML model uses a late fusion design with a novel Convolutional Start-End detector (ConvSE), surpassing baselines by a large margin and with better efficiency, providing a strong starting point for future work.
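The start-end detection idea behind ConvSE, sliding learned 1D filters over a per-clip query-video similarity signal so that rising and falling edges become start and end scores, can be sketched as below. Kernel sizes and the surrounding XML architecture are simplified assumptions.

```python
import torch
import torch.nn as nn

class ConvStartEnd(nn.Module):
    """Minimal sketch of a convolutional start-end detector.

    Given a per-clip query-video similarity signal (B, T), learned 1D
    filters act like edge detectors: one produces start scores, the
    other end scores. Simplified relative to the ConvSE module in XML.
    """
    def __init__(self, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.start_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)
        self.end_conv = nn.Conv1d(1, 1, kernel_size, padding=pad)

    def forward(self, sim):                           # sim: (B, T)
        x = sim.unsqueeze(1)                          # (B, 1, T)
        start_scores = self.start_conv(x).squeeze(1)  # (B, T)
        end_scores = self.end_conv(x).squeeze(1)      # (B, T)
        return start_scores, end_scores

# At inference, the top-scoring (start, end) pair with start <= end gives
# the retrieved moment within each candidate video.
```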
Temporal Query Networks for Fine-grained Video Understanding
TLDR
A new model, a Temporal Query Network, is proposed; it enables query-response functionality and a structural understanding of fine-grained actions in untrimmed videos, and is compared to other architectures and text supervision methods, with their pros and cons analyzed.
Span-based Localizing Network for Natural Language Video Localization
TLDR
This work proposes a video span localizing network (VSLNet), on top of the standard span-based QA framework, to address NLVL, and tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy.
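The span-based formulation, treating the video as the "passage" and predicting start/end distributions over clips from query-conditioned features, can be sketched with a minimal prediction head. The sketch is illustrative and omits VSLNet's query-guided highlighting module.

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """Minimal span-based head for natural language video localization.

    Takes query-conditioned clip features (B, T, D) and predicts start and
    end distributions over the T clips, as in span-based QA. Illustrative
    sketch only; it omits VSLNet's query-guided highlighting module.
    """
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, feats):
        start_logits = self.start(feats).squeeze(-1)  # (B, T)
        end_logits = self.end(feats).squeeze(-1)      # (B, T)
        return start_logits, end_logits

def best_span(start_logits, end_logits):
    """Pick the (start, end) pair maximizing start_prob * end_prob, start <= end."""
    p_s = start_logits.softmax(-1)               # (B, T)
    p_e = end_logits.softmax(-1)                 # (B, T)
    joint = p_s.unsqueeze(2) * p_e.unsqueeze(1)  # (B, T, T)
    joint = joint.triu()                         # enforce start <= end
    T = joint.size(-1)
    flat = joint.flatten(1).argmax(-1)
    start = torch.div(flat, T, rounding_mode="floor")
    end = flat % T
    return torch.stack([start, end], dim=-1)     # clip indices
```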
ExCL: Extractive Clip Localization Using Natural Language Descriptions
TLDR
This work proposes a novel extractive approach that predicts the start and end frames by leveraging cross-modal interactions between the text and video - this removes the need to retrieve and re-rank multiple proposal segments.