Corpus ID: 236428887

Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation

  • Jiabo Huang, Yang Liu, Shaogang Gong, Hailin Jin
  • Published 2021
  • Computer Science
  • ArXiv
Video activity localisation has recently attracted increasing attention due to its practical value in automatically localising the most salient visual segments corresponding to their language descriptions (sentences) in untrimmed and unstructured videos. For supervised model training, a temporal annotation of both the start and end time index of each video segment for a sentence (a video moment) must be given. This is not only very expensive but also sensitive to ambiguity and subjective…


A Survey on Temporal Sentence Grounding in Videos
This survey gives a comprehensive overview of existing TSGV approaches, provides a detailed description of the evaluation protocols used in TSGV, and discusses in depth the potential problems of current benchmarking designs and research directions for further investigation.


TALL: Temporal Activity Localization via Language Query
A novel Cross-modal Temporal Regression Localizer (CTRL) is proposed to jointly model text queries and video clips, outputting alignment scores and action boundary regression results for candidate clips. Experimental results show that CTRL significantly outperforms previous methods on both datasets.
Weakly Supervised Video Moment Retrieval From Text Queries
This work proposes a joint visual-semantic embedding framework that uses Text-Guided Attention (TGA) to learn the notion of relevant segments from video with only video-level sentence descriptions.
Semantic Proposal for Activity Localization in Videos via Sentence Query
This paper proposes a novel Semantic Activity Proposal (SAP) method, which integrates the semantic information of sentence queries into the proposal generation process to obtain discriminative activity proposals, and evaluates the algorithm on the TACoS and Charades-STA datasets.
Local-Global Video-Text Interactions for Temporal Grounding
This paper addresses the problem of text-to-video temporal grounding with a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query, which correspond to the important semantic entities described in the query.
Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language
A novel TALL method is proposed that builds a Hierarchical Visual-Textual Graph to model interactions between objects and words, as well as among objects, to jointly understand the video content and the language.
Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
A novel semantic conditioned dynamic modulation mechanism is proposed, which leverages sentence semantics to modulate temporal convolution operations for better correlating and composing the sentence-relevant video content over time.
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
This paper proposes a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training, and devises a proposal generation module that aggregates context information to generate and score all candidate proposals in a single pass.
Cross-modal Moment Localization in Videos
The proposed model, a language-temporal attention network, learns word attention based on the temporal context information in the video and can automatically select "what words to listen to" for localizing the desired moment.
LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval
An efficient Latent Graph Co-Attention Network (LoGAN) is proposed that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work.
Localizing Natural Language in Videos
A localizing network (LNet), working in an end-to-end fashion, is proposed to tackle the NLVL task: it first matches the natural-language sentence and the video sequence via cross-gated attended recurrent networks to exploit their fine-grained interactions, and then generates a sentence-aware video representation.