Spatio-Temporal Person Retrieval via Natural Language Queries

  Masataka Yamaguchi, Kuniaki Saito, Y. Ushiku, Tatsuya Harada
  2017 IEEE International Conference on Computer Vision (ICCV)
In this paper, we address the problem of spatio-temporal person retrieval from videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) that encloses the person described by the query. Key Method: To retrieve the tube of the person described by a given natural language query, we design a model that combines methods for spatio-temporal human detection and multimodal retrieval. We conduct comprehensive experiments to compare a variety of tube and text…


Person Tube Retrieval via Language Description
Experimental results on person tube retrieval via language description and two other related tasks demonstrate the efficacy of Multi-Scale Structure Preservation (MSSP).
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
This paper localizes a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatial or temporal annotations during training, and proposes a new attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones.
Referring to Objects in Videos Using Spatio-Temporal Identifying Descriptions
A two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion is proposed and it is shown that motion modules help to ground motion-related words and also help to learn in appearance modules because modular neural networks resolve task interference between modules.
Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval
A novel binary representation learning framework, named Semantics-aware Spatial-temporal Binaries, which simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval and adopts an iterative optimization scheme to learn deep encoding functions with attribute-guided stochastic training.
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  • Rui Su, Dong Xu
  • 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021
This work proposes a one-stage visual-linguistic transformer based framework called STVGBert for the STVG task, which can simultaneously localize the target object in both spatial and temporal domains and is believed to be the first one-stage method which can handle the STVG task without relying on any pre-trained object detectors.
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task, which builds a spatio-temporal region graph to capture the region relationships with temporal object dynamics, which involves the implicit and explicit spatial subgraphs in each frame and the temporal dynamic subgraph across frames.
SBNet: Segmentation-based Network for Natural Language-based Vehicle Search
A deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval and two task-specific modules to improve performance are proposed: a substitution module that helps features from different domains to be embedded in the same space and a future prediction module that learns temporal information.
Person Search by Queried Description in Vietnamese Natural Language
Gated Neural Attention - Recurrent Neural Network (GNA-RNN) is employed to learn the affinity from pairs of description and image and then to estimate the similarity between query and images in the database.
Grounded Video Description
A novel video description model is proposed which is able to exploit bounding box annotations and achieves state-of-the-art performance on video description, video paragraph description, and image description, and it is demonstrated that the generated sentences are better grounded in the video.
Adversarial Attribute-Text Embedding for Person Search With Natural Language Query
A novel Adversarial Attribute-Text Embedding (AATE) network for person search with text query is proposed, in particular, a cross-modal adversarial learning module is proposed to learn discriminative and modality-invariant visual-textual features.


Natural Language Object Retrieval
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
Visual Semantic Search: Retrieving Videos via Complex Textual Queries
This paper first parses the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm, and learns the importance of each term using structured prediction.
Person Search with Natural Language Description
A Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search, and a large-scale person description dataset with detailed natural language annotations and person samples from various sources is collected.
Adding Semantics to Detectors for Video Retrieval
An automatic video retrieval method based on high-level concept detectors, i.e., a set of machine-learned concept detectors enriched with semantic descriptions and semantic structure obtained from WordNet, is presented, and the combined potential of these detectors using oracle fusion is discussed.
Object Instance Search in Videos via Spatio-Temporal Trajectory Discovery
The use of spatio-temporal cues to improve the quality of object instance search from videos is explored and the key bottleneck in applying this approach is solved by leveraging a randomized approach to enable fast scoring of any bounding boxes in the video volume.
Grounding of Textual Phrases in Images by Reconstruction
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates the effectiveness on the Flickr 30k Entities and ReferItGame datasets.
Video Google: a text retrieval approach to object matching in videos
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint…
Video search in concept subspace: a text-like paradigm
This paper proposes a video search framework which operates like searching text documents, and first selects a few related concepts for a given query by employing a tf-idf-like scheme, called c-tf-idf, to measure the informativeness of the concepts to this query.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…