Spatio-Temporal Person Retrieval via Natural Language Queries
@article{Yamaguchi2017SpatioTemporalPR,
  title={Spatio-Temporal Person Retrieval via Natural Language Queries},
  author={Masataka Yamaguchi and Kuniaki Saito and Y. Ushiku and Tatsuya Harada},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={1462-1471}
}
In this paper, we address the problem of spatio-temporal person retrieval from videos using a natural language query, in which we output a tube (i.e., a sequence of bounding boxes) which encloses the person described by the query. To retrieve the tube of the person described by a given natural language query, we design a model that combines methods for spatio-temporal human detection and multimodal retrieval. We conduct comprehensive experiments to compare a variety of tube and text…
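The abstract describes the pipeline only at a high level. As a rough, hypothetical sketch of the retrieval step (not the paper's actual model), candidate person tubes produced by a spatio-temporal detector could be pooled into visual features and ranked against the query by similarity in a joint embedding space; the encoder names in the comments below are placeholders, not functions defined in the paper.

```python
import numpy as np

def rank_tubes(tube_features: np.ndarray, query_feature: np.ndarray) -> np.ndarray:
    """Rank candidate person tubes by cosine similarity to a text query embedding.

    tube_features: (num_tubes, d) pooled visual features, one row per candidate
                   tube from a spatio-temporal person detector/tracker.
    query_feature: (d,) embedding of the natural language query in the same space.
    Returns tube indices sorted from best to worst match.
    """
    tubes = tube_features / np.linalg.norm(tube_features, axis=1, keepdims=True)
    query = query_feature / np.linalg.norm(query_feature)
    scores = tubes @ query          # cosine similarity per tube
    return np.argsort(-scores)      # highest-scoring tube first

# Hypothetical usage: the encoders below are assumed placeholders, not the paper's model.
# tube_features = visual_encoder(candidate_tubes)
# query_feature = text_encoder("a man in a red jacket walking to the left")
# best_tube_index = rank_tubes(tube_features, query_feature)[0]
```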
36 Citations
Person Tube Retrieval via Language Description
- Computer Science, AAAI
- 2020
Experimental results on person tube retrieval via language description and two other related tasks demonstrate the efficacy of Multi-Scale Structure Preservation (MSSP).
Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video
- Computer Science, ACL
- 2019
This paper localizes a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training, and proposes a new attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones.
Referring to Objects in Videos Using Spatio-Temporal Identifying Descriptions
- Computer Science, Proceedings of the Second Workshop on Shortcomings in Vision and Language
- 2019
A two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion is proposed; it is shown that the motion modules help to ground motion-related words and also aid learning in the appearance modules, because modular neural networks resolve task interference between modules.
Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval
- Computer Science, IEEE Transactions on Image Processing
- 2021
A novel binary representation learning framework, named Semantics-Aware Spatial-Temporal Binaries, is proposed that simultaneously considers spatial-temporal context and semantic relationships for cross-modal video retrieval, and an iterative optimization scheme is adopted to learn deep encoding functions with attribute-guided stochastic training.
STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work proposes a one-stage visual-linguistic transformer based framework called STVGBert for the STVG task, which can simultaneously localize the target object in both the spatial and temporal domains and is believed to be the first one-stage method that can handle the STVG task without relying on any pre-trained object detectors.
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A novel Spatio-Temporal Graph Reasoning Network (STGRN) is proposed for this task, which builds a spatio-temporal region graph to capture region relationships with temporal object dynamics; the graph involves implicit and explicit spatial subgraphs in each frame and a temporal dynamic subgraph across frames.
SBNet: Segmentation-based Network for Natural Language-based Vehicle Search
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2021
A deep neural network called SBNet that performs natural language-based segmentation for vehicle retrieval is proposed, along with two task-specific modules to improve performance: a substitution module that helps features from different domains be embedded in the same space, and a future prediction module that learns temporal information.
Person Search by Queried Description in Vietnamese Natural Language
- Computer Science, ACIIDS
- 2020
A Gated Neural Attention Recurrent Neural Network (GNA-RNN) is employed to learn the affinity between description-image pairs and then to estimate the similarity between a query and the images in the database.
Grounded Video Description
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
A novel video description model is proposed that exploits bounding box annotations, achieves state-of-the-art performance on video description, video paragraph description, and image description, and generates sentences that are better grounded in the video.
Adversarial Attribute-Text Embedding for Person Search With Natural Language Query
- Computer Science, IEEE Transactions on Multimedia
- 2020
A novel Adversarial Attribute-Text Embedding (AATE) network for person search with a text query is proposed; in particular, a cross-modal adversarial learning module learns discriminative and modality-invariant visual-textual features.
References
Showing 1-10 of 59 references
Natural Language Object Retrieval
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large-scale vision and language datasets for knowledge transfer.
Visual Semantic Search: Retrieving Videos via Complex Textual Queries
- Computer Science, 2014 IEEE Conference on Computer Vision and Pattern Recognition
- 2014
This paper first parses the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm, and learns the importance of each term using structured prediction.
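As a simplified stand-in for the generalized bipartite matching mentioned above (the cited paper additionally learns per-term importance via structured prediction), terms parsed from the query could be assigned one-to-one to detected visual concepts with the Hungarian algorithm; the cost values below are hypothetical toy numbers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: terms parsed from the query's semantic graph; columns: visual concepts
# detected in a video. Entries are (hypothetical) dissimilarity costs.
cost = np.array([
    [0.1, 0.9, 0.8],   # "person" vs. {person, car, dog}
    [0.7, 0.2, 0.9],   # "car"    vs. {person, car, dog}
])
rows, cols = linear_sum_assignment(cost)   # minimum-cost one-to-one matching
match_score = -cost[rows, cols].sum()      # higher = better sentence-video match
print(list(zip(rows.tolist(), cols.tolist())), match_score)
```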
Person Search with Natural Language Description
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
A Recurrent Neural Network with a Gated Neural Attention mechanism (GNA-RNN) is proposed that establishes state-of-the-art performance on person search, and a large-scale person description dataset with detailed natural language annotations and person samples from various sources is collected.
Adding Semantics to Detectors for Video Retrieval
- Computer Science, IEEE Transactions on Multimedia
- 2007
An automatic video retrieval method based on high-level concept detectors is presented, i.e., a set of machine-learned concept detectors enriched with semantic descriptions and semantic structure obtained from WordNet, and their combined potential is discussed using oracle fusion.
Object Instance Search in Videos via Spatio-Temporal Trajectory Discovery
- Computer Science, IEEE Transactions on Multimedia
- 2016
The use of spatio-temporal cues to improve the quality of object instance search from videos is explored and the key bottleneck in applying this approach is solved by leveraging a randomized approach to enable fast scoring of any bounding boxes in the video volume.
Grounding of Textual Phrases in Images by Reconstruction
- Computer Science, ECCV
- 2016
A novel approach is proposed that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and its effectiveness is demonstrated on the Flickr30k Entities and ReferItGame datasets.
Video Google: a text retrieval approach to object matching in videos
- Computer Science, Proceedings Ninth IEEE International Conference on Computer Vision
- 2003
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user-outlined object in a video. The object is represented by a set of viewpoint…
Video search in concept subspace: a text-like paradigm
- Computer Science, CIVR '07
- 2007
This paper proposes a video search framework that operates like searching text documents; it first selects a few related concepts for a given query by employing a tf-idf-like scheme, called c-tf-idf, to measure the informativeness of the concepts for the query.
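The exact c-tf-idf definition is given in the cited paper; as a rough analogue only, a plain tf-idf weighting over concept "documents" can illustrate how query terms select informative concepts. The concept names and word lists below are hypothetical toy data.

```python
import math
from collections import Counter

def tf_idf_concepts(query_terms, concept_documents):
    """Rank concepts by a plain tf-idf score for the query terms.

    This is only an illustration of the general idea; the cited c-tf-idf
    scheme is defined differently in the paper.
    """
    n_docs = len(concept_documents)
    df = Counter()                              # document frequency per term
    for words in concept_documents.values():
        df.update(set(words))

    scores = {}
    for concept, words in concept_documents.items():
        tf = Counter(words)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(words)) * idf
        scores[concept] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy example: pick concepts related to "person running outdoors".
concepts = {
    "sports":  ["person", "running", "field", "ball"],
    "indoor":  ["room", "table", "person", "sitting"],
    "traffic": ["car", "road", "outdoors"],
}
print(tf_idf_concepts(["person", "running", "outdoors"], concepts))
```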
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
- Computer Science, IJCAI
- 2013
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Skip-Thought Vectors
- Computer Science, NIPS
- 2015
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the…