Efficient Visual Search of Videos Cast as Text Retrieval

Josef Sivic and Andrew Zisserman, IEEE Transactions on Pattern Analysis and Machine Intelligence
We describe an approach to object retrieval which searches for and localizes all the occurrences of an object in a video, given a query image of the object. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject those that are unstable. Efficient retrieval is… 
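The core idea of casting visual search as text retrieval can be sketched as follows: each image's region descriptors are quantised into "visual word" ids, and images are then ranked with standard tf-idf weighting and cosine similarity, exactly as documents are in a text engine. The function below is an illustrative sketch under that analogy, not the paper's implementation; the integer word ids stand in for quantised descriptor labels.

```python
import math
from collections import Counter

def tfidf_rank(query_words, database):
    """Rank database images against a query, both given as lists of
    visual-word ids, using tf-idf weighting and cosine similarity
    borrowed from text retrieval. Returns (score, index) pairs,
    best first. (Sketch only; word ids are hypothetical quantised
    descriptor labels.)"""
    n_docs = len(database)
    # document frequency: in how many images each visual word appears
    df = Counter()
    for words in database:
        df.update(set(words))

    def vector(words):
        # tf-idf weight = (term frequency) * log(N / document frequency)
        tf = Counter(words)
        n = len(words)
        return {w: (c / n) * math.log(n_docs / df[w])
                for w, c in tf.items() if df.get(w)}

    q = vector(query_words)
    nq = math.sqrt(sum(v * v for v in q.values()))
    scores = []
    for i, words in enumerate(database):
        d = vector(words)
        dot = sum(q[w] * d[w] for w in q if w in d)
        nd = math.sqrt(sum(v * v for v in d.values()))
        scores.append((dot / (nq * nd) if nq and nd else 0.0, i))
    return sorted(scores, reverse=True)
```

An inverted index over the visual words makes this ranking sub-linear in practice, since only images sharing at least one word with the query need scoring.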
Analysis of Using Metric Access Methods for Visual Search of Objects in Video Databases
An approach to object retrieval is presented that searches for and localizes all the occurrences of an object in a video database, given a query image of the object; a ranking strategy based on the spatial layout of the regions (spatial consistency) is fully described and evaluated.
Advancing large scale object retrieval
It is shown that issuing multiple queries significantly improves recall and enables the system to find quite challenging occurrences of the queried object, and a method for automatically determining the title and sculptor of an imaged sculpture using the proposed smooth object retrieval system is described.
Spatially-aware indexing for image object retrieval
This paper proposes two spatially-aware retrieval strategies for image object retrieval that replace the vector space model; they show a significant improvement in early precision while significantly reducing the number of candidates to be considered at retrieval time.
Object Mining for Large Video data
This paper uses information available from scripts and subtitles in order to group all occurrences of an object in video data, which provides a separate representation for each scene, and proposes a graph-based representation in which vertices represent objects rather than video frames.
Coalesced global and local feature discrimination for content-based image retrieval
A novel method is presented for image saliency detection that uses a more efficient color space model based on the color distribution of the images, rather than on primary visual features, and is shown to be more efficient.
Video Object Retrieval by Trajectory and Appearance
This paper proposes to retrieve a desired object from video through the inputs of its trajectory and/or appearance, aided by a 3-D graphical user interface for more intuitive interaction, so that more satisfactory results can be achieved.
Query-Adaptive Multiple Instance Learning for Video Instance Retrieval
A novel query-adaptive multiple instance learning algorithm is proposed, which exploits the visual appearance information of the object of interest (OOI) from both the query and the aforementioned video frames, exhibiting additional discriminating ability when retrieving relevant instances.
Image Retrieval with a Visual Thesaurus
The method in this paper is borrowed from text retrieval, and is analogous to a text thesaurus in that it describes a broad set of equivalence relationships between words.
Video querying via compact descriptors of visually salient objects
New feature-agnostic approaches for efficient retrieval of similar video content are investigated and it is suggested that compact descriptors obtained via low-rank matrix factorization improve discriminability and robustness to parameter selection compared to k-means clustering.
Detecting objects in large image collections and videos by efficient subimage retrieval
An extensive evaluation on several datasets shows that ESR is not only very fast, but it also achieves excellent detection accuracies thereby improving over previous systems for object-based image retrieval.


Video Google: a text retrieval approach to object matching in videos
We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint
Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval
This paper brings query expansion into the visual domain via two novel contributions: strong spatial constraints between the query image and each result allow us to accurately verify each return, suppressing the false positives which typically ruin text-based query expansion.
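One common form of visual query expansion is averaging: the histograms of the spatially verified top results are merged with the original query histogram, and the averaged histogram is reissued as a richer query. The sketch below illustrates that averaging step under assumed dict-of-counts histograms; the function name and format are illustrative, not the paper's API.

```python
from collections import Counter

def expand_query(query_hist, verified_hists):
    """Average query expansion, sketched: merge the original query's
    visual-word histogram with those of the spatially verified top
    results, then issue the averaged histogram as a new query.
    (Histogram format and function name are hypothetical.)"""
    total = Counter(query_hist)
    for h in verified_hists:
        total.update(h)
    n = 1 + len(verified_hists)
    return {w: c / n for w, c in total.items()}
```

Restricting the expansion to spatially verified results is what keeps the false positives, which typically ruin text-based expansion, out of the new query.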
Object retrieval with large vocabularies and fast spatial matching
To improve query performance, this work adds an efficient spatial verification stage to re-rank the results returned from the bag-of-words model and shows that this consistently improves search quality, though by less of a margin when the visual vocabulary is large.
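Spatial verification re-ranks bag-of-words results by counting matched regions that agree on a single geometric transformation between query and result. The paper fits affine transformations; the sketch below substitutes a deliberately simpler model, a single 2-D translation hypothesised from each correspondence, to show the hypothesize-and-count-inliers structure. All names and the match format are illustrative.

```python
def spatial_inliers(matches, tol=2.0):
    """Count correspondences consistent with a single 2-D translation,
    a simplified stand-in for affine verification. `matches` is a list
    of ((qx, qy), (dx, dy)) point pairs: hypothetical matched region
    centres in the query and the database image."""
    best = 0
    for (qx, qy), (dx, dy) in matches:
        # hypothesise the translation implied by this correspondence
        tx, ty = dx - qx, dy - qy
        inliers = sum(
            1 for (ax, ay), (bx, by) in matches
            if abs((bx - ax) - tx) <= tol and abs((by - ay) - ty) <= tol
        )
        best = max(best, inliers)
    return best

def rerank(results, tol=2.0):
    """Re-rank (image_id, matches) candidates by spatial inlier count."""
    return sorted(results, key=lambda r: spatial_inliers(r[1], tol),
                  reverse=True)
```

Because only the shortlist returned by the bag-of-words stage is verified, the extra geometric work stays cheap relative to the whole database.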
Scalable Recognition with a Vocabulary Tree
  • D. Nistér and Henrik Stewénius, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)
A recognition scheme is presented that scales efficiently to a large number of objects and allows a larger and more discriminative vocabulary to be used efficiently, which is shown experimentally to lead to a dramatic improvement in retrieval quality.
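The vocabulary tree's efficiency comes from quantising a descriptor by descending a tree of cluster centres: a vocabulary of branch**depth words costs only branch × depth distance comparisons per descriptor instead of one per word. The sketch below descends a toy two-level tree with hand-picked 2-D centres; a real system trains the centres with hierarchical k-means over millions of high-dimensional descriptors, so the tree and points here are purely illustrative.

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def quantise(tree, p, word=0):
    """Greedily descend the tree, picking the nearest centre at each
    level; the path taken encodes the leaf visual-word id. Each level
    costs only `branch` comparisons, giving logarithmic lookup in the
    vocabulary size. (Illustrative sketch, not the paper's code.)"""
    if tree is None:
        return word
    centres, children = tree
    j = min(range(len(centres)), key=lambda i: dist2(p, centres[i]))
    return quantise(children[j], p, word * len(centres) + j)

# Toy two-level tree, branch factor 2 -> 4 leaf "visual words".
# Each node is (centres, children); leaves are None.
tree = (
    [(0, 0), (10, 10)],
    [
        ([(0, 0), (0, 4)], [None, None]),
        ([(10, 10), (14, 10)], [None, None]),
    ],
)
```

For example, a descriptor near (13, 9) first picks the (10, 10) branch, then the (14, 10) centre, so two comparisons per level locate one of the four words.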
Person Spotting: Video Shot Retrieval for Face Sets
Given a query face in a shot, progress is described in retrieving people automatically in videos by harnessing multiple exemplars of each person, in a form that can easily be associated automatically using straightforward visual tracking.
Object Level Grouping for Video Shots
A method is presented for automatically obtaining object representations suitable for retrieval from generic video shots; it associates regions within a single shot to represent a deforming object, and includes an affine factorization method that copes with motion degeneracy.
Automated location matching in movies
Sub-linear Indexing for Large Scale Object Recognition
A method is presented that is capable of recognising one of N objects in O(log N) time while preserving all the strengths of local affine region methods: robustness to background clutter, occlusion, and large changes of viewpoint.
Shape recognition with edge-based features
An approach is described to recognizing poorly textured objects that may contain holes and tubular parts in cluttered scenes under arbitrary viewing conditions, and a new edge-based local feature detector that is invariant to similarity transformations is introduced.