A new database called “ImageNet” is introduced: a large-scale ontology of images built on the backbone of the WordNet structure that is much larger in scale and diversity, and much more accurate, than current image datasets.
NN-Descent is presented, a simple yet efficient algorithm for approximate K-NNG construction with arbitrary similarity measures that typically converges to above 90% recall while comparing each point against only a few percent of the whole dataset on average.
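To make the idea concrete, here is a minimal sketch of the NN-Descent scheme, assuming the standard formulation (random initial K-NN lists improved by repeated "local joins" over each point's neighbors and reverse neighbors); the function name `nn_descent` and the fixed iteration cap are illustrative, and the paper's sampling and incremental-search optimizations are omitted.

```python
import random

def nn_descent(points, k, distance, max_iters=10):
    """Minimal NN-Descent sketch: start from random K-NN lists and
    repeatedly improve them by comparing each point's neighbors and
    reverse neighbors with one another (a "local join")."""
    n = len(points)  # assumes n > k
    # Initialize each K-NN list with k random neighbors, sorted by distance.
    knn = []
    for i in range(n):
        sample = random.sample([j for j in range(n) if j != i], k)
        knn.append(sorted((distance(points[i], points[j]), j) for j in sample))
    for _ in range(max_iters):
        updated = False
        # Reverse neighbors: i is a reverse neighbor of j if j is in i's list.
        reverse = [set() for _ in range(n)]
        for i in range(n):
            for _, j in knn[i]:
                reverse[j].add(i)
        for i in range(n):
            # Candidate pool: neighbors plus reverse neighbors of i.
            pool = list({j for _, j in knn[i]} | reverse[i])
            # Local join: neighbors of a point are likely neighbors of each other.
            for a in range(len(pool)):
                for b in range(a + 1, len(pool)):
                    u, v = pool[a], pool[b]
                    d = distance(points[u], points[v])
                    for x, y in ((u, v), (v, u)):
                        # Replace x's current worst neighbor if y is closer and new.
                        if d < knn[x][-1][0] and all(j != y for _, j in knn[x]):
                            knn[x][-1] = (d, y)
                            knn[x].sort()
                            updated = True
        if not updated:
            break  # converged: no K-NN list changed in this pass
    return knn
```

Any symmetric distance or dissimilarity function can be plugged in as `distance`, which is the point of the method: it never assumes a metric or a vector space.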
A statistical performance model of multi-probe LSH, a state-of-the-art variant of LSH, is presented; the model accurately predicts the average search quality and latency given a small sample dataset, and an adaptive LSH search algorithm is devised to determine the probing parameter dynamically for each query.
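A minimal sketch of the multi-probe idea follows, assuming p-stable (floor of random projection) hash functions; the class name `MultiProbeLSH`, the single-coordinate perturbation order, and the fixed `num_probes` budget are all illustrative simplifications. The paper's contribution is precisely to replace the fixed budget with a per-query probing parameter chosen from the statistical performance model.

```python
import numpy as np

class MultiProbeLSH:
    """Minimal multi-probe LSH sketch with p-stable hash functions
    h(v) = floor((a.v + b) / w). A query probes its own bucket plus
    buckets reached by perturbing one hash coordinate by +/-1, cheapest
    boundary crossings first."""

    def __init__(self, dim, num_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(num_hashes, dim))
        self.b = rng.uniform(0, w, size=num_hashes)
        self.w = w
        self.table = {}

    def _hash(self, v):
        return np.floor((self.a @ v + self.b) / self.w).astype(int)

    def insert(self, idx, v):
        self.table.setdefault(tuple(self._hash(v)), []).append(idx)

    def query(self, v, num_probes=8):
        proj = (self.a @ v + self.b) / self.w
        base = np.floor(proj).astype(int)
        frac = proj - base  # position within the bucket, in [0, 1)
        # Rank single-coordinate perturbations by distance to the
        # nearest bucket boundary, and step toward that boundary.
        deltas = sorted(
            (min(f, 1 - f), i, -1 if f < 0.5 else 1) for i, f in enumerate(frac)
        )
        probes = [tuple(base)]
        for _, i, step in deltas[: num_probes - 1]:
            key = base.copy()
            key[i] += step
            probes.append(tuple(key))
        candidates = []
        for key in probes:
            candidates.extend(self.table.get(key, []))
        return candidates
```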
A cluster-based deduplication system is presented that can deduplicate with high throughput, support deduplication ratios comparable to that of a single system, and maintain a low variation in the storage utilization of individual nodes.
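The routing step can be sketched as follows, assuming stateless superchunk routing by a representative fingerprint (a common design in deduplication clusters); the function name and the choice of the minimum SHA-1 fingerprint as representative are illustrative, not necessarily the paper's exact rule.

```python
import hashlib

def route_superchunk(chunks, num_nodes):
    """Stateless routing sketch for a dedup cluster: group consecutive
    chunks into a superchunk and send the whole group to the node chosen
    by one representative fingerprint, so duplicate superchunks land on
    the same node while load stays balanced across nodes."""
    fingerprints = [hashlib.sha1(c).digest() for c in chunks]  # chunks are bytes
    representative = min(fingerprints)  # illustrative choice of representative
    return int.from_bytes(representative[:8], "big") % num_nodes
```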
An efficient sketch algorithm for similarity search with L2 distances is presented, along with a novel asymmetric distance estimation technique that takes advantage of the original feature vector of the query to boost distance estimation accuracy.
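A minimal sketch of the asymmetry, assuming random-projection bit sketches: the database keeps only the bits, while the query keeps its raw projections and weights each mismatched bit by its margin. The class name and the specific weighting are illustrative; this is not the paper's exact estimator.

```python
import numpy as np

class L2Sketch:
    """Bit sketches from random projections: bit i is sign(a_i.x - t_i).
    Symmetric estimation compares two bit vectors by Hamming distance;
    asymmetric estimation keeps the query's raw projections and weights
    each mismatched bit by how far the query lies from the threshold."""

    def __init__(self, dim, num_bits=64, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(num_bits, dim))
        self.t = rng.normal(size=num_bits)

    def sketch(self, x):
        return (self.a @ x - self.t) > 0           # compact bit vector

    def symmetric(self, bits_q, bits_x):
        return np.count_nonzero(bits_q != bits_x)  # plain Hamming distance

    def asymmetric(self, q, bits_x):
        proj = self.a @ q - self.t                 # raw query projections
        # Bits the query barely crosses contribute little; confident
        # mismatches contribute their full margin.
        mismatch = (proj > 0) != bits_x
        return np.abs(proj)[mismatch].sum()
```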
A randomized algorithm is proposed that embeds a set of features into a single high-dimensional vector to simplify the feature-set matching problem; it achieves accuracy comparable to state-of-the-art feature-set matching methods while requiring significantly less space and time.
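As an illustration of how such an embedding might work, here is a hypothetical random-histogram construction: each feature is LSH-bucketed by random hyperplane signs and the set becomes one normalized histogram vector, so set matching reduces to a dot product. The function name, the power-of-two `dim_out`, and the bucketing scheme are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def embed_feature_set(features, dim_out=1024, seed=0):
    """Embed a 2-D array of feature vectors (one per row) into a single
    fixed-size vector: hash each feature to a bucket via random hyperplane
    signs, accumulate a histogram, and L2-normalize it."""
    rng = np.random.default_rng(seed)
    dim_in = features.shape[1]
    bits = int(np.log2(dim_out))          # assumes dim_out is a power of two
    planes = rng.normal(size=(bits, dim_in))
    out = np.zeros(dim_out)
    for f in features:
        signs = (planes @ f) > 0          # one sign bit per hyperplane
        bucket = int(signs @ (1 << np.arange(bits)))
        out[bucket] += 1.0
    return out / max(np.linalg.norm(out), 1e-12)
```

With this sketch, two embedded sets compare by a single dot product instead of a quadratic-cost matching over all feature pairs.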
It is shown that entropy-based filtering eliminates the ambiguous SIFT features responsible for most false positives, allowing near-duplicity to be claimed from a single match of the retained high-quality features, and that graph cut over a duplicity graph computed offline can be used for query expansion to substantially improve search quality.
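The filtering step might look like the following sketch; note the paper computes entropy over a feature's local image patch, whereas this simplification treats the 128-dimensional SIFT descriptor itself as a distribution, and the threshold value is purely illustrative.

```python
import numpy as np

def descriptor_entropy(desc):
    """Shannon entropy of a (non-negative) SIFT descriptor treated as a
    distribution. Simplification: the paper measures entropy over the
    feature's local image patch, not over the descriptor."""
    p = desc / max(desc.sum(), 1e-12)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def filter_features(descriptors, min_entropy=4.0):
    """Keep only high-entropy (information-rich) descriptors; low-entropy
    ones tend to be ambiguous and cause false matches. The threshold is
    illustrative, not from the paper."""
    return [d for d in descriptors if descriptor_entropy(d) >= min_entropy]
```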
A rank-based filtering model that describes the relationship between sketch size and dataset size based on the dataset's distance distribution is presented; the resulting model makes good predictions for a large dataset.
Two mixture-prior generative models are proposed with the objective of producing high-quality hashing codes for documents, and experimental results on several benchmark datasets demonstrate that the proposed methods consistently outperform existing ones by a substantial margin.
This dissertation studies several key issues to improve the accuracy and efficiency of high-dimensional similarity search and develops a scheme to compactly represent sets of feature vectors, an increasingly popular data representation that is more accurate than single vectors, but also more expensive.