• Publications
ImageNet: A large-scale hierarchical image database
TLDR
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Efficient k-nearest neighbor graph construction for generic similarity measures
TLDR
NN-Descent is presented, a simple yet efficient algorithm for approximate K-NNG construction with arbitrary similarity measures that typically converges to above 90% recall with each point comparing only to several percent of the whole dataset on average.
Modeling LSH for performance tuning
TLDR
A statistical performance model of Multi-probe LSH, a state-of-the-art variant of LSH, is presented, which can accurately predict the average search quality and latency given a small sample dataset, and an adaptive LSH search algorithm is devised to determine the probing parameter dynamically for each query.
Tradeoffs in Scalable Data Routing for Deduplication Clusters
TLDR
A cluster-based deduplication system is presented that can deduplicate with high throughput, support deduplication ratios comparable to that of a single system, and maintain a low variation in the storage utilization of individual nodes.
Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces
TLDR
An efficient sketch algorithm for similarity search with L2 distances is proposed, along with a novel asymmetric distance estimation technique that takes advantage of the original feature vector of the query to boost the distance estimation accuracy.
Efficiently matching sets of features with random histograms
TLDR
A randomized algorithm is proposed that embeds a set of features into a single high-dimensional vector to simplify the feature-set matching problem; it can achieve accuracy comparable to the state-of-the-art feature-set matching methods while requiring significantly less space and time.
High-confidence near-duplicate image detection
TLDR
It is shown that entropy-based filtering eliminates ambiguous SIFT features that cause most of the false positives, and enables claiming near-duplicity with a single match of the retained high-quality features, and that graph cut can be used for query expansion with a duplicity graph computed offline to substantially improve search quality.
Sizing sketches: a rank-based analysis for similarity search
TLDR
A rank-based filtering model that describes the relationship between sketch size and dataset size based on the dataset distance distribution is presented, and the resulting model can make good predictions for a large dataset.
Document Hashing with Mixture-Prior Generative Models
TLDR
Two mixture-prior generative models are proposed with the objective of producing high-quality hashing codes for documents, and experimental results on several benchmark datasets demonstrate that the proposed methods consistently outperform existing ones by a substantial margin.
High-dimensional similarity search for large datasets
TLDR
This dissertation studies several key issues to improve the accuracy and efficiency of high-dimensional similarity search and develops a scheme to compactly represent sets of feature vectors, an increasingly popular data representation that is more accurate than single vectors, but also more expensive.
...