Sign-Full Random Projections

  title={Sign-Full Random Projections},
  author={Ping Li},
  • Ping Li
  • Published 26 April 2018
  • Computer Science, Mathematics
  • ArXiv
The method of 1-bit (“sign-sign”) random projections has been a popular tool for efficient search and machine learning on large datasets. Given two D-dim data vectors u, v ∈ ℝD, one can generate x = ∑i=1D uiri, and y = ∑i=1D viri, where ri ∼ N(0, 1) iid. Then one can estimate the cosine similarity ρ from sgn(x) and sgn(y). In this paper, we study a series of estimators for “sign-full” random projections. First we prove E(sgn(x)y) = √2/πρ, which provides an estimator for ρ. Interestingly this… Expand
Random Projections with Asymmetric Quantization
This paper thoroughly analyzes the biases and variances of a series of estimators including the basic simple estimators, their normalized versions, and their debiased versions and shows that the expectation of proposed estimators increases with the true cosine similarity, on a broader family of stair-shaped quantizers. Expand
Improving Random Projections With Extra Vectors to Approximate Inner Products
This research shows how the variance of estimated inner products is reduced as more vectors are added, how variance reduction is related to the geometry of the dataset and moreover, the asymptotic behaviour of the variance as the number of extra vectors added goes to infinity. Expand
Breaking the waves: asymmetric random periodic features for low-bitrate kernel machines
The take-home message is that when the features of only one of the two signals are quantized, the original kernel is recovered without distortion; its practical interest appears in several cases where the kernel evaluations are asymmetric by nature, such as a client-server scheme. Expand


Similarity estimation techniques from rounding algorithms
It is shown that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Expand
Locality-sensitive hashing scheme based on p-stable distributions
A novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions that improves the running time of the earlier algorithm and yields the first known provably efficient approximate NN algorithm for the case p<1. Expand
Improved Approximation Algorithms for Maximum Graph Partitioning Problems
It is proved that a sub-optimal choice of these parameters can be obtained by the solution of a linear program which leads to the desired improvement of the approximation factors. Expand
Finding near-duplicate web pages: a large-scale evaluation of algorithms
A combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms, and since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall than Broder et al.'s algorithm. Expand
Improving Random Projections Using Marginal Information
We present an improved version of random projections that takes advantage of marginal norms. Using a maximum likelihood estimator (MLE), margin-constrained random projections can improve estimationExpand
Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming
This algorithm gives the first substantial progress in approximating MAX CUT in nearly twenty years, and represents the first use of semidefinite programming in the design of approximation algorithms. Expand
Detecting near-duplicates for web crawling
This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Expand
An Introduction to Multivariate Statistical Analysis
Preface to the Third Edition.Preface to the Second Edition.Preface to the First Edition.1. Introduction.2. The Multivariate Normal Distribution.3. Estimation of the Mean Vector and the CovarianceExpand
Learning mixtures of Gaussians
  • S. Dasgupta
  • Mathematics, Computer Science
  • 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039)
  • 1999
This work presents the first provably correct algorithm for learning a mixture of Gaussians, which returns the true centers of the Gaussian to within the precision specified by the user with high probability. Expand
Latent semantic indexing: a probabilistic analysis
It is proved that under certain conditions LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. Expand