Efficient estimation for high similarities using odd sketches

@inproceedings{Mitzenmacher2014EfficientEF,
  title={Efficient estimation for high similarities using odd sketches},
  author={Michael Mitzenmacher and R. Pagh and Ninh D. Pham},
  booktitle={Proceedings of the 23rd International Conference on World Wide Web},
  year={2014}
}
Estimating set similarity is a central problem in many computer applications. […] The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the estimation quality that guarantees the reliability of Odd Sketch-based estimators. Our experiments confirm this analysis and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
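The pipeline the abstract describes (minwise sampling, then a parity ("odd") sketch, with similarity recovered from the Hamming distance between two sketches) can be illustrated with a minimal Python sketch. The function names, the use of SHA-256, and the parameter choices below are illustrative stand-ins, not the paper's implementation; the estimator is the paper's $\hat{J} = 1 + \frac{n}{4k}\ln(1 - 2\,\mathrm{ham}/n)$.

```python
import hashlib
import math

def minhash_samples(items, k, seed=0):
    """Draw k minwise samples using k independent (simulated) hash functions."""
    return [min(items, key=lambda x: hashlib.sha256(f"{seed}:{i}:{x}".encode()).digest())
            for i in range(k)]

def odd_sketch(samples, n, seed=1):
    """n-bit odd sketch: each (index, sample) pair XOR-toggles one bit, so
    identical samples cancel and odd(A) xor odd(B) = odd(A symmetric-diff B)."""
    bits = [0] * n
    for i, s in enumerate(samples):
        j = int.from_bytes(hashlib.sha256(f"{seed}:{i}:{s}".encode()).digest(), "big") % n
        bits[j] ^= 1
    return bits

def estimate_jaccard(sk_a, sk_b, k):
    """Jaccard estimate from the Hamming distance of two odd sketches:
    J_hat = 1 + (n / 4k) * ln(1 - 2*ham/n)."""
    n = len(sk_a)
    ham = sum(a ^ b for a, b in zip(sk_a, sk_b))
    return 1.0 + (n / (4.0 * k)) * math.log(1.0 - 2.0 * ham / n)

A, B = set(range(1000)), set(range(50, 1050))   # true Jaccard = 950/1050 ~ 0.905
k, n = 200, 256                                 # 200 samples, 256-bit sketch per set
est = estimate_jaccard(odd_sketch(minhash_samples(A, k), n),
                       odd_sketch(minhash_samples(B, k), n), k)
print(round(est, 3))   # estimate of the true Jaccard 0.905
```

Note the memory angle that motivates the paper: each set is summarized by a single 256-bit parity vector, regardless of set size, and the estimator works best exactly in the high-similarity regime, where the XOR of the two sketches is sparse.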
A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
TLDR
A memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets; it uses small registers (fewer than 7 bits each) to build a compact sketch for each set.
Bidirectionally Densifying LSH Sketches with Empty Bins
TLDR
Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that the proposed BiDens is up to 10^6 times faster than state-of-the-art methods while achieving the same or even better accuracy.
On the Similarity Search With Hamming Space Sketches
TLDR
Various challenges of the similarity search with sketches in the Hamming space are addressed, including the definition of sketching transformation and efficient search algorithms that exploit sketches to speed up searching.
Sketches with Unbalanced Bits for Similarity Search
TLDR
This work suggests using sketches with unbalanced bits and shows that such sketches achieve practically the same quality of similarity search while being much easier to index, thanks to the decreased distances to the nearest neighbours.
Multi-resolution Odd Sketch for Mining Jaccard Similarities between Dynamic Streaming Sets
TLDR
A multi-resolution odd sketch (MROS) is proposed, which allows more accurate similarity estimation with less memory consumption and outperforms existing works, e.g., MinHash and VOS.
XY-Sketch: on Sketching Data Streams at Web Scale
TLDR
This paper proposes a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of the item appearing in the data stream; it is orders of magnitude more accurate than existing solutions when the space budget is small.
Efficient binary embedding of categorical data using BinSketch
TLDR
The proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and the distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.
Fast and Compact Hamming Distance Index
TLDR
New solutions for the approximate dictionary queries problem are proposed which combine the use of succinct data structures with an efficient representation of the keys to significantly reduce the space usage of the state-of-the-art solutions without introducing any time penalty.
Efficient Dimensionality Reduction for Sparse Binary Data
TLDR
This work provides a single sketch which simultaneously preserves multiple similarity measures including Hamming distance, Inner product, and Jaccard Similarity and gives a rigorous theoretical analysis of the dimensionality reduction bounds.
2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search
TLDR
2-bit random projections should be recommended for approximate near neighbor search and similarity estimation via hash tables and accurate nonlinear estimators of data similarity based on the 2-bit strategy are developed.

References

Showing 1–10 of 24 references
Improved Consistent Sampling, Weighted Minhash and L1 Sketching
  • S. Ioffe
  • Computer Science
    2010 IEEE International Conference on Data Mining
  • 2010
TLDR
A novel method of mapping hashes to short bit-strings is presented and applied to Weighted Minhash, achieving more accurate distance estimates from sketches than existing methods, as long as the inputs are sufficiently distinct.
Hashing Algorithms for Large-Scale Learning
TLDR
It is demonstrated that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory.
b-Bit minwise hashing
TLDR
This paper establishes the theoretical framework of b-bit minwise hashing, provides an unbiased estimator of the resemblance for any b, and demonstrates that, even in the least favorable scenario, using b=1 may reduce the storage space by at least a factor of 21.3.
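The b-bit scheme referenced above, the main baseline in the Odd Sketch experiments, keeps only the lowest b bits of each minwise hash. Under the simplifying assumption that non-matching samples collide uniformly (i.e. ignoring the density-correction terms of the full estimator in this paper), the match rate p satisfies p = J + (1 − J)·2^−b, which the hypothetical helper below inverts; names and parameters are illustrative.

```python
import hashlib

def bbit_signature(items, k, b, seed=0):
    """k minwise hashes, each truncated to its lowest b bits (b-bit minwise hashing)."""
    mask = (1 << b) - 1
    sig = []
    for i in range(k):
        mh = min(int.from_bytes(hashlib.sha256(f"{seed}:{i}:{x}".encode()).digest(), "big")
                 for x in items)
        sig.append(mh & mask)   # keep only the lowest b bits of the minwise hash
    return sig

def estimate_jaccard_bbit(sig_a, sig_b, b):
    """Invert p = J + (1 - J) * 2**-b  (simplified uniform-collision model)."""
    p = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    r = 2.0 ** -b
    return (p - r) / (1.0 - r)

A, B = set(range(1000)), set(range(50, 1050))   # true Jaccard ~ 0.905
k, b = 200, 1                                   # 200 signatures of 1 bit each
est = estimate_jaccard_bbit(bbit_signature(A, k, b), bbit_signature(B, k, b), b)
```

With b=1 each sample costs a single bit, which is the storage-reduction scenario the TLDR quantifies; the Odd Sketch comparison in the main paper targets the same bits-per-sample regime.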
Finding near-duplicate web pages: a large-scale evaluation of algorithms
TLDR
A combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms, and since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall than Broder et al.'s algorithm.
Detecting near-duplicates for web crawling
TLDR
This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
Exploiting asymmetry in hierarchical topic extraction
TLDR
Efficient algorithms using the technique of Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in a collection are presented.
Sketching Techniques for Collaborative Filtering
TLDR
A method for quickly determining the proportional intersection between the items that each of two users has examined, by sending and maintaining extremely concise "sketches" of the list of items, based on random min-wise independent hash functions.
Tracking Web spam with HTML style similarities
TLDR
This work studies and compares several HTML style similarity measures based on both textual and extra-textual features in HTML source code, and proposes a flexible algorithm to cluster a large collection of documents according to these measures.
Finding interesting associations without support pruning
  • E. Cohen, Mayur Datar, Cheng Yang
  • Computer Science
    Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)
  • 2000
TLDR
This work develops a family of algorithms for association rule mining, employing a combination of random sampling and hashing techniques; it provides an analysis of the algorithms and conducts experiments on real and synthetic data for a comparative performance evaluation.
On the resemblance and containment of documents
  • A. Broder
  • Computer Science
    Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
  • 1997
TLDR
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.