# Efficient estimation for high similarities using odd sketches

@article{Mitzenmacher2014EfficientEF, title={Efficient estimation for high similarities using odd sketches}, author={Michael Mitzenmacher and R. Pagh and Ninh D. Pham}, journal={Proceedings of the 23rd international conference on World wide web}, year={2014} }

Estimating set similarity is a central problem in many computer applications. [...] The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
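The abstract above is truncated, so the construction itself is not described in this listing. As a rough illustration of the parity idea the name suggests, here is a minimal sketch in which each element toggles one bit of a fixed-size bit array, so that XOR-ing two sketches yields the sketch of the symmetric difference. The function names, the BLAKE2b-based hash, and the logarithmic estimator (derived from the probability that a bin receives an odd number of items) are assumptions made for this example, not the authors' reference implementation, and the step from symmetric difference to Jaccard similarity is left out.

```python
import hashlib
import math


def _bit_index(item, seed, n_bits):
    # Hypothetical helper: map an item to a bit position with a seeded hash.
    data = f"{seed}:{item}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_bits


def odd_sketch(items, n_bits, seed=0):
    # Each distinct item flips one bit, so a bit ends up holding the
    # parity of the number of items hashed to it.
    sketch = 0
    for x in set(items):
        sketch ^= 1 << _bit_index(x, seed, n_bits)
    return sketch


def estimate_symmetric_difference(sketch_a, sketch_b, n_bits):
    # The XOR of two sketches is the sketch of the symmetric difference.
    # With m items in n bins, a bin is odd with probability about
    # (1 - exp(-2m/n)) / 2; solving for m gives the estimate below.
    z = bin(sketch_a ^ sketch_b).count("1")
    if 2 * z >= n_bits:
        return float("inf")  # sketch is saturated, no finite estimate
    return -(n_bits / 2) * math.log(1 - 2 * z / n_bits)
```

The estimate is most useful when the symmetric difference is small relative to `n_bits`, which matches the high-similarity setting named in the title.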

## 52 Citations

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

- Computer Science, KDD
- 2019

A memory efficient sketch method to accurately estimate Jaccard similarities in streaming sets, MaxLogHash, which uses smaller sized registers (each register consists of less than 7 bits) to build a compact sketch for each set.

Bidirectionally Densifying LSH Sketches with Empty Bins

- Computer Science, SIGMOD Conference
- 2021

Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that the proposed BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.

On the Similarity Search With Hamming Space Sketches

- Computer Science
- 2021

Various challenges of the similarity search with sketches in the Hamming space are addressed, including the definition of sketching transformation and efficient search algorithms that exploit sketches to speed up searching.

Sketches with Unbalanced Bits for Similarity Search

- Computer Science, SISAP
- 2017

This work suggests using sketches with unbalanced bits and shows that such sketches achieve practically the same quality of similarity search while being much easier to index, thanks to the decrease of distances to the nearest neighbours.

Multi-resolution Odd Sketch for Mining Jaccard Similarities between Dynamic Streaming Sets

- Computer Science, 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD)
- 2021

A multi-resolution odd sketch (MROS) is proposed, which allows more accurate similarity estimation with less memory consumption and outperforms existing works, e.g., MinHash and VOS.

XY-Sketch: on Sketching Data Streams at Web Scale

- Computer Science, WWW
- 2021

This paper proposes a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of this item appearing in the data stream, and is orders of magnitude more accurate than existing solutions when the space budget is small.

Efficient binary embedding of categorical data using BinSketch

- Computer Science, Data Min. Knowl. Discov.
- 2022

The proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and the distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches.

Fast and Compact Hamming Distance Index

- Computer Science, IIR
- 2016

New solutions for the approximate dictionary queries problem are proposed which combine the use of succinct data structures with an efficient representation of the keys to significantly reduce the space usage of the state-of-the-art solutions without introducing any time penalty.

Efficient Dimensionality Reduction for Sparse Binary Data

- Computer Science, 2018 IEEE International Conference on Big Data (Big Data)
- 2018

This work provides a single sketch which simultaneously preserves multiple similarity measures including Hamming distance, Inner product, and Jaccard Similarity and gives a rigorous theoretical analysis of the dimensionality reduction bounds.

2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search

- Computer Science, ArXiv
- 2016

2-bit random projections should be recommended for approximate near neighbor search and similarity estimation via hash tables, and accurate nonlinear estimators of data similarity based on the 2-bit strategy are developed.

## References

Showing 1-10 of 24 references.

Improved Consistent Sampling, Weighted Minhash and L1 Sketching

- Computer Science, 2010 IEEE International Conference on Data Mining
- 2010

A novel method of mapping hashes to short bit-strings is applied to Weighted Minhash, achieving more accurate distance estimates from sketches than existing methods, as long as the inputs are sufficiently distinct.

Hashing Algorithms for Large-Scale Learning

- Computer Science, NIPS
- 2011

It is demonstrated that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory.

b-Bit minwise hashing

- Computer Science, WWW '10
- 2010

This paper establishes the theoretical framework of b-bit minwise hashing and provides an unbiased estimator of the resemblance for any b and demonstrates that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3.

Finding near-duplicate web pages: a large-scale evaluation of algorithms

- Computer Science, SIGIR
- 2006

A combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms, and since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall than Broder et al.'s algorithm.

Detecting near-duplicates for web crawling

- Computer Science, WWW '07
- 2007

This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
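The matching criterion in this summary, fingerprints that differ from a query in at most k bit positions, can be illustrated with a plain Hamming distance check. The sketch below is a naive linear scan written only to make the criterion concrete; it is not the indexing technique the paper presents, and the function names are assumptions for this example.

```python
def hamming_distance(a, b):
    # Number of bit positions in which two integer fingerprints differ.
    return bin(a ^ b).count("1")


def fingerprints_within(query, fingerprints, k):
    # Naive scan: keep every stored fingerprint that differs from the
    # query in at most k bit positions.
    return [fp for fp in fingerprints if hamming_distance(query, fp) <= k]
```

For example, `hamming_distance(0b1011, 0b0010)` is 2, because the two values differ in the lowest and the highest of the four bits.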

Exploiting asymmetry in hierarchical topic extraction

- Computer Science, CIKM '06
- 2006

Efficient algorithms using the technique of Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in a collection are presented.

Sketching Techniques for Collaborative Filtering

- Computer Science, IJCAI
- 2009

A method for quickly determining the proportional intersection between the items that each of two users has examined, by sending and maintaining extremely concise "sketches" of the list of items, based on random min-wise independent hash functions.

Tracking Web spam with HTML style similarities

- Computer Science, TWEB
- 2008

This work studies and compares several HTML style similarity measures based on both textual and extra-textual features in HTML source code and proposes a flexible algorithm to cluster a large collection of documents according to these measures.

Finding interesting associations without support pruning

- Computer Science, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073)
- 2000

This work develops a family of algorithms for solving association rule mining, employing a combination of random sampling and hashing techniques, provides an analysis of the algorithms developed, and conducts experiments on real and synthetic data to obtain a comparative performance analysis.

On the resemblance and containment of documents

- Computer Science, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171)
- 1997

The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
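One common way to realize this kind of per-document random sampling is min-wise hashing: each document keeps the minimum hash value of its elements under several hash functions, and the fraction of positions where two signatures agree estimates the resemblance (Jaccard similarity). The sketch below assumes seeded BLAKE2b hashes as a stand-in for random permutations; the function names and the default of 128 hash functions are choices made for this example, not details taken from the paper.

```python
import hashlib


def _seeded_hash(item, seed):
    # Hypothetical helper: 64-bit hash of an item under a given seed.
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def jaccard(a, b):
    # Exact resemblance |A ∩ B| / |A ∪ B|, for comparison with the estimate.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=128):
    # One minimum per seeded hash function; computed independently per set.
    # Requires a non-empty set, since min() of an empty sequence is undefined.
    items = set(items)
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    # Two sets share the same minimum under one hash function with
    # probability equal to their Jaccard similarity.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Because each signature depends only on its own set, signatures can be computed once per document and compared pairwise afterwards.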