Probabilistic Blocking with an Application to the Syrian Conflict

  • Rebecca C. Steorts, Anshumali Shrivastava
Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic…
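To make the idea of LSH-based blocking concrete, here is a minimal Python sketch. It is not KLSH and not the authors' method: it assumes the common minhash-with-banding variant of LSH over character shingles, and every name in it (`shingles`, `hash_token`, `minhash_signature`, `lsh_blocks`, `candidate_pairs`) as well as the parameter defaults are hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased string."""
    s = text.lower()
    if len(s) < k:
        # Assumption: very short strings are treated as a single token.
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def hash_token(token, seed):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes):
    """Minhash signature: the minimum hash value under each seeded hash function."""
    return tuple(
        min(hash_token(t, seed) for t in tokens) for seed in range(num_hashes)
    )


def lsh_blocks(records, num_hashes=20, bands=10):
    """Group record ids into blocks: two records share a block if their
    signatures agree on every row of at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(rec_id)
    # Only buckets holding more than one record produce comparisons.
    return [ids for ids in buckets.values() if len(ids) > 1]


def candidate_pairs(blocks):
    """Collect the distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs
```

Calling `candidate_pairs(lsh_blocks(records))` on a dictionary that maps record ids to name strings returns only the pairs that collide in some band, so a downstream matcher compares those pairs instead of all pairs. Raising `bands` (with fewer rows per band) admits more candidate pairs, while lowering it makes the blocks stricter.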



Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

The heart of the proposed hash function is a "rotation" scheme that densifies the sparse sketches of one permutation hashing in an unbiased fashion, thereby maintaining the LSH property and making the resulting sketches suitable for hash table construction.

Locality sensitive hashing: A comparison of hash function types and querying mechanisms

Exploiting asymmetry in hierarchical topic extraction

Efficient algorithms are presented that use Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in the collection.

Improved Densification of One Permutation Hashing

A new densification procedure is provided which is provably better than the existing scheme and has the same cost of $O(d + KL)$ for query processing, thereby making it strictly preferable over the existing procedure.

Similarity-aware indexing for real-time entity resolution

Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

  • P. Christen
  • Computer Science
    IEEE Transactions on Knowledge and Data Engineering
  • 2012
This survey presents 12 variations of 6 indexing techniques for record linkage and deduplication. The techniques aim to reduce the number of record pairs compared in the matching process by removing obvious non-matching pairs while maintaining high matching quality.

Blocking Methods Applied to Casualty Records from the Syrian Conflict

This work proposes locality sensitive hashing (LSH) methods for estimating death counts in Syria. It compares the proposed approach with others in the literature to demonstrate its computational superiority and to assess its error rates.

In Defense of Minhash over Simhash

A theoretical answer, validated by experiments, is provided: MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search.

An Evaluation Framework for Privacy-Preserving Record Linkage

A general framework with normalized measures is proposed to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset. The results show that the framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.

Improved Consistent Sampling, Weighted Minhash and L1 Sketching

  • S. Ioffe
  • Computer Science
    2010 IEEE International Conference on Data Mining
  • 2010
This paper proposes a novel method of mapping hashes to short bit-strings, applies it to Weighted Minhash, and achieves more accurate distance estimates from sketches than existing methods, as long as the inputs are sufficiently distinct.