Probabilistic Blocking with an Application to the Syrian Conflict

@inproceedings{Steorts2018ProbabilisticBW,
  title={Probabilistic Blocking with an Application to the Syrian Conflict},
  author={Rebecca C. Steorts and Anshumali Shrivastava},
  booktitle={PSD},
  year={2018}
}
Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic…
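
To make the idea of LSH-based blocking concrete, the following is a minimal sketch of generic MinHash banding, assuming records are plain strings compared on character shingles. It is not the KLSH method or the second method named in the abstract; the function names, the parameters `bands`, `rows`, and `k`, and the toy records are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased record string."""
    s = text.lower().strip()
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: for each seed, the minimum hash over the token set."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def lsh_blocks(records, bands=8, rows=2, k=2):
    """Place each record index into one bucket per signature band.

    Records whose signatures agree on every row of at least one band
    land in the same bucket, which serves as a block.
    """
    buckets = defaultdict(set)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec, k), bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    return buckets


def candidate_pairs(buckets):
    """Enumerate index pairs (i, j), i < j, that co-occur in some bucket."""
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs


if __name__ == "__main__":
    # Made-up toy records, not drawn from any real data set.
    toy = ["john smith", "john smith", "a completely different entry"]
    print(sorted(candidate_pairs(lsh_blocks(toy))))
```

Only pairs that share a bucket are passed on to the more expensive record comparison step, which is how blocking avoids comparing all pairs. Identical records always share every band; for non-identical records, whether they collide depends on their shingle overlap and on the choice of `bands` and `rows`.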

References

A Comparison of Blocking Methods for Record Linkage

Traditional blocking techniques are reviewed, and two variants of a method known as locality sensitive hashing, sometimes referred to as “private blocking,” are considered, in terms of their recall, reduction ratio, and computational complexity.

Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

The heart of the proposed hash function is a "rotation" scheme which densifies the sparse sketches of one permutation hashing in an unbiased fashion, thereby maintaining the LSH property, which makes the obtained sketches suitable for hash table construction.

Locality sensitive hashing: A comparison of hash function types and querying mechanisms

Exploiting asymmetry in hierarchical topic extraction

Efficient algorithms using the technique of Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in a collection are presented.

Improved Densification of One Permutation Hashing

A new densification procedure is provided which is provably better than the existing scheme and has the same cost of $O(d + KL)$ for query processing, thereby making it strictly preferable over the existing procedure.

Similarity-aware indexing for real-time entity resolution

Experimental results on a real-world database indicate that the total size of all data structures of this novel index approach grows sub-linearly with the size of the database, and that it allows matching of query records in sub-second time, more than two orders of magnitude faster than a traditional entity resolution index approach.

A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

P. Christen · Computer Science · IEEE Transactions on Knowledge and Data Engineering · 2012

A survey of 12 variations of 6 indexing techniques for record linkage and deduplication is presented; these techniques aim to reduce the number of record pairs compared in the matching process by removing obvious nonmatching pairs while maintaining high matching quality.

Similarity Search in High Dimensions via Hashing

Experimental results indicate that the novel hashing-based scheme for approximate similarity search scales well even for a relatively large number of dimensions, and that it improves running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition.

Blocking Methods Applied to Casualty Records from the Syrian Conflict

This work proposes locality sensitive hashing (LSH) methods for estimation of death counts in Syria and demonstrates the computational superiority and error rates of these methods by comparing the proposed approach with others in the literature.

In Defense of Minhash over Simhash

A theoretical answer, validated by experiments, is provided: MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search.