Corpus ID: 6482591

Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases

@inproceedings{Bilenko2002LearningTC,
  title={Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases},
  author={Mikhail Bilenko and Raymond J. Mooney},
  year={2002}
}
  • Mikhail Bilenko, Raymond J. Mooney
  • Published 2002
  • Mathematics
  • The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance…

    Citations

    Publications citing this paper (63 citations in total; a selection follows):

    • Semantic annotation and object extraction for very high resolution satellite images (cites background; highly influenced)
    • Unsupervised Duplicate Detection Using Sample Non-duplicates (cites methods; highly influenced)
    • Mining for Information Discovery on the Web: Overview and Illustrative Research (cites background and methods; highly influenced)
    • A Comparison of String Distance Metrics for Name-Matching Tasks (cites methods; highly influenced)
    • Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference (cites background and methods; highly influenced)
    • Supervised Hierarchical Clustering with Exponential Linkage (cites methods and background)
    • A Graph-Theoretic Fusion Framework for Unsupervised Entity Resolution (cites methods)
    • Distance-based Methods (cites methods)

    CITATION STATISTICS

    • 7 Highly Influenced Citations
