Discovering data quality rules
TLDR
This work proposes a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records.
Framework for Evaluating Clustering Algorithms in Duplicate Detection
TLDR
This work uses Stringer to evaluate the quality of the clusters obtained from several unconstrained clustering algorithms used in concert with approximate join techniques, and reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
A unified model for data and constraint repair
TLDR
This work presents a novel unified cost model that allows data and constraint repairs to be compared on an equal footing, and considers repairs over a database that is inconsistent with respect to a set of rules, modeled as functional dependencies (FDs).
Seeking Stable Clusters in the Blogosphere
TLDR
This paper formalizes intuition and presents efficient algorithms to identify keyword clusters in large collections of blog posts for specific temporal intervals, and formalizes problems related to the temporal properties of such clusters.
Continuous data cleaning
TLDR
This work introduces a continuous data cleaning framework that can be applied to dynamic data and constraint environments and that uses as evidence not only the data and constraints, but also the past repairs chosen and applied by a user (user repair preferences).
An Algebraic Approach Towards Data Cleaning
TLDR
This paper formally defines the notions of association rule and association function, presents results relating to these concepts, and proposes an algorithm for generating association rules from a given structured data set.
AutoDict: Automated Dictionary Discovery
TLDR
This demonstration will showcase the different information analysis and extraction features within AutoDict, and highlight the process of generating high-quality attribute dictionaries.
Restoring Consistency in Ontological Multidimensional Data Models via Weighted Repairs
TLDR
This paper proposes a framework for data quality assessment and repair for the OMD, formally defines a weight-based repair-by-deletion semantics, and presents an automatic weight generation mechanism that considers multiple input criteria.
Combining Quantitative and Logical Data Cleaning
TLDR
A novel framework within which quantitative and logical data cleaning approaches can be used synergistically to combine their respective strengths is proposed, and it is proved that every instance that can be generated by the repair algorithm is set-minimal.
Efficient Discovery of Ontology Functional Dependencies
TLDR
It is shown that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs, and an algorithm is given for discovering OFDs from data that uses the axioms to prune the search space, which is exponential in the number of attributes.