Discovering data quality rules
TLDR
We propose a data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records.
  • 244
  • 33
Framework for Evaluating Clustering Algorithms in Duplicate Detection
TLDR
In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques.
  • 195
  • 21
A unified model for data and constraint repair
  • Fei Chiang, R. Miller
  • Computer Science
  • IEEE 27th International Conference on Data…
  • 11 April 2011
TLDR
We present a novel unified cost model for constraint repair that allows data and constraint repairs to be compared on an equal footing, and present a qualitative case study using a well-known real dataset.
  • 82
  • 6
Seeking Stable Clusters in the Blogosphere
TLDR
In this paper, we formalize this intuition and present efficient algorithms to identify keyword clusters in large collections of blog posts for specific temporal intervals.
  • 77
  • 5
Continuous data cleaning
TLDR
We introduce a continuous data cleaning framework that adapts to the natural evolution in the data and in the constraints over time, and generates high-quality repairs.
  • 86
  • 1
AutoDict: Automated Dictionary Discovery
TLDR
We present AutoDict, a novel dictionary discovery tool that incorporates a set of measures, including information content, similarity, and conviction, to produce relevant and accurate dictionaries.
  • 8
  • 1
An Algebraic Approach Towards Data Cleaning
TLDR
We propose the use of information algebra as a general theory to describe structured data sets and data cleaning.
  • 8
  • 1
Combining Quantitative and Logical Data Cleaning
TLDR
We propose a new constraint-based cleaning strategy in which we use statistical distortion during cleaning to ensure the chosen (minimal) repair is of high quality.
  • 74
CurrentClean: Interactive Change Exploration and Cleaning of Stale Data
TLDR
We develop CurrentClean, a probabilistic system that identifies and cleans stale values and enables a user to interactively explore change in her data.
  • 1
Active repair of data quality rules
TLDR
The use of data quality rules, which capture business rules and domain constraints, is central to most data quality processes.
  • 10