• Publications
  • Influence
Anonymization of Set-Valued Data via Top-Down, Local Generalization
tl;dr
We propose a top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric.
  • 232
  • 39
  • Open Access
ClusterJoin: A Similarity Joins Framework using Map-Reduce
tl;dr
We propose ClusterJoin, a framework that partitions the data space based on the underlying data distribution and distributes each record to the partitions in which it may produce join results, based on a distance threshold.
  • 72
  • 8
  • Open Access
SEISA: set expansion by iterative similarity aggregation
tl;dr
In this paper, we study the problem of expanding a set of given seed entities into a more complete set by discovering other entities that also belong to the same concept set.
  • 68
  • 8
  • Open Access
Crawling deep web entity pages
tl;dr
Deep-web crawl is concerned with the problem of surfacing hidden content behind search interfaces on the Web.
  • 82
  • 7
  • Open Access
TEGRA: Table Extraction by Global Record Alignment
tl;dr
We model table extraction as a principled optimization problem: we allocate the tokens in each row sequentially to a fixed number of columns, such that the sum of coherence across all pairs of values in the same column is maximized.
  • 27
  • 7
  • Open Access
Utility-maximizing event stream suppression
tl;dr
We first formally define the problem of utility-maximizing event suppression with privacy preferences, and analyze its computational hardness.
  • 18
  • 5
  • Open Access
Concept Expansion Using Web Tables
tl;dr
We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output a ranked list of entities.
  • 44
  • 4
  • Open Access
On Load Shedding in Complex Event Processing
tl;dr
Complex Event Processing (CEP) is a stream processing model that focuses on detecting event patterns in continuous event streams.
  • 22
  • 3
  • Open Access
Preventing equivalence attacks in updated, anonymized data
tl;dr
We propose a graph-based anonymization algorithm that leverages solutions to the classic “min-cut/max-flow” problem and demonstrate with experiments that our algorithm is efficient and effective in preventing equivalence attacks.
  • 45
  • 2
  • Open Access
Uni-Detect: A Unified Approach to Automated Error Detection in Tables
tl;dr
We propose Uni-Detect, a unified framework to automatically detect diverse types of errors.
  • 9
  • 2