Anonymization of Set-Valued Data via Top-Down, Local Generalization
TLDR
A top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric is proposed.
SEISA: set expansion by iterative similarity aggregation
TLDR
A new general framework based on iterative similarity aggregation is proposed, and results are presented to show that, when using general-purpose web data for set expansion, this approach outperforms previous techniques in terms of both precision and recall.
ClusterJoin: A Similarity Joins Framework using Map-Reduce
TLDR
A ClusterJoin framework is proposed that partitions the data space based on the underlying data distribution and distributes each record to the partitions in which it may produce join results, based on the distance threshold. It also develops a dynamic load-balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions and greatly improves scalability.
Crawling deep web entity pages
TLDR
This work describes a prototype system that specializes in crawling entity-oriented deep-web sites and proposes techniques tailored to important subproblems, including query generation, empty-page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites.
Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning
TLDR
This work proposes a transfer-learning approach to EM that leverages pre-trained EM models from large-scale production knowledge bases (KBs), and its results suggest that the pre-trained approach is effective and outperforms existing EM methods.
TEGRA: Table Extraction by Global Record Alignment
TLDR
This work addresses the important problem of automatically extracting multi-column relational tables from data in "list" form, and develops an efficient 2-approximation algorithm that considerably outperforms state-of-the-art approaches in terms of quality.
Concept Expansion Using Web Tables
TLDR
Novel probabilistic ranking methods are developed that can model a new type of table-entity relationship and are significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.
Uni-Detect: A Unified Approach to Automated Error Detection in Tables
TLDR
This work proposes Uni-Detect, a unified framework to automatically detect diverse types of errors, and discovers thousands of FD violations, numeric outliers, spelling mistakes, etc., with better accuracy than existing algorithms specifically designed for each type of error.
Utility-maximizing event stream suppression
TLDR
This paper formally defines the problem of utility-maximizing event suppression with privacy preferences, and designs a suite of real-time solutions that solve this problem optimally at the event-type level.
On Load Shedding in Complex Event Processing
TLDR
This paper formalizes broad classes of CEP load-shedding scenarios as different optimization problems, presents an array of complexity results that reveal the hardness of these problems, and constructs shedding algorithms with performance guarantees.