Anonymization of Set-Valued Data via Top-Down, Local Generalization
A top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric is proposed.
ClusterJoin: A Similarity Joins Framework using Map-Reduce
- Akash Das Sarma, Yeye He, S. Chaudhuri
- Computer Science
- Proceedings of the VLDB Endowment
- 1 August 2014
A ClusterJoin framework is proposed that partitions the data space based on the underlying data distribution and distributes each record to the partitions in which it may produce join results, based on the distance threshold. A dynamic load-balancing scheme using sampling provides strong probabilistic guarantees on the size of partitions and greatly improves scalability.
SEISA: set expansion by iterative similarity aggregation
A new general framework based on iterative similarity aggregation is proposed, and results are presented to show that, when using general-purpose web data for set expansion, this approach outperforms previous techniques in terms of both precision and recall.
Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning
This work proposes a transfer-learning approach to EM, leveraging pre-trained EM models from large-scale, production knowledge bases (KB), and suggests that the pre-trained approach is effective and outperforms existing EM methods.
Crawling deep web entity pages
This work describes a prototype system that specializes in crawling entity-oriented deep-web sites and proposes techniques tailored to important subproblems, including query generation, empty page filtering, and URL deduplication, in the specific context of entity-oriented deep-web sites.
TEGRA: Table Extraction by Global Record Alignment
This work addresses the important problem of automatically extracting multi-column relational tables from data in "list" form, and develops an efficient 2-approximation algorithm that considerably outperforms the state-of-the-art approaches in terms of quality.
Uni-Detect: A Unified Approach to Automated Error Detection in Tables
This work proposes Uni-Detect, a unified framework to automatically detect diverse types of errors, and discovers thousands of FD violations, numeric outliers, spelling mistakes, etc., with better accuracy than existing algorithms specifically designed for each type of error.
Utility-maximizing event stream suppression
This paper formally defines the problem of utility-maximizing event suppression with privacy preferences, and designs a suite of real-time solutions, including one that solves the problem optimally at the event-type level.
On Load Shedding in Complex Event Processing
- Yeye He, Siddharth Barman, J. Naughton
- Engineering, Computer Science
- International Conference on Database Theory
- 16 December 2013
This paper formalizes broad classes of CEP load-shedding scenarios as different optimization problems, presents an array of complexity results that reveal the hardness of these problems, and constructs shedding algorithms with performance guarantees.
Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks
This work crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input/output tables at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data.