LSH Ensemble: Internet-Scale Domain Search
TLDR
We present a new index structure, Locality Sensitive Hashing (LSH) Ensemble, that solves the domain search problem using set containment at Internet scale.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values.
Table Union Search on Open Data
TLDR
We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories.
Making Open Data Transparent: Data Discovery on Open Data
TLDR
We present new table join and table union search solutions that provide interactive search speed even over massive collections of millions of attributes with heavily skewed cardinality distributions.
Data Lake Management: Challenges and Opportunities
TLDR
We consider how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Organizing Data Lakes for Navigation
TLDR
We present a new probabilistic model of how users interact with an organization and propose an approximate algorithm for the data lake organization problem.
AutoDict: Automated Dictionary Discovery
TLDR
We present AutoDict, a novel dictionary discovery tool that incorporates a set of measures, including information content, similarity, and conviction, to produce relevant and accurate dictionaries.
Parallelizing Filter-Verification Based Exact Set Similarity Joins on Multicores
TLDR
We propose a data-parallelization execution model along with various design considerations, including the use of filters, CPU affinity, record inlining, and batch inlining, to improve locality.
FLAML: A Fast and Lightweight AutoML Library
TLDR
We study the problem of automating the choice of learners and hyperparameters for an ad-hoc dataset and error metric at low computational cost, by conducting trials of different configurations on the given training data.
Auto-Join: Joining Tables by Leveraging Transformations
TLDR
We developed Auto-Join, a system that can automatically search over a rich space of operators to compose a transformation program, whose execution makes input tables equi-join-able.