LSH Ensemble: Internet-Scale Domain Search
TLDR
It is proved that an optimal partitioning exists for any data distribution, and that for datasets following a power-law distribution, as observed in Open Data and Web data corpora, it can be approximated using equi-depth partitioning.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
Making Open Data Transparent: Data Discovery on Open Data
TLDR
Open Data poses interesting new challenges for data integration research. One of those challenges is data discovery: how can one find new data sets within this ever-expanding sea of Open Data?
FLAML: A Fast and Lightweight AutoML Library
TLDR
A fast and lightweight library, FLAML, is built that optimizes for low computational resource use in finding accurate models and significantly outperforms top-ranked AutoML libraries on a large open-source AutoML benchmark under equal, or sometimes orders-of-magnitude smaller, budget constraints.
Data Lake Management: Challenges and Opportunities
TLDR
This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Organizing Data Lakes for Navigation
TLDR
A new probabilistic model of how users interact with an organization is presented and an approximate algorithm for the data lake organization problem is proposed that can help users find relevant tables that cannot be found by keyword search.
Auto-Join: Joining Tables by Leveraging Transformations
TLDR
This work develops Auto-Join, a system that automatically searches over a rich space of operators to compose a transformation program whose execution makes input tables equi-join-able, and an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently while ensuring joins succeed with high probability.
Parallelizing Filter-Verification Based Exact Set Similarity Joins on Multicores
TLDR
This paper adapts state-of-the-art SSJ algorithms including PPJoin and AllPairs and finds that using the exact number of hardware-provided hyperthreads leads to optimal runtimes for most experiments, and hand-crafted data structures do not always lead to better performance.
AutoDict: Automated Dictionary Discovery
TLDR
This demonstration will showcase the different information analysis and extraction features within AutoDict, and highlight the process of generating high quality attribute dictionaries.