LSH Ensemble: Internet-Scale Domain Search
The paper proves that an optimal partitioning exists for any data distribution and shows that, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, the optimal partitioning can be approximated using equi-depth partitioning.
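To illustrate what equi-depth partitioning means here, the following is a minimal sketch, not the LSH Ensemble implementation. It assumes each indexed domain is represented only by its size, and the function name `equi_depth_partitions` and the sample sizes are hypothetical. It splits the sorted sizes by rank so that every partition holds roughly the same number of domains.

```python
def equi_depth_partitions(sizes, num_partitions):
    """Split set sizes into partitions holding roughly equal numbers of sets.

    Returns a list of (min_size, max_size, count) tuples, one per non-empty
    partition. Illustrative sketch only: ties in size are split by rank.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    ordered = sorted(sizes)
    n = len(ordered)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic keeps chunk lengths within one of each other.
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        if start < end:
            chunk = ordered[start:end]
            partitions.append((chunk[0], chunk[-1], len(chunk)))
    return partitions


if __name__ == "__main__":
    # Hypothetical, heavily skewed domain sizes.
    domain_sizes = [1, 2, 3, 5, 8, 13, 21, 40, 100, 250, 600, 1000]
    for low, high, count in equi_depth_partitions(domain_sizes, 4):
        print(f"sizes {low}..{high}: {count} domains")
```

With skewed sizes, the partitions covering large domains span a much wider size range than those covering small domains, while each still holds the same number of domains.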
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets, and it outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
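For context, the problem JOSIE addresses can be sketched with a plain inverted-index baseline. This is not JOSIE itself: it probes every query token and applies no cost-based optimization. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(sets):
    """Map each token to the ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id, tokens in sets.items():
        for token in set(tokens):
            index[token].append(set_id)
    return index


def top_k_overlap(query, index, k):
    """Return the k sets with the largest overlap with the query.

    Results are (set_id, overlap) pairs ordered by overlap descending,
    then by set id ascending to break ties deterministically.
    """
    counts = Counter()
    for token in set(query):
        # One index probe per distinct query token.
        for set_id in index.get(token, ()):
            counts[set_id] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical column values from three tables.
    columns = {
        "a": {1, 2, 3},
        "b": {2, 3, 4, 5},
        "c": {9},
    }
    index = build_inverted_index(columns)
    print(top_k_overlap({2, 3, 4}, index, k=2))
```

The baseline reads every posting list in full, which is the cost that an approach trading off set reads against index probes aims to reduce.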
Table Union Search on Open Data
This work defines the table union search problem, presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
Making Open Data Transparent: Data Discovery on Open Data
- Renée J. Miller, F. Nargesian, Erkang Zhu, Christina Christodoulakis, K. Pu, P. Andritsos
- Computer Science
- IEEE Data Eng. Bull.
Open Data poses interesting new challenges for data integration research. One of those challenges is data discovery: how to find new data sets within this ever-expanding sea of Open Data.
FLAML: A Fast and Lightweight AutoML Library
FLAML, a fast and lightweight library, optimizes for low computational resource use in finding accurate models and significantly outperforms top-ranked AutoML libraries on a large open-source AutoML benchmark under equal, or sometimes orders of magnitude smaller, budget constraints.
Data Lake Management: Challenges and Opportunities
- F. Nargesian, Erkang Zhu, Renée J. Miller, K. Pu, Patricia C. Arocena
- Computer Science
- Proc. VLDB Endow.
- 1 August 2019
This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Organizing Data Lakes for Navigation
- F. Nargesian, K. Pu, Erkang Zhu, Bahar Ghadiri Bashardoost, Renée J. Miller
- Computer Science
- SIGMOD Conference
- 29 May 2020
A new probabilistic model of how users interact with an organization is presented and an approximate algorithm for the data lake organization problem is proposed that can help users find relevant tables that cannot be found by keyword search.
Auto-Join: Joining Tables by Leveraging Transformations
This work develops Auto-Join, a system that automatically searches over a rich space of operators to compose a transformation program whose execution makes input tables equi-joinable. It also develops an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently while ensuring that joins succeed with high probability.
Parallelizing Filter-Verification Based Exact Set Similarity Joins on Multicores
This paper adapts state-of-the-art SSJ algorithms including PPJoin and AllPairs and finds that using the exact number of hardware-provided hyperthreads leads to optimal runtimes for most experiments, and hand-crafted data structures do not always lead to better performance.
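As background on the filter-verification pattern, here is a minimal single-threaded sketch of a prefix-filter self-join under a Jaccard threshold. It is in the spirit of AllPairs-style algorithms but is not the paper's parallel implementation; the function names and sample data are hypothetical, and tokens are assumed to be mutually comparable.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(sets, threshold):
    """Return index pairs (j, i), j < i, with Jaccard similarity >= threshold.

    Filter step: two sets can only reach the threshold if they share a token
    within their prefixes under a global token order (rare tokens first).
    Verify step: compute the exact similarity for surviving candidates.
    """
    freq = Counter(token for s in sets for token in s)
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]
    index = defaultdict(list)  # token -> ids of earlier sets with it in their prefix
    results = []
    for i, tokens in enumerate(ordered):
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        candidates = set()
        for token in tokens[:prefix_len]:
            candidates.update(index[token])
            index[token].append(i)
        for j in sorted(candidates):
            if jaccard(sets[i], sets[j]) >= threshold:
                results.append((j, i))
    return results


if __name__ == "__main__":
    data = [{1, 2, 3, 4}, {2, 3, 4, 5}, {7, 8, 9}]
    print(prefix_filter_join(data, 0.5))
```

Only pairs that share a prefix token reach the verification step, so most dissimilar pairs are never compared.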
AutoDict: Automated Dictionary Discovery
- Fei Chiang, P. Andritsos, Erkang Zhu, Renée J. Miller
- Computer Science
- IEEE 28th International Conference on Data…
- 1 April 2012
This demonstration will showcase the different information analysis and extraction features within AutoDict, and highlight the process of generating high quality attribute dictionaries.