LSH Ensemble: Internet-Scale Domain Search
TLDR
It is proved that an optimal partitioning exists for any data distribution and that, for datasets following a power-law distribution, as observed in Open Data and Web data corpora, it can be approximated using equi-depth partitioning.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
Learning Feature Engineering for Classification
TLDR
This work presents a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks, based on learning the effectiveness of applying a transformation on numerical features, from past feature engineering experiences.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
Making Open Data Transparent: Data Discovery on Open Data
TLDR
Open Data poses interesting new challenges for data integration research. One of those challenges is data discovery: how to find new data sets within this ever-expanding sea of Open Data.
Data Lake Management: Challenges and Opportunities
TLDR
This tutorial considers how data lakes are introducing new problems including dataset discovery and how they are changing the requirements for classic problems including data extraction, data cleaning, data integration, data versioning, and metadata management.
Organizing Data Lakes for Navigation
TLDR
A new probabilistic model of how users interact with an organization is presented and an approximate algorithm for the data lake organization problem is proposed that can help users find relevant tables that cannot be found by keyword search.
AWLCO: All-Window Length Co-Occurrence
TLDR
This paper considers the all-window-length analysis model, which analyzes a sequence of events with respect to windows of all lengths, and proposes AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with the expected time complexity of $O(n)$ and space complexity of $O(\sqrt{n|I|})$.
Knowledge translation
TLDR
Kensho is a tool for generating mapping rules between two Knowledge Bases (KBs) that is able to automatically rank the generated mapping rules using a set of heuristics and can be used directly to exchange knowledge from source to target.
Sol–gel synthesis, structural and optical characteristics of Sr1−xZn2Si2yO7+δ: xEu2+ as a potential nanocrystalline phosphor for near-ultraviolet white light-emitting diodes
In this research, a new blue-emitting phosphor Eu2+-doped SrZn2Si2O7 was developed for white light-emitting diodes via the sol–gel process. Thermogravimetric-differential thermal analysis, X-ray
...