LSH Ensemble: Internet-Scale Domain Search
We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of …
  • 37
  • 7
Table Union Search on Open Data
We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they …
  • 37
  • 3
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct …
  • 14
  • 1
Making Open Data Transparent: Data Discovery on Open Data
Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles promising to make their Open Data complete, primary, and timely. These …
  • 9
  • 1
AutoDict: Automated Dictionary Discovery
An attribute dictionary is a set of attributes together with a set of common values of each attribute. Such dictionaries are valuable in understanding unstructured or loosely structured textual …
  • 8
  • 1
Auto-Join: Joining Tables by Leveraging Transformations
Traditional equi-join relies solely on string equality comparisons to perform joins. However, in scenarios such as ad-hoc data analysis in spreadsheets, users increasingly need to join tables whose …
  • 22
Data Lake Management: Challenges and Opportunities
The ubiquity of data lakes has created fascinating new challenges for data management research. In this tutorial, we review the state-of-the-art in data management for data lakes. We consider how …
  • 15
Searching Web Data using MinHash LSH
In this extended abstract, we explore the use of MinHash Locality Sensitive Hashing (MinHash LSH) to address the problem of indexing and searching Web data. We discuss a statistical tuning strategy …
  • 7
Interactive Navigation of Open Data Linkages
We developed Toronto Open Data Search to support the ad hoc, interactive discovery of connections or linkages between datasets. It can be used to efficiently navigate through the open data cloud. Our …
  • 9
Organizing Data Lakes for Navigation
We consider the problem of creating an effective navigation structure over a data lake. We define an organization as a navigation graph that contains nodes representing sets of attributes within a …
  • 2