Table Union Search on Open Data

@article{Nargesian2018TableUS,
  title={Table Union Search on Open Data},
  author={Fatemeh Nargesian and Erkang Zhu and Ken Q. Pu and Ren{\'e}e J. Miller},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={813--825}
}
We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. [...] We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show…
Data-driven domain discovery for structured datasets
TLDR
This paper proposes a data-driven approach that leverages value co-occurrence information across a large number of dataset columns to derive robust context signatures and infer domains; it outperforms state-of-the-art methods in the presence of incomplete columns and heterogeneous or erroneous data.
Relational Header Discovery using Similarity Search in a Table Corpus
TLDR
A fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative of the column values; it leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table.
TabEAno: Table to Knowledge Graph Entity Annotation
TLDR
This work proposes a novel approach, namely TabEAno, to semantically annotate table rows toward knowledge graph entities through a "two-cells" lookup strategy based on the assumption that there is an existing logical relation occurring in the knowledge graph between the two closed cells in the same row of the table.
Model-Driven Development of Web APIs to Access Integrated Tabular Open Data
TLDR
This paper proposes a model-driven approach to automatically generate Web APIs that homogeneously access multiple integrated tabular open datasets that can be integrated by means of join and union operations.
JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes
TLDR
The new algorithm, JOSIE (Joining Search using Intersection Estimation), minimizes the cost of set reads and inverted index probes used in finding the top-k sets and outperforms the state-of-the-art overlap set similarity search techniques on data lakes.
Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example
TLDR
This work presents a reference architecture that explicitly divides the end-to-end problem of discovering PJ-views over pathless table collections into a human and a technical problem, and presents Niffler, a system built to address the technical problem.
KTabulator: Interactive Ad hoc Table Creation using Knowledge Graphs
TLDR
KTabulator is an interactive system to effectively extract, build, or extend ad hoc tables from large corpora, by leveraging their computerized structures in the form of knowledge graphs.
Pytheas: Pattern-based Table Discovery in CSV Files
TLDR
This work proposes Pytheas: a principled method for automatically classifying lines in a CSV file and discovering tables within it based on the intuition that tables maintain a coherency of values in each column, and introduces a confidence measure for table discovery.
RONIN: Data Lake Exploration
Dataset discovery can be performed using search (with a query or keywords) to find relevant data. However, the result of this discovery can be overwhelming to explore. Existing navigation techniques…
Web Table Extraction, Retrieval and Augmentation: A Survey
TLDR
The objective of this survey is to synthesize and present two decades of research on web tables into six main categories of information access tasks: table extraction, table interpretation, table search, question answering, knowledge base augmentation, and table augmentation.

References

SHOWING 1-10 OF 41 REFERENCES
Finding related tables
TLDR
This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.
Annotating and searching web tables using entities, types and relationships
TLDR
This paper proposes new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express, and a new graphical model for making all these labeling decisions for each table simultaneously.
Recovering Semantics of Tables on the Web
TLDR
A system that attempts to recover the semantics of tables by enriching the table with additional annotations, which leverages a database of class labels and relationships automatically extracted from the Web.
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables
TLDR
A novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time is proposed and has significantly higher precision and coverage and four orders of magnitude faster response times compared with the state-of-the-art approach.
HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching
TLDR
This work addresses the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse by proposing a new technique based on the search engine's clicklogs.
Discovering Linkage Points over Web Data
TLDR
The basic schema-matching step is replaced with a more complex instance-based schema analysis and linkage discovery, and it is shown that even attributes with different meanings can sometimes be useful in aligning data.
WebTables: exploring the power of tables on the web
TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Answering Table Queries on the Web using Column Keywords
TLDR
The design of a structured search engine that returns a multi-column table in response to a query consisting of keywords describing each of its columns is presented, and a novel query segmentation model for matching keywords to table columns is defined.
Matching HTML Tables to DBpedia
TLDR
This paper presents the T2D gold standard for measuring and comparing the performance of HTML table to knowledge base matching systems, and shows that T2K Match discovers table-to-class correspondences with a precision of 94%.
LSH Ensemble: Internet-Scale Domain Search
TLDR
This work presents a new index structure, Locality Sensitive Hashing Ensemble, that solves the domain search problem using set containment at Internet scale, and proves that there exists an optimal partitioning for any distribution.