Table Union Search on Open Data

@article{Nargesian2018TableUS,
  title={Table Union Search on Open Data},
  author={Fatemeh Nargesian and Erkang Zhu and Ken Q. Pu and Ren{\'e}e J. Miller},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={813--825}
}
We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show…
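To make the idea of choosing a model per attribute pair concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the two measures (`value_jaccard`, `token_jaccard`), the function names, and the greedy one-to-one alignment are all hypothetical stand-ins, and raw scores are compared directly rather than through the distribution-aware comparison the abstract mentions.

```python
from itertools import product


def value_jaccard(a, b):
    """Jaccard similarity over whole cell values (hypothetical measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def token_jaccard(a, b):
    """Jaccard similarity over lowercase word tokens (hypothetical measure)."""
    ta = {t for v in a for t in v.lower().split()}
    tb = {t for v in b for t in v.lower().split()}
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Assumption: two simple stand-in measures, not the paper's models.
MEASURES = {"value": value_jaccard, "token": token_jaccard}


def best_measure(query_col, cand_col):
    """Return (measure_name, score) for the highest-scoring measure."""
    name = max(MEASURES, key=lambda m: MEASURES[m](query_col, cand_col))
    return name, MEASURES[name](query_col, cand_col)


def align(query, candidate):
    """Greedily build a one-to-one attribute alignment between two tables.

    Each table is a dict mapping attribute name to a list of string values.
    Pairs are taken in descending score order; zero-score pairs are skipped.
    """
    scored = []
    for (q, qv), (c, cv) in product(query.items(), candidate.items()):
        name, score = best_measure(qv, cv)
        scored.append((score, q, c, name))
    scored.sort(reverse=True)

    used_q, used_c, matching = set(), set(), []
    for score, q, c, name in scored:
        if score > 0 and q not in used_q and c not in used_c:
            matching.append((q, c, name, score))
            used_q.add(q)
            used_c.add(c)
    return matching
```

In this sketch, the number of aligned attributes is simply however many non-zero pairs the greedy pass accepts; choosing that number optimally, as the abstract describes, would need a principled stopping criterion that is not shown here.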

Data-driven domain discovery for structured datasets

TLDR
This paper proposes a data-driven approach that leverages value co-occurrence information across a large number of dataset columns to derive robust context signatures and infer domains, which is robust and outperforms state-of-the-art methods in the presence of incomplete columns, heterogeneous or erroneous data.

Relational Header Discovery using Similarity Search in a Table Corpus

TLDR
A fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative for the column values, which leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table.

TabEAno: Table to Knowledge Graph Entity Annotation

TLDR
This work proposes a novel approach, namely TabEAno, to semantically annotate table rows toward knowledge graph entities through a "two-cells" lookup strategy based on the assumption that there is an existing logical relation occurring in the knowledge graph between the two closed cells in the same row of the table.

Model-Driven Development of Web APIs to Access Integrated Tabular Open Data

TLDR
This paper proposes a model-driven approach to automatically generate Web APIs that homogeneously access multiple integrated tabular open datasets that can be integrated by means of join and union operations.

Extracting N-ary Facts from Wikipedia Table Clusters

TLDR
This paper proposes a novel knowledge extraction technique that transforms and clusters similar tables into fewer unified ones to overcome the problem of table diversity and applies a technique that relies on functional dependencies to judiciously interpret the table and extract n-ary relations.

Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example

TLDR
This work presents a reference architecture that explicitly divides the end-to-end problem of discovering PJ-views over pathless table collections into a human and a technical problem, and presents Niffler, a system built to address the technical problem.

KTabulator: Interactive Ad hoc Table Creation using Knowledge Graphs

TLDR
KTabulator is an interactive system to effectively extract, build, or extend ad hoc tables from large corpora, by leveraging their computerized structures in the form of knowledge graphs.

Merging Web Tables for Relation Extraction with Knowledge Graphs

TLDR
This work proposes an observed schema for individual tables, which is used to group and merge tables, and compares the precision and number of triples extracted with and without table merging, showing that with merging, a larger number of triples are extracted at a similar precision.

MATE: Multi-Attribute Table Extraction

TLDR
MATE is introduced, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key, and a filtering layer that uses a novel hash function, Xash, which allows the system to efficiently prune tables with non-joinable rows.

Towards open data discovery: a comparative study

TLDR
This work presents a comparative study involving three different methods: a hybrid algorithm based on Linear Discriminant Analysis and Word2Vec, Cosine similarity measure, and a Semantic Test proposed for Open Data search, evaluated on its ability to discover, among eight open datasets, the most likely one to meet an input question.
...

References

Finding related tables

TLDR
This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.

Annotating and searching web tables using entities, types and relationships

TLDR
This paper proposes new machine learning techniques to annotate table cells with entities that they likely mention, table columns with types from which entities are drawn for cells in the column, and relations that pairs of table columns seek to express, and a new graphical model for making all these labeling decisions for each table simultaneously.

Recovering Semantics of Tables on the Web

TLDR
A system that attempts to recover the semantics of tables by enriching the table with additional annotations, which leverages a database of class labels and relationships automatically extracted from the Web.

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

TLDR
A novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time is proposed and has significantly higher precision and coverage and four orders of magnitude faster response times compared with the state-of-the-art approach.

HAMSTER: Using Search Clicklogs for Schema and Taxonomy Matching

TLDR
This work addresses the problem of unsupervised matching of schema information from a large number of data sources into the schema of a data warehouse by proposing a new technique based on the search engine's clicklogs.

WebTables: exploring the power of tables on the web

TLDR
The WEBTABLES system develops new techniques for keyword search over a corpus of tables, and shows that they can achieve substantially higher relevance than solutions based on a traditional search engine.

Answering Table Queries on the Web using Column Keywords

TLDR
The design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns is presented and a novel query segmentation model for matching keywords to table columns is defined.

Matching HTML Tables to DBpedia

TLDR
This paper presents the T2D gold standard for measuring and comparing the performance of HTML table to knowledge base matching systems, a task in which correspondences between rows/columns and entities/schema elements of the knowledge base need to be found, and shows that T2K Match discovers table-to-class correspondences with a precision of 94%.

Holistic Schema Matching for Web Query Interfaces

TLDR
A count-based greedy algorithm to identify which attributes are more likely to be matched in the query interfaces of real Web databases, Holistic Schema Matching (HSM), which can identify both simple matching i.e., 1:1 matching, and complex matching, i.

LSH Ensemble: Internet-Scale Domain Search

TLDR
It is proved that there exists an optimal partitioning for any data distribution, as observed in Open Data and Web data corpora, and for datasets following a power-law distribution, it can be approximated using equi-depth.