Auto-Join: Joining Tables by Leveraging Transformations

@article{Zhu2017AutoJoinJT,
  title={Auto-Join: Joining Tables by Leveraging Transformations},
  author={Erkang Zhu and Yeye He and Surajit Chaudhuri},
  journal={Proc. VLDB Endow.},
  year={2017},
  volume={10},
  pages={1034--1045}
}
Traditional equi-join relies solely on string equality comparisons to perform joins. However, in scenarios such as ad-hoc data analysis in spreadsheets, users increasingly need to join tables whose join-columns are from the same semantic domain but use different textual representations, for which transformations are needed before equi-join can be performed. We developed Auto-Join, a system that can automatically search over a rich space of operators to compose a transformation program, whose… 
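The gap the abstract describes can be illustrated with a minimal Python sketch. It is not the Auto-Join system: Auto-Join searches for the transformation program automatically, whereas here the transformation is written by hand. The tables, values, and the function names `equi_join` and `last_first_to_first_last` are all hypothetical, chosen only for illustration.

```python
# Hypothetical toy tables whose join columns hold the same names
# in two different textual representations.
employees = [("Doe, Jane", 1), ("Smith, John", 2)]      # "Last, First"
offices = [("Jane Doe", "NYC"), ("John Smith", "LA")]   # "First Last"


def equi_join(left, right, key=lambda s: s):
    """Join on exact string equality after applying `key` to the left join column."""
    index = {}
    for r_key, r_val in right:
        index.setdefault(r_key, []).append(r_val)
    return [
        (l_key, l_val, r_val)
        for l_key, l_val in left
        for r_val in index.get(key(l_key), [])
    ]


def last_first_to_first_last(name):
    """Hand-written transformation (an assumption): 'Last, First' -> 'First Last'."""
    last, first = name.split(", ", 1)
    return f"{first} {last}"


# A plain equi-join finds no matches because the strings differ.
plain = equi_join(employees, offices)

# Transforming the left join column first makes the equi-join succeed.
transformed = equi_join(employees, offices, key=last_first_to_first_last)
```

In this sketch `plain` is empty, while `transformed` pairs each employee row with its office row.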
MATE: Multi-Attribute Table Extraction
TLDR
MATE is introduced, a table discovery system that leverages a novel hash-based index, which enables n-ary join discovery through a space-efficient super key, and a filtering layer that uses a novel hash function, Xash, which allows the system to efficiently prune tables with non-joinable rows.
Auto-transform
TLDR
This work demonstrates that there is a rich class of transformations in TBP that can be "learned" from large collections of paired table columns, and shows the proposed method can harvest such transformations across diverse domains and corpora.
Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations
TLDR
Transform-Data-by-Example (TDE) is developed, which works like a search engine for data transformations, so that users only need to provide a few input/output examples to demonstrate a desired transformation, and TDE can interactively find relevant functions to synthesize new programs consistent with all examples.
Putting Things into Context: Rich Explanations for Query Answers using Join Graphs
TLDR
This work proposes a new approach for explaining query results by augmenting provenance with information from other related tables in the database by using a suite of optimization techniques.
Transform-Data-by-Example (TDE): Extensible Data Transformation in Excel
TLDR
Transform-Data-by-Example (TDE) is an extensible data transformation system that can leverage rich transformation logic in source code, DLLs, web services, and mapping tables, so that end users only need to provide a few input/output examples, and TDE can synthesize the desired programs using relevant transformation logic from these sources.
PEXESO: Finding Joinable Tables by Distance-based Similarities
TLDR
A novel search problem, finding joinable tables with distance-based similarities on numerical data, is defined, and PEXESO, a general framework that handles arbitrary threshold values and a large space of similarity functions, is developed.
Interactive rule correction, imputation and execution in rule-driven database completion system
K. Reddy, 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2020
TLDR
This paper solves the problem of correcting database completion rules, imputing missing rule conditions, and executing them interactively and efficiently by leveraging programming-by-example data transformations, sketching data structures such as Bloom filters, and entity resolution rules.
Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation
TLDR
This paper proposes a human-in-the-loop framework that interleaves different types of questions to optimize human involvement, and develops a question scheduling framework that judiciously selects questions to maximize the accuracy of the final golden records.
Auto-Transform: Learning-to-Transform by Patterns
TLDR
It is demonstrated that there is a rich class of transformations in TBP that can be “learned” from large collections of paired table columns, and it is shown the proposed method can harvest such transformations across diverse domains and corpora.
Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
TLDR
LAZO is a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set.

References

SEMA-JOIN: Joining Semantically-Related Tables Using Big Table Corpora
TLDR
The main idea is to use a data-driven method that leverages a big table corpus with over 100 million tables to determine statistical correlation between cell values at both the row level and the column level, and to formulate the join prediction problem as an optimization problem.
A Primitive Operator for Similarity Joins in Data Cleaning
TLDR
This paper proposes a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond textual similarity.
TEGRA: Table Extraction by Global Record Alignment
TLDR
This work addresses the important problem of automatically extracting multi-column relational tables from such lists in a "list" form, and develops an efficient 2-approximation algorithm that considerably outperforms the state-of-the-art approaches in terms of quality.
Spreadsheet table transformations from examples
TLDR
This work presents an automatic technique that takes from a user an example of how a table of data should be transformed and returns a program that implements the transformation described by the example. It also presents TableProg, a language of programs that can describe the transformations real users require.
Foofah: Transforming Data By Example
TLDR
This paper develops a technique to synthesize data transformation programs by example, reducing this burden by allowing the analyst to describe the transformation with a small input-output example pair, without being concerned with the transformation steps required to get there.
BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations
TLDR
A data structure InputDataGraph is developed to succinctly represent a large set of logical patterns that are shared across the input data, and used to efficiently learn substring expressions in a new PBE system BlinkFill.
Harvesting Relational Tables from Lists on the Web
TLDR
This work proposes a novel technique for extracting tables from lists that is domain-independent and operates in a fully unsupervised manner; the authors believe there are likely to be tens of millions of useful, queryable relational tables extractable from lists on the Web.
Mining database structure; or, how to build a data quality browser
TLDR
Techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database are presented.
Discovering Linkage Points over Web Data
TLDR
The basic schema-matching step is replaced with a more complex instance-based schema analysis and linkage discovery, and it is shown that even attributes with different meanings can sometimes be useful in aligning data.
iMAP: discovering complex semantic matches between database schemas
TLDR
The iMAP system is described, which semi-automatically discovers both 1-1 and complex matches, and introduces a novel feature that generates explanations of predicted matches, to provide insights into the matching process and suggest actions to converge on correct matches quickly.