Efficiently Transforming Tables for Joinability

Arash Dargahi Nobari and Davood Rafiei
2022 IEEE 38th International Conference on Data Engineering (ICDE)
Data from different sources rarely conform to a single format even when they describe the same set of entities, which raises concerns when data from multiple sources must be joined or cross-referenced. Such formatting mismatches are unavoidable when data is gathered from various public and third-party sources. Commercial database systems cannot perform the join when the data representation or formatting differs, and manual reformatting is both time-consuming and…

BareTQL: An Interactive System for Searching and Extraction of Open Data Tables

This work presents BareTQL, an interactive system for querying open data tables in the presence of the aforementioned challenges, which aims to provide an easy and efficient way of querying incomplete data in tables with little or no schema.



Auto-Join: Joining Tables by Leveraging Transformations

This work develops Auto-Join, a system that automatically searches over a rich space of operators to compose a transformation program whose execution makes input tables equi-joinable. It also develops an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently while ensuring that joins succeed with high probability.
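To make the idea of a transformation that renders two tables equi-joinable concrete, here is a minimal Python sketch. It is not Auto-Join's algorithm: the transformation is written by hand rather than found by search, and the names `transform` and `equi_join`, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
def transform(key: str) -> str:
    """Hypothetical hand-written transformation: 'Last, First' -> 'first last'."""
    last, first = key.split(", ")
    return f"{first} {last}".lower()


def equi_join(left, right, key_fn=lambda k: k):
    """Hash join on exact key equality, after applying key_fn to each left key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left
        for right_value in index.get(key_fn(key), [])
    ]


# Made-up rows: both tables describe the same entities in different formats.
left = [("Lovelace, Ada", "row-1"), ("Turing, Alan", "row-2")]
right = [("ada lovelace", "rec-A"), ("alan turing", "rec-B")]

plain = equi_join(left, right)              # keys never match as written
joined = equi_join(left, right, transform)  # keys match after the transformation
```

Without the transformation the join returns no rows; with it, every left row finds its counterpart. A system in this line of work would have to discover something like `transform` automatically instead of relying on a person to write it.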

Spreadsheet data manipulation using examples

This work presents a programming by example methodology that allows end users to automate such repetitive tasks over large spreadsheet data by designing a domain-specific language and developing a synthesis algorithm that can learn programs in that language from user-provided examples.

BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations

A data structure InputDataGraph is developed to succinctly represent a large set of logical patterns that are shared across the input data, and used to efficiently learn substring expressions in a new PBE system BlinkFill.

Fast-join: An efficient method for fuzzy token matching based string similarity join

This paper proposes a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions by allowing fuzzy matches between two tokens. The method achieves high efficiency and result quality and significantly outperforms state-of-the-art methods.
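The general idea of letting tokens match approximately inside a token-based measure can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's definition: token similarity comes from Python's `difflib.SequenceMatcher`, tokens are paired greedily, and the function names and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    """Character-level similarity of two tokens, in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_jaccard(s1: str, s2: str, threshold: float = 0.8) -> float:
    """Jaccard-style score in which sufficiently similar tokens count as a match."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    # Score every cross pair of tokens, best pairs first.
    pairs = sorted(
        ((token_sim(a, b), i, j) for i, a in enumerate(t1) for j, b in enumerate(t2)),
        reverse=True,
    )
    used1, used2 = set(), set()
    overlap = 0.0
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are too dissimilar to count
        if i not in used1 and j not in used2:
            # Greedy one-to-one pairing: each token is matched at most once.
            used1.add(i)
            used2.add(j)
            overlap += sim
    denom = len(t1) + len(t2) - overlap
    return overlap / denom if denom else 1.0


typo_score = fuzzy_jaccard("data enginering", "data engineering")
```

With exact token matching, the misspelled "enginering" would not match "engineering" at all, so plain Jaccard would give 1/3 for this pair. The fuzzy variant credits the near match and yields a much higher score.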

Semantic Table Retrieval using Keyword and Table Queries

This work proposes a semantic table retrieval framework for matching information needs (keyword or table queries) against tables in multiple semantic spaces and introduces various similarity measures for matching those semantic representations.

WebLens: Towards Web-scale Data Integration, Training the Models

WebLens, a scalable data integration system, first trains deep learning models to find and match semantically similar tables, then derives mediated schemas for these subsets to enable uniform access to all relevant data.

Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples

Experiments suggest that the proposed Auto-FuzzyJoin significantly outperforms existing unsupervised approaches, and is surprisingly competitive even against supervised approaches (e.g., Magellan and DeepMatcher) when 50% of ground-truth labels are used as training data.


This work demonstrates that there is a rich class of transformations in TBP that can be "learned" from large collections of paired table columns, and shows the proposed method can harvest such transformations across diverse domains and corpora.

Interactive Mapping Specification with Exemplar Tuples

This article presents an interactive framework for schema mapping specification suited for non-expert users, and presents a quasi-lattice-based exploration of the space of all possible mappings that satisfy arbitrary user exemplar tuples.

Auto-EM: End-to-end Fuzzy Entity-Matching using Pre-trained Deep Models and Transfer Learning

This work proposes a transfer-learning approach to EM, leveraging pre-trained EM models from large-scale, production knowledge bases (KB), and suggests that the pre-trained approach is effective and outperforms existing EM methods.