• Corpus ID: 238252940

MATE: Multi-Attribute Table Extraction

  title={MATE: Multi-Attribute Table Extraction},
  author={Mahdi Esmailoghli and Jorge-Arnulfo Quiané-Ruiz and Ziawasch Abedjan},
  journal={Proc. VLDB Endow.},
A core operation in data discovery is to find joinable tables for a given table. Real-world tables include both unary and n-ary join keys. However, existing table discovery systems are optimized for unary joins and are ineffective and slow in the presence of n-ary keys. In this paper, we introduce Mate, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key. We design a filtering layer that uses a novel hash, Xash… 
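
The idea of a per-row super key used as a filter can be illustrated with a minimal sketch. The abstract is truncated before it describes Xash, so the hash below is a generic SHA-256-based stand-in and not the authors' function. The 128-bit width, the two bits set per value, and the names `cell_hash`, `super_key`, `may_contain`, and `find_joinable_rows` are all assumptions made for illustration. The sketch ORs the hashes of a row's cell values into one bit mask, then discards rows whose mask cannot cover the mask of a composite (n-ary) query key before verifying the survivors exactly.

```python
import hashlib

BITS = 128  # assumed super-key width; not taken from the paper


def cell_hash(value: str, bits: int = BITS, ones: int = 2) -> int:
    """Map a cell value to a sparse bit mask with at most `ones` bits set.

    Generic stand-in hash for illustration only; this is NOT Xash.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    mask = 0
    for i in range(ones):
        # Use 4 digest bytes to choose each bit position.
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % bits
        mask |= 1 << pos
    return mask


def super_key(values) -> int:
    """Bitwise OR of the hashes of all given cell values."""
    key = 0
    for v in values:
        key |= cell_hash(str(v))
    return key


def may_contain(row_key: int, query_values) -> bool:
    """Filter: True if the row might contain every query value.

    False positives are possible; false negatives are not, because
    every bit set by a contained value is also set in the row key.
    """
    q = super_key(query_values)
    return (row_key & q) == q


def find_joinable_rows(rows, query_key):
    """Apply the super-key filter, then verify survivors exactly."""
    keys = [super_key(r) for r in rows]  # a real index would precompute these
    wanted = {str(v) for v in query_key}
    hits = []
    for row, key in zip(rows, keys):
        if may_contain(key, query_key) and wanted <= {str(v) for v in row}:
            hits.append(row)
    return hits
```

Storing one fixed-width mask per row, rather than one index entry per column combination, is what keeps such a filter compact. The exact verification step is still needed because a bit-mask filter of this kind can admit false positives.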

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach
This paper proposes PEXESO, a framework for joinable table discovery in data lakes that identifies substantially more tables than equi-joins and outperforms other similarity-based options; the join results are useful in data enrichment for machine learning tasks.
Unary and n-ary inclusion dependency discovery in relational databases
This paper proposes a two-step approach to data-mining algorithms for inclusion dependency inference in a given database, and shows how approximate INDs, which almost hold, can be safely integrated into the unary and n-ary discovery algorithms.
Correlation Sketches for Approximate Join-Correlation Queries
A sketching method is proposed that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and different scoring strategies are explored that effectively rank the query results based on how well the columns are correlated with the query.
Divide & Conquer-based Inclusion Dependency Discovery
Binder, an IND detection system capable of detecting both unary and n-ary INDs, is proposed. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data.
Auto-Join: Joining Tables by Leveraging Transformations
This work develops Auto-Join, a system that automatically searches over a rich space of operators to compose a transformation program whose execution makes input tables equi-joinable. It also develops an optimal sampling strategy that allows Auto-Join to scale to large datasets efficiently while ensuring that joins succeed with high probability.
Efficient Similarity Join and Search on Multi-Attribute Data
This paper proves that constructing an optimal prefix tree is NP-complete, develops a greedy algorithm to achieve high performance, extends the cost model to support similarity search, and devises a budget-based algorithm to construct multiple high-quality prefix trees.
Top-k Set Similarity Joins
An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently; it is based on the prefix filtering principle and employs tight upper bounding of the similarity values of unseen pairs.
InfoGather: entity augmentation and attribute discovery by holistic matching with web tables
This paper proposes a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Compared with the state-of-the-art approach, it has significantly higher precision and coverage and response times that are four orders of magnitude faster.
Scalable Discovery of Unique Column Combinations
This paper devises Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. It first models the problem as a graph coloring problem, analyzes the pruning effect of individual combinations, and presents a hybrid column-based pruning technique.
To Join or Not to Join?: Thinking Twice about Joins before Feature Selection
This work identifies the core technical issue that could cause accuracy to decrease in some cases and analyzes this issue theoretically to design easy-to-understand decision rules to predict when it is safe to avoid joins, which led to significant reductions in the runtime of some popular feature selection methods.