JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

@article{Zhu2019JOSIEOS,
  title={JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes},
  author={Erkang Zhu and Dong Deng and Fatemeh Nargesian and Ren{\'e}e J. Miller},
  journal={Proceedings of the 2019 International Conference on Management of Data},
  year={2019}
}
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are… 
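The formulation in the abstract (columns as sets, matching values as set intersection) can be illustrated with a small sketch. This is a naive inverted-index baseline under our own assumptions, not the JOSIE algorithm itself; the names `build_inverted_index`, `top_k_overlap`, and the toy `lake` data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import heapq


def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it."""
    index = defaultdict(set)
    for col_id, values in columns.items():
        for value in set(values):  # columns are treated as sets of distinct values
            index[value].add(col_id)
    return index


def top_k_overlap(query_values, index, k):
    """Return up to k (column id, overlap) pairs, largest overlap first."""
    counts = defaultdict(int)
    for value in set(query_values):
        # Every column listed under this value shares it with the query.
        for col_id in index.get(value, ()):
            counts[col_id] += 1
    # Sort by descending overlap; break ties by column id for determinism.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data lake: column id -> raw column values.
lake = {
    "cities.name": ["Toronto", "Boston", "Paris", "Berlin"],
    "offices.city": ["Toronto", "Boston", "Tokyo"],
    "teams.home": ["Boston", "Boston", "Chicago"],
}
index = build_inverted_index(lake)
result = top_k_overlap(["Toronto", "Boston", "Berlin", "Boston"], index, k=2)
print(result)
```

This baseline reads the full posting list of every query value, which is the kind of cost that becomes expensive on large data lakes; it is meant only to make the problem statement concrete.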
PEXESO: Finding Joinable Tables by Distance-based Similarities
TLDR
A novel search problem of finding joinable tables with distance-based similarities on numerical data is defined, and PEXESO, a general framework that handles arbitrary threshold values and a large space of similarity functions, is developed.
Adaptive Top-k Overlap Set Similarity Joins
TLDR
A solution to the top-k overlap set similarity join (TkOSSJ) that returns k pairs of sets with the highest overlap similarities is proposed and an adaptive step size algorithm is presented that is capable of automatically adjusting the step size, thus reducing redundant computations.
MATE: Multi-Attribute Table Extraction
TLDR
MATE is introduced, a table discovery system that leverages a novel hash-based index that enables n-ary join discovery through a space-efficient super key, and a filtering layer that uses a novel hash function, Xash, which allows the system to efficiently prune tables with non-joinable rows.
Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example
TLDR
This work presents a reference architecture that explicitly divides the end-to-end problem of discovering PJ-views over pathless table collections into a human and a technical problem, and presents Niffler, a system built to address the technical problem.
Relational Header Discovery using Similarity Search in a Table Corpus
TLDR
A fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative for the column values, which leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table.
How to reduce the search space of Entity Resolution: with Joins, Blocking or Nearest Neighbor search? [Experiment, Analysis & Benchmark Papers]
TLDR
This work performs the first systematic experimental study that investigates the relative performance of the main representatives per category over 10 real-world datasets, and considers a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets.
Scalable Data Discovery Using Profiles
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportions between candidate attributes, and implements this approach in a system called NextiaJD, and presents extensive experiments to show the predictive performance and computational efficiency of this method.
Entities with Quantities
  • G. Weikum
  • Computer Science
    IEEE Data Eng. Bull.
  • 2020
TLDR
By detecting entity mentions in web content and normalizing them onto KG entries, it has become possible to answer entity-centric queries about people, places and products almost as precisely and concisely as a database query.
How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?
TLDR
This work performs the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets, demonstrating the superiority of blocking workflows and string similarity joins.
Towards Scalable Data Discovery
TLDR
This work defines a novel notion of join quality that relies on a metric considering both the containment and cardinality proportion between join candidate attributes, and is able to scale-up to larger volumes of data.

References

SHOWING 1-10 OF 39 REFERENCES
Set Similarity Joins on MapReduce: An Experimental Survey
TLDR
This paper surveys ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm, and empirically compares the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications.
Leveraging Set Relations in Exact Set Similarity Join
TLDR
This paper explores index-level set relations and derives an algorithm which incrementally generates the answer of one set from an already computed answer of another similar set rather than compute the answer from scratch to reduce the computational cost.
SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints
TLDR
It is shown that selecting the optimal signature in this space of signatures is NP-complete, and based on insights from the characterization of the space, two novel filters which help to prune the candidates further before verification are proposed.
PASS-JOIN: A Partition-based Method for Similarity Joins
TLDR
This paper studies string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold, using a partition-based method called Pass-Join.
Table Union Search on Open Data
TLDR
This work defines the table union search problem and presents a probabilistic solution for finding tables that are unionable with a query table within massive repositories, and proposes a data-driven approach that automatically determines the best model to use for each pair of attributes.
String similarity search and join: a survey
TLDR
A comprehensive survey on string similarity search and join is presented and widely-used similarity functions to quantify the similarity are introduced, including approximate entity extraction, type-ahead search, and approximate substring matching.
Finding related tables
TLDR
This work considers the problem of finding related tables in a large corpus of heterogeneous tables and proposes a framework that captures several types of relatedness, including tables that are candidates for joins and tables that are candidates for union.
An Empirical Evaluation of Set Similarity Join Techniques
TLDR
This work conducts extensive experiments on seven state-of-the-art algorithms for set similarity joins and shows that efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques.
ClusterJoin: A Similarity Joins Framework using Map-Reduce
TLDR
A ClusterJoin framework that partitions the data space based on the underlying data distribution, and distributes each record to partitions in which it may produce join results based on the distance threshold, and develops a dynamic load balancing scheme using sampling, which provides strong probabilistic guarantees on the size of partitions, and greatly improves scalability.
Efficient Merging and Filtering Algorithms for Approximate String Searches
TLDR
This paper develops several algorithms that can greatly improve the performance of existing algorithms and studies how to integrate existing filtering techniques with these algorithms, and shows that they should be used together judiciously.