JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes

@article{Zhu2019JOSIEOS,
  title={JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes},
  author={Erkang Zhu and Dong Deng and F. Nargesian and R. Miller},
  journal={Proceedings of the 2019 International Conference on Management of Data},
  year={2019}
}
  • Computer Science
We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are…
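The problem formulation in the abstract (columns as sets, joinability as the number of shared distinct values) can be illustrated with a small sketch. This is a naive inverted-index baseline for top-k overlap search, not the JOSIE algorithm itself; the function names, the column identifiers, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import heapq


def build_inverted_index(columns):
    """Map each distinct value to the set of column ids that contain it.

    `columns` is a dict {column_id: iterable of values}; the ids are
    hypothetical names, not anything defined by the paper.
    """
    index = defaultdict(set)
    for col_id, values in columns.items():
        for value in set(values):  # treat each column as a set
            index[value].add(col_id)
    return index


def top_k_overlap(query_values, index, k):
    """Return up to k (column_id, overlap) pairs with the largest overlap.

    Overlap is the number of distinct query values that also occur in the
    candidate column. Ties are broken by column id for determinism.
    """
    counts = Counter()
    for value in set(query_values):
        for col_id in index.get(value, ()):
            counts[col_id] += 1
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data lake: three columns from three made-up tables.
columns = {
    "cities.name": ["Toronto", "Boston", "Paris", "Boston"],
    "offices.city": ["Toronto", "Paris", "Lima"],
    "people.first": ["Paris", "Ana"],
}
index = build_inverted_index(columns)

# Query column: the join column of the table we want to extend.
query = ["Toronto", "Paris", "Boston", "Oslo"]
result = top_k_overlap(query, index, k=2)
print(result)
```

This baseline reads every posting list for every query value, which is the cost a dedicated search algorithm would try to reduce on large data lakes.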
27 Citations
  • Adaptive Top-k Overlap Set Similarity Joins
  • Data Lake Organization
  • Scalable Data Discovery Using Profiles
  • Entities with Quantities (G. Weikum, IEEE Data Eng. Bull., 2020)
  • Towards Scalable Data Discovery
  • Organizing Data Lakes for Navigation
  • Discovering Related Data At Scale
  • Finding Related Tables in Data Lakes for Interactive Data Science
