Efficient parallel set-similarity joins using MapReduce

@inproceedings{Vernica2010EfficientPS,
  title={Efficient parallel set-similarity joins using MapReduce},
  author={Rares Vernica and Michael J. Carey and Chen Li},
  booktitle={SIGMOD Conference},
  year={2010}
}
In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the…
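
To make the idea of a set-similarity self-join concrete, here is a minimal single-process sketch in Python. It is not the authors' 3-stage algorithm, which the abstract above does not spell out. It assumes Jaccard similarity over whitespace tokens, a naive map step keyed on every token, and illustrative names (`map_phase`, `reduce_phase`, `self_join`) that do not come from the paper.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def map_phase(records):
    """Emit one (token, record_id) pair per distinct token in each record."""
    for rid, text in records.items():
        for token in set(text.lower().split()):
            yield token, rid


def reduce_phase(pairs):
    """Group record ids by token and emit candidate pairs that share a token."""
    groups = defaultdict(set)
    for token, rid in pairs:
        groups[token].add(rid)
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))
    return candidates


def self_join(records, threshold):
    """Return (id1, id2, similarity) for candidate pairs meeting the threshold."""
    # Assumption: a record's set is its lowercase whitespace tokens.
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}
    result = []
    for r1, r2 in sorted(reduce_phase(map_phase(records))):
        sim = jaccard(token_sets[r1], token_sets[r2])
        if sim >= threshold:
            result.append((r1, r2, sim))
    return result


# Hypothetical sample records, used only for illustration.
records = {
    1: "efficient parallel set similarity joins",
    2: "parallel set similarity joins using mapreduce",
    3: "a comparison of approaches to data analysis",
}
pairs = self_join(records, 0.5)
print(pairs)
```

Records 1 and 2 share four of seven distinct tokens, so they are the only pair that meets the 0.5 threshold. Keying on every token replicates each record once per token, so the sketch is only meaningful for thresholds above zero and does nothing to limit replication or balance load, which are the concerns the abstract says the paper addresses.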

Citations

Publications citing this paper.
SHOWING 1-10 OF 318 CITATIONS, ESTIMATED 31% COVERAGE

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

  • 2017 IEEE 33rd International Conference on Data Engineering (ICDE)
  • 2017
CITES BACKGROUND & METHODS
HIGHLY INFLUENCED

Near neighbor join

  • 2014 IEEE 30th International Conference on Data Engineering
  • 2014
CITES BACKGROUND & METHODS
HIGHLY INFLUENCED

A Survey on Parallel Join Algorithms Using MapReduce on Hadoop

  • 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT)
  • 2019
CITES METHODS & BACKGROUND
HIGHLY INFLUENCED

Join Algorithms under Apache Spark: Revisited

  • ICCTA 2019
  • 2019
CITES METHODS & BACKGROUND
HIGHLY INFLUENCED

A comparison of most recent MapReduce joins algorithms

Sawsan Al-odibat
  • 2018
CITES METHODS & BACKGROUND
HIGHLY INFLUENCED

Supporting Similarity Queries in Apache AsterixDB

  • EDBT
  • 2018
CITES METHODS & BACKGROUND
HIGHLY INFLUENCED

CITATION STATISTICS

  • 50 Highly Influenced Citations

  • Averaged 31 Citations per year over the last 3 years

References

Publications referenced by this paper.
SHOWING 1-10 OF 11 REFERENCES

Efficient parallel set-similarity joins using MapReduce

R. Vernica, M. Carey, C. Li
  • Technical report
  • 2010
HIGHLY INFLUENTIAL

A comparison of approaches to large-scale data analysis

  • SIGMOD Conference
  • 2009
HIGHLY INFLUENTIAL

A Primitive Operator for Similarity Joins in Data Cleaning

  • 22nd International Conference on Data Engineering (ICDE'06)
  • 2006
HIGHLY INFLUENTIAL
