Corpus ID: 16322742

Optimizing Shuffle Performance in Spark

  • A. Davidson
  • Published 2013
  • Spark [6] is a cluster framework that performs in-memory computing, with the goal of outperforming disk-based engines like Hadoop [2]. As with other distributed data processing platforms, it is common to collect data in a many-to-many fashion, a stage traditionally known as the shuffle phase. In Spark, the shuffle phase has many sources of inefficiency that, once addressed, promise substantial performance improvements. In this paper, we identify the bottlenecks in the execution of the…
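
    To make the many-to-many pattern in the abstract concrete, here is a minimal single-process sketch of a hash-partitioned shuffle. It is an illustration only and not Spark's implementation: the function names (partition_for, map_side_write, reduce_side_fetch, shuffle_word_count), the CRC32 partitioner, and the word-count reduction are all assumptions introduced for this example.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key with a stable hash (hypothetical partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_side_write(records, num_reducers):
    """Each map task splits its output into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition_for(key, num_reducers)].append((key, value))
    return buckets


def reduce_side_fetch(all_map_outputs, reducer_id):
    """Each reducer reads its bucket from every map task and sums per key."""
    merged = defaultdict(int)
    for buckets in all_map_outputs:
        for key, value in buckets[reducer_id]:
            merged[key] += value
    return dict(merged)


def shuffle_word_count(map_inputs, num_reducers):
    """Run the map-side write, then the reduce-side fetch, for all tasks."""
    map_outputs = [map_side_write(records, num_reducers) for records in map_inputs]
    return [reduce_side_fetch(map_outputs, r) for r in range(num_reducers)]


if __name__ == "__main__":
    inputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
    for reducer_id, counts in enumerate(shuffle_word_count(inputs, 2)):
        print(reducer_id, counts)
```

    With M map tasks and R reducers, the sketch produces M × R buckets, and every reducer touches every map task's output. That all-to-all transfer is what makes the shuffle phase a natural place to look for bottlenecks.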


    Locality-based Partitioning for Spark
    • 1
    Shuffle phase optimization in spark
    • 1
    Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics
    • 7
    An Elastic Data Persisting Solution with High Performance for Spark
    • 2
    A Methodology for Spark Parameter Tuning
    • 20
    Magnet: Push-based Shuffle Service for Large-scale Data Processing


    Publications referenced by this paper.
    Understanding TCP incast throughput collapse in datacenter networks
    • 363
    TritonSort: A Balanced Large-Scale Sorting System
    • 76
    • Highly Influential
    C-Store: A Column-oriented DBMS
    • 1,119