In-Memory Indexed Caching for Distributed Data Processing

  • Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan S. Rellermeyer, Peter A. Boncz
  • Published 12 December 2021
  • Computer Science
  • 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup…
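To make the idea of an indexed in-memory cache concrete, here is a minimal Python sketch. It is not the paper's implementation, and the abstract above is truncated, so the details are assumptions: the class `IndexedRows` and its methods `append`, `lookup`, and `scan` are hypothetical names. The sketch keeps rows in a list and maintains a hash index on one column, so a key lookup touches only the matching rows instead of scanning the whole cache, and new rows can be appended without rebuilding the index.

```python
from collections import defaultdict
from typing import Any, Dict, List


class IndexedRows:
    """Toy in-memory row cache with a hash index on one column.

    Hypothetical illustration only; not the Indexed DataFrame API.
    """

    def __init__(self, key_column: str) -> None:
        self.key_column = key_column
        self.rows: List[Dict[str, Any]] = []
        # Maps a key value to the positions of all rows holding that key.
        self.index: Dict[Any, List[int]] = defaultdict(list)

    def append(self, row: Dict[str, Any]) -> None:
        """Add one row and update the index incrementally."""
        self.index[row[self.key_column]].append(len(self.rows))
        self.rows.append(row)

    def lookup(self, key: Any) -> List[Dict[str, Any]]:
        """Indexed lookup: visits only the rows that match the key."""
        return [self.rows[i] for i in self.index.get(key, [])]

    def scan(self, key: Any) -> List[Dict[str, Any]]:
        """Full scan for comparison: visits every cached row."""
        return [r for r in self.rows if r[self.key_column] == key]
```

Both `lookup` and `scan` return the same rows in insertion order; the difference is that `scan` does work proportional to the cache size, while `lookup` does work proportional to the number of matches.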



LocationSpark: A Distributed In-Memory Data Management System for Big Spatial Data

This work builds two new layers over Spark, namely a query scheduler and a query executor, and embeds an efficient spatial Bloom filter into LocationSpark's indexes to avoid unnecessary network communication overhead when processing overlapped spatial data.
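The summary above mentions a spatial Bloom filter used to skip unnecessary network communication. The sketch below shows only a plain (non-spatial) Bloom filter in Python, as a simplified stand-in for the general principle; it is not LocationSpark's filter, and the class name `BloomFilter`, its parameters, and the string partition keys are all assumptions. A node can consult such a filter locally: if the filter says a key is definitely absent from a remote partition, the request to that partition can be skipped.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter (hypothetical sketch, not a spatial filter)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple rather than compact.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))
```

A Bloom filter never reports a stored item as absent, so skipping a remote lookup on a negative answer is safe; a positive answer may be a false positive and still requires the real lookup.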

Indexing for Large Scale Data Querying Based on Spark SQL

This paper presents an indexing structure, built as a pluggable component of Spark SQL on Apache Spark, that enables programmers to load fine-grained data files of structured data into memory. It is flexible enough to load "hot data" into memory and to evict "cold data" out of memory.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Resilient Distributed Datasets (RDDs) are presented: a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.

Simba: Efficient In-Memory Spatial Analytics

Simba is a scalable and efficient in-memory spatial query processing and analytics system for big spatial data that extends the Spark SQL engine to support rich spatial queries and analytics through both SQL and the DataFrame API.

Workload characterization and optimization of TPC-H queries on Apache Spark

This paper uses the TPC-H benchmark as an optimization case study and gathers logs from many perspectives, such as the application, JVM, OS parameters, Spark configuration, and application code. Based on CPU characteristics, it introduces several JVM and OS parameter optimization approaches for accelerating Spark performance.

SnappyData: A Unified Cluster for Streaming, Transactions and Interactive Analytics

SnappyData is presented as the first unified engine capable of delivering analytics, transactions, and stream processing in a single integrated cluster by carefully marrying a big data computational engine with a scale-out transactional store.

Naiad: a timely dataflow system

It is shown that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining.

Riffle: optimized shuffle service for large-scale data analytics

Riffle is presented, an optimized shuffle service for big-data analytics frameworks that significantly improves I/O efficiency and scales to process petabytes of data.

Towards zero-overhead static and adaptive indexing in Hadoop

HAIL (Hadoop Aggressive Indexing Library) is a novel indexing approach for HDFS and Hadoop MapReduce that creates different clustered indexes over terabytes of data with minimal, often invisible costs and dramatically improves runtimes of several classes of MapReduce jobs.

Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Structured Streaming is a new high-level streaming API in Apache Spark, based on experience with Spark Streaming, that achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x.