Evaluating End-to-End Optimization for Data Analytics Applications in Weld

Shoumik Palkar, James J. Thomas, Deepak Narayanan, Pratiksha Thaker, Rahul Palamuttam, Parimarjan Negi, Anil Shanbhag, Malte Schwarzkopf, Holger Pirk, Saman P. Amarasinghe, Samuel Madden, Matei A. Zaharia. Proc. VLDB Endow.
Modern analytics applications use a diverse mix of libraries and functions. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at run time in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally.
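As a rough illustration of the kind of cross-library optimization described above (not Weld's actual IR or API; all names here are invented), a lazy wrapper can record elementwise operations from successive library calls and fuse them into a single traversal, avoiding one materialized intermediate array per call:

```python
# Hypothetical sketch of lazy operator fusion, in the spirit of Weld's
# approach to composing imperative libraries; names are illustrative.

class LazyVec:
    """Records elementwise ops and fuses them into one traversal."""

    def __init__(self, data, ops=()):
        self.data = data          # underlying list of numbers
        self.ops = list(ops)      # pending elementwise functions

    def map(self, fn):
        # No work happens here: we only extend the pipeline.
        return LazyVec(self.data, self.ops + [fn])

    def evaluate(self):
        # Single fused loop: each element flows through every op,
        # so no intermediate vector is ever materialized.
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

v = LazyVec([1.0, 2.0, 3.0])
fused = v.map(lambda x: x + 1).map(lambda x: x * 2)
print(fused.evaluate())  # [4.0, 6.0, 8.0]
```

The point of the sketch is the evaluation boundary: work is deferred until `evaluate()`, which is where a Weld-style runtime gets a whole-program view to optimize.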

Optimizing data-intensive computations in existing libraries with split annotations

Mozart implements a new technique called split annotations (SAs) that enables key data-movement optimizations over unmodified library functions, provides performance competitive with solutions that require rewriting libraries, and can sometimes outperform those systems by up to 2x by leveraging existing hand-optimized code.
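A minimal sketch of the split-annotation idea (illustrative only; Mozart's actual annotation language and parallel runtime differ): a function is annotated with which arguments can be split, and a small runtime runs the whole pipeline over cache-sized chunks of unmodified functions instead of passing full arrays between them:

```python
# Hypothetical sketch of split annotations: library functions stay
# unmodified; an annotation declares which argument positions can be
# split so a runtime can pipeline them chunk-by-chunk for locality.

def splittable(*split_args):
    def wrap(fn):
        fn._split_args = split_args  # record which positions are splittable
        return fn
    return wrap

@splittable(0)
def add_one(xs):
    return [x + 1 for x in xs]

@splittable(0)
def double(xs):
    return [x * 2 for x in xs]

def run_pipeline(funcs, data, chunk=2):
    # Execute the whole pipeline on each chunk before moving on,
    # so intermediates stay chunk-sized rather than full-length.
    out = []
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        for fn in funcs:
            piece = fn(piece)
        out.extend(piece)
    return out

print(run_pipeline([add_one, double], [1, 2, 3, 4]))  # [4, 6, 8, 10]
```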

Compilation-assisted performance acceleration for data analytics

This thesis presents KeyChain, a CPM implementation that identifies equivalent UDFs with low enough overhead that CPM can always be enabled, and a system that enables existing data processing systems to execute relational queries using network processors.

[Figure: Snippet from the Black Scholes options pricing benchmark implemented using Intel MKL]

This paper proposes a new technique called split annotations (SAs) that enables key data movement optimizations over unmodified library functions and implements a parallel runtime for SAs in a system called Mozart, which can accelerate workloads in libraries such as Intel MKL and Pandas by up to 15×.

Offload Annotations: Bringing Heterogeneous Computing to Existing Libraries and Workloads

This work presents a new approach called offload annotations (OAs) that enables heterogeneous GPU computing in existing workloads with few or no code modifications and matches the performance of handwritten heterogeneous implementations.
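The offload-annotation idea can be sketched as follows (a toy model, not the OA system's actual API; the backend functions and size threshold are made up): each CPU function is annotated with an accelerator equivalent, and a dispatcher picks a backend per call based on input size, since transfer overhead only pays off on large inputs:

```python
# Hypothetical sketch of offload annotations: a CPU function carries an
# annotation naming its "GPU" equivalent, and a tiny scheduler chooses a
# backend by input size. Threshold and backend names are invented.

OFFLOAD_THRESHOLD = 1_000_000  # assumed cutoff where transfer cost pays off

def offload(gpu_fn):
    def wrap(cpu_fn):
        def dispatch(data):
            # Large inputs go to the annotated accelerator version;
            # small inputs stay on the CPU to avoid transfer overhead.
            if len(data) >= OFFLOAD_THRESHOLD:
                return gpu_fn(data)
            return cpu_fn(data)
        return dispatch
    return wrap

def gpu_sum(data):          # stand-in for a real GPU kernel
    return ("gpu", sum(data))

@offload(gpu_sum)
def cpu_sum(data):
    return ("cpu", sum(data))

print(cpu_sum([1, 2, 3]))  # ('cpu', 6) -- small input stays on CPU
```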

Loop-Adaptive Execution in Weld

This thesis investigates the implementation and benefits of loop-adaptivity in Weld, a data-centric JIT-compiling data processing framework, and finds that it yields modest improvements, limited to higher scale factors.

Acorn: Aggressive Result Caching in Distributed Data Processing Frameworks

This paper introduces a judicious adaptation of predicate analysis on analyzed query plans that avoids unnecessary query optimization, and presents a UDF translator that transparently compiles UDFs from general purpose languages into native equivalents.

Modularis: Modular Data Analytics for Hardware, Software, and Platform Heterogeneity

Modularis is an execution layer for data analytics based on fine-grained, composable building blocks that are as generic and simple as possible; it is an order of magnitude faster on SQL-style analytics than a commonly used framework for generic data processing, and on par with a commercial cluster database.

HorsePower: Accelerating Database Queries for Advanced Data Analytics

This paper proposes an advanced analytical system, HorsePower, based on HorseIR, an array-based intermediate representation (IR) designed to translate conventional database queries, statistical languages, and mixes of the two into a common IR, allowing it to combine query optimization and compiler optimization techniques at an intermediate level of abstraction.

Optimizing end-to-end machine learning pipelines for model training

It is concluded that a holistic system design covering all tiers – programming abstraction, intermediate representation, and execution backend – is needed to overcome the scalability challenges of large-scale data analysis programs.

Jumpgate: automating integration of network connected accelerators

Jumpgate is presented, a system that simplifies integration of existing NCA code into data analytics systems, such as Apache Spark or Presto, and places most of the integration code into the analytics system, leaving NCA programmers to write only a couple hundred lines of code to integrate new NCAs.

A Common Runtime for High Performance Data Analysis

Weld is proposed, a runtime for data-intensive applications that optimizes across disjoint libraries and functions, using a common intermediate representation to capture the structure of diverse data-parallel workloads, including SQL, machine learning, and graph analytics.
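Weld's IR is built around parallel loops that merge values into builders (for example, a merger that combines elements with an associative operator). A toy evaluator for one such construct, simplified far beyond the real IR, might look like:

```python
# Toy evaluator for a Weld-like "for loop over a vector merging into a
# builder" construct; drastically simplified from the real IR.
import operator

def merger(op, init):
    # A builder that combines merged values with an associative op.
    return {"op": op, "value": init}

def merge(builder, x):
    builder["value"] = builder["op"](builder["value"], x)
    return builder

def for_loop(vec, builder, body):
    # body(builder, element) returns the updated builder.
    for x in vec:
        builder = body(builder, x)
    return builder

def result(builder):
    return builder["value"]

# Analogous to result(for(v, merger[+], |b, x| merge(b, x * x))):
# a fused sum of squares over the vector.
total = result(for_loop([1, 2, 3],
                        merger(operator.add, 0),
                        lambda b, x: merge(b, x * x)))
print(total)  # 14
```

Because every workload is expressed with the same loop-and-builder vocabulary, a runtime can fuse loops across library boundaries before executing anything.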

An Architecture for Compiling UDF-centric Workflows

A novel architecture for automatically compiling workflows of UDFs is described, and several optimizations are proposed that consider properties of the data, UDFs, and hardware together in order to generate different code on a case-by-case basis.

Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns

This paper introduces the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures and shows straightforward analyses that determine what data to distribute based on its usage.

Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware

This work presents Voodoo, a declarative intermediate algebra that abstracts detailed architectural properties of the hardware, such as multi- or many-core architectures, caches, and SIMD registers, without losing the ability to generate highly tuned code, and uses it to build an alternative backend for MonetDB, a popular open-source in-memory database.

Implicit Parallelism through Deep Language Embedding

It is argued that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis through improved programmer productivity, and a simplified API is proposed that provides proper support for nested collection processing and alleviates the need for certain second-order primitives through comprehensions.

Peeking into the optimization of data flow programs with MapReduce-style UDFs

This work demonstrates an optimizer for data flows that can reorder operators with MapReduce-style UDFs written in an imperative language, along with a job-submission client that allows users to peek step-by-step into each phase of the optimization process.

Main-memory scan sharing for multi-core CPUs

This work proposes a novel FullSharing scheme that allows all concurrent queries, when performing base-table I/O, to share the cache belonging to a given core, and uses lottery-scheduling techniques to ensure fairness and impose a hard upper bound on staging time to avoid starvation.

How to Architect a Query Compiler

This paper proposes to use a stack of multiple DSLs on different levels of abstraction with lowering in multiple steps to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems.

Efficiently Compiling Efficient Query Plans for Modern Hardware

This work presents a novel compilation strategy that translates a query into compact and efficient machine code using the LLVM compiler framework, integrates these techniques into the HyPer main-memory database system, and shows that this results in excellent query performance while requiring only modest compilation time.
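The core idea of such data-centric query compilation can be sketched in miniature (HyPer emits LLVM IR; this toy version generates Python source instead, and the plan shape is invented): a filter-plus-aggregate plan is translated into one tight loop with no operator boundaries, compiled once, and then run over the data:

```python
# Illustrative sketch of query compilation, not HyPer's actual design:
# a filter + aggregate plan becomes a single fused loop, "compiled"
# by exec-ing generated source, then executed over the input.

def compile_query(predicate_src, agg_init):
    # Generate fused, data-centric code for:
    #   SELECT sum(x) FROM t WHERE <predicate>
    src = f"""
def query(rows):
    acc = {agg_init}
    for x in rows:          # one pass, no per-operator materialization
        if {predicate_src}:
            acc += x
    return acc
"""
    namespace = {}
    exec(src, namespace)    # the "compilation" step
    return namespace["query"]

q = compile_query("x > 10", 0)
print(q([5, 12, 20, 3]))  # 32
```

The generated function is compiled once and reused, so per-tuple interpretation overhead disappears, which is the property the LLVM-based strategy exploits at machine-code level.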

Making Sense of Performance in Data Analytics Frameworks

It is found that CPU (and not I/O) is often the bottleneck, that improving network performance can reduce job completion time by a median of at most 2%, and that the causes of most stragglers can be identified.