A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics

@article{Thomas2018ACE,
  title={A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics},
  author={Anthony Thomas and Arun Kumar},
  journal={Proc. VLDB Endow.},
  year={2018},
  volume={11},
  pages={2168--2182}
}
The growing use of statistical and machine learning (ML) algorithms to analyze large datasets has given rise to new systems to scale such algorithms. But implementing new scalable algorithms in low-level languages is a painful process, especially for enterprise and scientific users. To mitigate this issue, a new breed of systems exposes high-level bulk linear algebra (LA) primitives that are scalable. By composing such LA primitives, users can write analysis algorithms in a higher-level…
BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite
TLDR
A unified big data and AI benchmark suite sheds new light on domain-specific hardware and software co-design: tailoring the system and architecture to the characteristics of the eight unified data motifs rather than to one or more applications case by case.
HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries
TLDR
HADAD is proposed, an extensible lightweight approach for optimizing hybrid complex analytics queries, based on a common abstraction that facilitates unified reasoning: a relational model endowed with integrity constraints that can be naturally and portably applied on top of pure LA and hybrid RA-LA platforms without modifying their internals.
BigDataBench: A Dwarf-based Big Data and AI Benchmark Suite
TLDR
This work comprehensively characterizes the benchmarks of seven workload types in BigDataBench 4.0, in addition to traditional benchmarks like SPECCPU, PARSEC and HPCC, in a hierarchical manner and drills down on five levels, using the Top-Down analysis from an architecture perspective.
Towards A Polyglot Framework for Factorized ML
TLDR
Experiments with real datasets show that Trinity is significantly faster than materialized execution (> 8x speedups in some cases), while being largely competitive with a prior single PL-focused Morpheus stack.
Towards A Polyglot Framework for Factorized ML (Information System Architectures)
TLDR
A novel information system architecture is proposed, Trinity, to enable factorized LA logic to be written only once and easily reused across many PLs/LA tools in one go, and to do this in an extensible and efficient manner without costly data copies.
Hybrid Evaluation for Distributed Iterative Matrix Computation
TLDR
This work proposes a hybrid evaluation to efficiently interleave full and incremental evaluation during the iterative process and employs a cost model to compare the overhead costs of two types of evaluations and a selective comparison mechanism to reduce the overhead incurred by comparison itself.
Formal semantics and high performance in declarative machine learning using Datalog
TLDR
It is shown that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics, by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones.
Optimizing end-to-end machine learning pipelines for model training
TLDR
It is concluded that a holistic system design that covers all tiers – programming abstraction, intermediate representation, and execution backend – to overcome the scalability challenges of large-scale data analysis programs is needed.
SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging
TLDR
Experiments with different real-world regression and classification datasets show that effective pruning and efficient sparse linear algebra render exact enumeration feasible, even for datasets with many features, correlations, and data sizes beyond single node memory.
Cerebro: A Layered Data Platform for Scalable Deep Learning
TLDR
This vision paper presents a first-of-its-kind data platform for scalable DL, Cerebro, inspired by lessons from the database world; it elevates the DL model selection process with higher-level APIs already inherent in practice and devises a series of novel multi-query optimization techniques to substantially raise resource efficiency.
...

References

SHOWING 1-10 OF 56 REFERENCES
Towards Linear Algebra over Normalized Data
TLDR
A new logical data type is introduced to represent normalized data and a framework of algebraic rewrite rules is devised to convert a large set of linear algebra operations over denormalized data into operations over normalized data to automatically "factorize" several popular ML algorithms.
GenBase: a complex analytics genomics benchmark
TLDR
A new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks and complex analytics (regression, singular value decomposition, etc.) is introduced.
Towards a unified architecture for in-RDBMS analytics
TLDR
This work proposes a unified architecture for in-database analytics that requires changes to only a few dozen lines of code to integrate a new statistical technique, and demonstrates the feasibility of this architecture by integrating several popular analytics techniques into two commercial and one open-source RDBMS.
SystemML: Declarative Machine Learning on Spark
TLDR
This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics.
Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML
TLDR
A systematic approach for combining task and data parallelism for large-scale machine learning on top of MapReduce and a novel cost-based optimization framework for automatically creating optimal parallel execution plans are presented.
The MADlib Analytics Library or MAD Skills, the SQL
TLDR
The MADlib project is introduced, including the background that led to its beginnings and the motivation for its open-source nature; an overview of the library's architecture and design patterns is provided, along with a description of various statistical methods in that context.
SystemML: Declarative machine learning on MapReduce
TLDR
This paper proposes SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment, and describes and empirically evaluates a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation.
Data Management in Machine Learning: Challenges, Techniques, and Systems
TLDR
This tutorial provides a comprehensive review of systems for advanced analytics, integrating ML algorithms and languages with existing data systems such as RDBMSs, and adapting data management-inspired techniques to new systems that target ML workloads.
Starfish: A Self-tuning System for Big Data Analytics
TLDR
Starfish is introduced, a self-tuning system for big data analytics that builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop.
Distributed Machine Learning - but at what COST?
TLDR
The results indicate that while being able to robustly scale with increasing data set size, current generation data flow systems are surprisingly inefficient at training machine learning models and need substantial resources to come within reach of the performance of single machine libraries.
...