Corpus ID: 53038883

Numpywren: Serverless Linear Algebra

@article{Shankar2018NumpywrenSL,
  title={Numpywren: Serverless Linear Algebra},
  author={Vaishaal Shankar and Karl Krauth and Qifan Pu and Eric Jonas and Shivaram Venkataraman and Ion Stoica and Benjamin Recht and Jonathan Ragan-Kelley},
  journal={ArXiv},
  year={2018},
  volume={abs/1810.09679}
}
Linear algebra operations are widely used in scientific computing and machine learning applications. However, it is challenging for scientists and data analysts to run linear algebra at scales beyond a single machine. Traditional approaches either require access to supercomputing clusters, or impose configuration and cluster management challenges. In this paper we show how the disaggregation of storage and compute resources in so-called "serverless" environments, combined with compute-intensive… 
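The disaggregated model described above is easy to illustrate in miniature: every tile-level task is a stateless function that reads its input tiles from a shared object store and writes its output tile back, so compute scales independently of where the matrix lives. The sketch below is a stand-in under stated assumptions, not numpywren's API: a Python dict plays the role of S3, and a serial loop plays the role of concurrent Lambda invocations.

import numpy as np

# A dict stands in for an object store such as S3 (hypothetical stand-in).
store = {}

def put_tile(key, tile):
    store[key] = tile.copy()

def get_tile(key):
    return store[key]

def matmul_tile_task(i, j, k_tiles):
    # Stateless worker: reads input tiles by key, accumulates, writes result.
    acc = None
    for k in range(k_tiles):
        a = get_tile(("A", i, k))
        b = get_tile(("B", k, j))
        acc = a @ b if acc is None else acc + a @ b
    put_tile(("C", i, j), acc)

def blocked_matmul(A, B, bs):
    n = A.shape[0]
    t = n // bs
    for i in range(t):
        for k in range(t):
            put_tile(("A", i, k), A[i*bs:(i+1)*bs, k*bs:(k+1)*bs])
            put_tile(("B", i, k), B[i*bs:(i+1)*bs, k*bs:(k+1)*bs])
    # Each (i, j) task touches only its own tiles, so all t*t tasks
    # could run as independent serverless invocations.
    for i in range(t):
        for j in range(t):
            matmul_tile_task(i, j, t)
    return np.block([[get_tile(("C", i, j)) for j in range(t)] for i in range(t)])

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(blocked_matmul(A, B, 4), A @ B)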
HeAT - a Distributed and GPU-accelerated Tensor Framework for Data Analytics
TLDR: HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API, is introduced, which achieves speedups of up to two orders of magnitude.
Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication
TLDR: This paper proposes a rateless fountain coding strategy that achieves the best of both worlds: it is proved that its latency is asymptotically equal to ideal load balancing, and it performs asymptotically zero redundant computations.
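The load-balancing claim has a compact demonstration. The sketch below deliberately swaps the paper's rateless fountain code for a plain dense random linear code, a simplification that keeps the key property: any m received encoded products determine Ax (recovered here with a direct solve rather than the paper's peeling decoder).

import numpy as np

rng = np.random.default_rng(0)
m, n, workers = 6, 4, 10               # m rows of A, encoded across more workers

A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Encode: each worker gets one random linear combination of the rows of A.
G = rng.standard_normal((workers, m))  # coding matrix (dense, for simplicity)
encoded_rows = G @ A                   # one encoded row per worker

# Workers compute their encoded inner product; stragglers never respond.
results = encoded_rows @ x             # worker i would return results[i]
arrived = rng.permutation(workers)[:m] # the first m responses, in arrival order

# Decode: solve G_sub @ (A x) = received results for the m entries of Ax.
y = np.linalg.solve(G[arrived], results[arrived])
assert np.allclose(y, A @ x)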
Legate NumPy: accelerated and distributed array computing
TLDR: Legate is introduced, a drop-in replacement for NumPy that requires only a single-line code change and can scale up to an arbitrary number of GPU-accelerated nodes, achieving speed-ups of up to 10X on 1280 CPUs and 100X on 256 GPUs.
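Per the paper's description, the single-line change is the import statement; the snippet below restates that claim (module path as given in the paper; the project's array API was later renamed, so current releases may differ):

import legate.numpy as np   # the advertised one-line change; was: import numpy as np

x = np.random.rand(10000, 10000)
y = x @ x.T                 # dispatched by Legate across the available CPUs/GPUs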
Harnessing the Power of Serverless Runtimes for Large-Scale Optimization
TLDR: This work builds a master-worker setup using AWS Lambda as the source of workers and implements a parallel optimization algorithm to solve a regularized logistic regression problem, showing that relative speedups up to 256 workers and efficiencies above 70% up to 64 workers can be expected.
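A master-worker pattern of this kind can be sketched with stock AWS tooling. In the snippet below, the driver fans per-partition gradient evaluations out to a hypothetical Lambda function named grad-worker and reduces the replies; the function name, payload schema, and worker-side code are all assumptions, and only boto3's standard invoke call is real.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3
import numpy as np

lam = boto3.client("lambda")

def remote_gradient(part_id, weights):
    # Synchronously invoke one worker; the payload contract is hypothetical.
    resp = lam.invoke(
        FunctionName="grad-worker",
        InvocationType="RequestResponse",
        Payload=json.dumps({"part": part_id, "w": weights.tolist()}),
    )
    return np.array(json.load(resp["Payload"])["grad"])

def full_gradient(weights, n_parts=64):
    # Fan out to n_parts stateless workers, then reduce on the master.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        grads = pool.map(lambda p: remote_gradient(p, weights), range(n_parts))
    return sum(grads) / n_parts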
Serverless Elastic Exploration of Unbalanced Algorithms
TLDR: This work shows that, with a simple serverless executor pool abstraction, one can achieve a better cost-performance trade-off than a Spark cluster of static size and large EC2 VMs, providing the first concrete evidence that highly parallel, irregular workloads can be efficiently executed using purely stateless functions.
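The executor-pool abstraction referred to is essentially a futures interface whose workers happen to be stateless invocations. A minimal local stand-in (assuming nothing about the paper's system, with ProcessPoolExecutor playing the serverless backend) shows why it suits unbalanced work: tasks of wildly different cost are simply submitted, and the pool absorbs the skew.

from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # Deliberately unbalanced recursion: subtree sizes vary wildly per input.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def elastic_map(tasks, max_workers=8):
    # Executor-pool abstraction: submit every task; the pool balances load.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fib, t) for t in tasks]
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(elastic_map([5, 25, 10, 28, 1, 27]))  # skewed costs, one interface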
Wukong: a scalable and locality-enhanced framework for serverless parallel computing
TLDR: This work describes the implementation and deployment of a new serverless parallel framework, called Wukong, on AWS Lambda, showing that Wukong achieves near-ideal scalability, executes parallel computation jobs up to 68.17X faster, reduces network I/O by multiple orders of magnitude, and achieves 92.96% tenant-side cost savings compared to numpywren.
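Wukong's headline technique, decentralized locality-aware scheduling, can be caricatured in a few lines: when a task completes, its worker checks which dependents became ready and runs one in place, keeping the intermediate result local instead of bouncing through a central scheduler. The toy below illustrates only that idea and is not Wukong's design.

# Toy DAG: task -> list of dependencies, plus the work each task performs.
dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
work = {"a": lambda _: 1, "b": lambda _: 2,
        "c": lambda inp: inp["a"] + inp["b"], "d": lambda inp: inp["c"] * 10}

done = {}

def run(task):
    done[task] = work[task]({d: done[d] for d in dag[task]})
    # Worker-side scheduling: immediately run a dependent that became ready,
    # so the freshly produced value never leaves this worker.
    for t, deps in dag.items():
        if t not in done and all(d in done for d in deps):
            run(t)

for root in [t for t, deps in dag.items() if not deps]:
    if root not in done:
        run(root)
print(done)   # {'a': 1, 'b': 2, 'c': 3, 'd': 30}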
Benchmarking Parallelism in FaaS Platforms
Exploiting Serverless Runtimes for Large-Scale Optimization
TLDR: This work implements a parallel optimization algorithm for solving a regularized logistic regression problem and uses AWS Lambda for the compute-intensive work, showing that relative speedups up to 256 workers and efficiencies above 70% up to 64 workers can be expected.
rFaaS: Enabling High Performance Serverless with RDMA and Decentralization
The need for high performance is present in many computing platforms, from batch-managed and scientific-oriented supercomputers to general-purpose cloud platforms. At the same time, data centers and…
…

References

Showing 1-10 of 32 references
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
TLDR: It is demonstrated through experimental results on the Cray XT5 Kraken system that the DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.
MadLINQ: large-scale distributed matrix computation for the cloud
TLDR: The design and implementation of MadLINQ is described; the system outperforms current state-of-the-art systems by employing two key techniques: exploiting extra parallelism using fine-grained pipelining, and efficient on-demand failure recovery using a distributed fault-tolerant execution engine.
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
TLDR: Ernest, a performance prediction framework for large-scale analytics, is presented; evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
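Ernest's model, as described in the paper, is a small non-negative least-squares fit: runtime for input scale s on m machines is modeled as x0 + x1*(s/m) + x2*log(m) + x3*m, trained on a few short runs at small scale. A sketch with made-up training points:

import numpy as np
from scipy.optimize import nnls

def features(s, m):
    # Ernest's feature vector: constant, parallel work, tree-reduce, per-machine cost.
    return [1.0, s / m, np.log(m), float(m)]

# Hypothetical training runs: (input fraction, machines, measured seconds).
runs = [(0.125, 2, 40.0), (0.125, 4, 24.0), (0.25, 4, 42.0), (0.25, 8, 28.0)]

A = np.array([features(s, m) for s, m, _ in runs])
t = np.array([t for _, _, t in runs])
coef, _ = nnls(A, t)                  # non-negative least squares fit

def predict(s, m):
    return float(np.dot(coef, features(s, m)))

print(predict(1.0, 64))               # extrapolate to the full job at scale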
ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance
TLDR: The content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers, are outlined, and alternative approaches to mathematical libraries are suggested, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.
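ScaLAPACK's defining layout choice is the 2D block-cyclic distribution, under which the owner of any global block reduces to two mod operations. The function below restates that standard mapping (not ScaLAPACK source) for a Pr x Pc process grid:

def block_cyclic_owner(I, J, Pr, Pc):
    # Global block (I, J) lives on process (I mod Pr, J mod Pc);
    # consecutive blocks therefore cycle around the process grid.
    return (I % Pr, J % Pc)

# A 4x4 grid of blocks on a 2x2 process grid: each process owns a
# scattered quarter of the matrix, which balances load as a
# factorization's active region shrinks.
for I in range(4):
    print([block_cyclic_owner(I, J, 2, 2) for J in range(4)])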
Communication avoiding and overlapping for numerical linear algebra
TLDR: UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication, is employed; communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
Elemental: A New Framework for Distributed Memory Dense Matrix Computations
TLDR: Preliminary performance results show the new solution achieves competitive, if not superior, performance on large clusters and provides a simple yet effective alternative to traditional MPI-based approaches.
Minimizing Communication in Numerical Linear Algebra
TLDR: This work generalizes a lower bound on the amount of communication needed to perform dense n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
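For orientation, the classical bound being generalized can be stated in one line: on a machine with fast memory of size M, conventional O(n^3) matrix multiplication must move

% Words moved between fast and slow memory for classical matrix multiply;
% the cited work extends this form to LU, Cholesky, LDL^T, QR, and
% eigenvalue/singular-value algorithms.
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right),
\qquad \text{and more generally} \qquad
W = \Omega\!\left(\frac{\#\text{flops}}{\sqrt{M}}\right).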
Dryad: distributed data-parallel programs from sequential building blocks
TLDR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
SystemML: Declarative Machine Learning on Spark
TLDR: This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics.
Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms
TLDR: This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms, and proposes a number of novel system-level optimizations.
…