# Numpywren: Serverless Linear Algebra

```bibtex
@article{Shankar2018NumpywrenSL,
  title   = {Numpywren: Serverless Linear Algebra},
  author  = {Vaishaal Shankar and Karl Krauth and Qifan Pu and Eric Jonas and Shivaram Venkataraman and Ion Stoica and Benjamin Recht and Jonathan Ragan-Kelley},
  journal = {ArXiv},
  year    = {2018},
  volume  = {abs/1810.09679}
}
```

Linear algebra operations are widely used in scientific computing and machine learning applications. However, it is challenging for scientists and data analysts to run linear algebra at scales beyond a single machine. Traditional approaches either require access to supercomputing clusters, or impose configuration and cluster management challenges. In this paper we show how the disaggregation of storage and compute resources in so-called "serverless" environments, combined with compute-intensive…
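
The abstract's central idea, disaggregating storage from compute, can be illustrated with a toy sketch. This is not numpywren's actual API; an in-memory dict stands in for an object store such as S3, and each task is stateless, so it could in principle run as a separate serverless function invocation:

```python
import numpy as np

# Hypothetical sketch (not numpywren's API): stateless tasks read input
# tiles from a shared object store, compute, and write results back, so
# workers hold no state between invocations.
object_store = {}  # stand-in for S3: key -> (bytes, shape)

def put_tile(key, tile):
    object_store[key] = (tile.tobytes(), tile.shape)

def get_tile(key):
    buf, shape = object_store[key]
    return np.frombuffer(buf).reshape(shape)

def matmul_task(i, j, k):
    """One stateless worker invocation: C[i,j] += A[i,k] @ B[k,j]."""
    partial = get_tile(("A", i, k)) @ get_tile(("B", k, j))
    if ("C", i, j) in object_store:
        partial = partial + get_tile(("C", i, j))
    put_tile(("C", i, j), partial)

# Drive a 2x2 blocked multiply; all state lives in the object store.
n, bs = 4, 2
A, B = np.arange(16.).reshape(n, n), np.eye(n)
for i in range(2):
    for j in range(2):
        put_tile(("A", i, j), A[i*bs:(i+1)*bs, j*bs:(j+1)*bs])
        put_tile(("B", i, j), B[i*bs:(i+1)*bs, j*bs:(j+1)*bs])
for i in range(2):
    for j in range(2):
        for k in range(2):
            matmul_task(i, j, k)
C = np.block([[get_tile(("C", 0, 0)), get_tile(("C", 0, 1))],
              [get_tile(("C", 1, 0)), get_tile(("C", 1, 1))]])
assert np.allclose(C, A @ B)
```

Because every intermediate lives in storage rather than in worker memory, the scheduler is free to launch, kill, or retry any task at any time, which is the property the paper exploits.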


## 81 Citations

HeAT - a Distributed and GPU-accelerated Tensor Framework for Data Analytics

- Computer Science, ArXiv
- 2020

HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API, is introduced; it achieves speedups of up to two orders of magnitude.

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

- Computer Science, Proc. ACM Meas. Anal. Comput. Syst.
- 2019

This paper proposes a rateless fountain coding strategy that achieves the best of both worlds: it is proved that its latency is asymptotically equal to that of ideal load balancing and that it performs asymptotically zero redundant computations.

Legate NumPy: accelerated and distributed array computing

- Computer Science, SC
- 2019

Legate is introduced, a drop-in replacement for NumPy that requires only a single-line code change, scales to an arbitrary number of GPU-accelerated nodes, and achieves speed-ups of up to 10X on 1280 CPUs and 100X on 256 GPUs.

Harnessing the Power of Serverless Runtimes for Large-Scale Optimization

- Computer Science, ArXiv
- 2019

This work builds a master-worker setup using AWS Lambda as the source of workers, implements a parallel optimization algorithm to solve a regularized logistic regression problem, and shows that relative speedups up to 256 workers and efficiencies above 70% up to 64 workers can be expected.

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

- Computer Science, Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems
- 2020

This work proposes a rateless fountain coding strategy to create linear combinations of the m rows of the matrix and assign these encoded rows to different worker nodes, which achieves optimal latency and performs zero redundant computations asymptotically.
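
The coded-computation idea behind both rateless-coding entries can be sketched in a few lines. This is an illustrative simplification: dense random linear combinations stand in for the paper's actual fountain (LT) code, but the principle is the same, since workers compute inner products of *encoded* rows with x, and the master decodes A @ x from any m finished results, so stragglers can simply be ignored:

```python
import numpy as np

# Illustrative coded matrix-vector multiply (dense random coding, not
# the paper's LT code): encode with redundancy, decode from any m results.
rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# m_enc > m coded rows, each a random linear combination of A's rows.
m_enc = 9
G = rng.standard_normal((m_enc, m))   # generator matrix
coded_rows = G @ A                    # row i is sum_j G[i,j] * A[j]

# Each "worker" computes one coded inner product; suppose the last
# (m_enc - m) workers straggle and never return.
results = coded_rows @ x
finished = results[:m]                # any m results suffice
b = np.linalg.solve(G[:m], finished)  # decode: recover A @ x

assert np.allclose(b, A @ x)
```

With a true rateless code, the encoded rows are sparse and decoding is near-linear time, which is what makes the asymptotically-zero-redundancy claim possible.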

Serverless Elastic Exploration of Unbalanced Algorithms

- Computer Science, 2020 IEEE 13th International Conference on Cloud Computing (CLOUD)
- 2020

This work shows that with a simple serverless executor pool abstraction one can achieve a better cost-performance trade-off than a Spark cluster of static size and large EC2 VMs, providing the first concrete evidence that highly-parallel, irregular workloads can be efficiently executed using purely stateless functions.

Wukong: a scalable and locality-enhanced framework for serverless parallel computing

- Computer Science, SoCC
- 2020

This work describes the implementation and deployment of the new serverless parallel framework, called Wukong, on AWS Lambda, and shows that Wukong achieves near-ideal scalability, executes parallel computation jobs up to 68.17X faster, reduces network I/O by multiple orders of magnitude, and achieves 92.96% tenant-side cost savings compared to numpywren.

Exploiting Serverless Runtimes for Large-Scale Optimization

- Computer Science, 2019 IEEE 12th International Conference on Cloud Computing (CLOUD)
- 2019

This work implements a parallel optimization algorithm for solving a regularized logistic regression problem, and uses AWS Lambda for the compute-intensive work, showing that relative speedups up to 256 workers and efficiencies above 70% up to 64 workers can be expected.
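
The master-worker pattern described in these two serverless-optimization entries can be sketched as follows. This is a hypothetical sketch, with plain Python calls standing in for AWS Lambda invocations: each worker computes the gradient of the logistic loss on its data shard, and the master sums the partials, adds the L2 regularizer, and takes a gradient step:

```python
import numpy as np

# Hypothetical master-worker sketch for regularized logistic regression;
# in the paper's setup each partial_grad call would be a Lambda invocation.
rng = np.random.default_rng(1)
N, d, n_workers = 200, 5, 4
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)   # linearly separable labels
lam = 0.1                            # L2 regularization strength

def partial_grad(w, Xs, ys):
    """Gradient of the summed logistic loss on one shard (no regularizer)."""
    p = 1.0 / (1.0 + np.exp(-Xs @ w))
    return Xs.T @ (p - ys)

shards = np.array_split(np.arange(N), n_workers)
w = np.zeros(d)
for _ in range(200):
    grads = [partial_grad(w, X[idx], y[idx]) for idx in shards]  # scatter
    g = sum(grads) / N + lam * w                                 # reduce
    w -= 0.5 * g                                                 # step

train_acc = np.mean((X @ w > 0) == (y > 0.5))
```

Only the compute-intensive gradient evaluations go to the stateless workers; the small reduce-and-step loop stays on the master, which matches the division of labor these papers describe.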

rFaaS: Enabling High Performance Serverless with RDMA and Decentralization

- Computer Science
- 2021

The need for high performance is present in many computing platforms, from batch-managed and scientific-oriented supercomputers to general-purpose cloud platforms. At the same time, data centers and…

## References

Showing 1–10 of 32 references

Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA

- Computer Science, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum
- 2011

It is demonstrated through experimental results on the Cray XT5 Kraken system that the DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures.

MadLINQ: large-scale distributed matrix computation for the cloud

- Computer Science, EuroSys '12
- 2012

The design and implementation of MadLINQ is described and the system outperforms current state-of-the-art systems by employing two key techniques: exploiting extra parallelism using fine-grained pipelining and efficient on-demand failure recovery using a distributed fault-tolerant execution engine.

Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics

- Computer Science, NSDI
- 2016

Ernest, a performance prediction framework for large-scale analytics, is presented; evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.

ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance

- Computer Science, Proceedings of the 1996 ACM/IEEE Conference on Supercomputing
- 1996

The content and performance of ScaLAPACK, a collection of mathematical software for linear algebra computations on distributed memory computers, are outlined, and alternative approaches to mathematical libraries are suggested, explaining how ScaLAPACK could be integrated into efficient and user-friendly distributed systems.

Communication avoiding and overlapping for numerical linear algebra

- Computer Science, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
- 2012

UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication, is employed; communication avoidance and overlap are shown to provide a cumulative benefit as core counts scale, with results on over 24K cores of a Cray XE6 system.

Elemental: A New Framework for Distributed Memory Dense Matrix Computations

- Computer Science, TOMS
- 2013

Preliminary performance results show the new solution achieves competitive, if not superior, performance on large clusters and offers a simple yet effective alternative to traditional MPI-based approaches.

Minimizing Communication in Numerical Linear Algebra

- Computer Science, SIAM J. Matrix Anal. Appl.
- 2011

This work generalizes a lower bound on the amount of communication needed to perform dense, n-by-n matrix multiplication using the conventional O(n^3) algorithm to a much wider variety of algorithms, including LU factorization, Cholesky factorization, LDL^T factorization, QR factorization, the Gram–Schmidt algorithm, and algorithms for eigenvalues and singular values.
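
As commonly stated in this line of work, the bound says that an algorithm performing G "classical" flops with a fast memory of size M must move on the order of G / sqrt(M) words; the notation here (W for words moved) is ours, not the entry's:

```latex
W \;=\; \Omega\!\left(\frac{G}{\sqrt{M}}\right),
\qquad
W_{\text{matmul}} \;=\; \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right)
```

For n-by-n matrix multiplication G = O(n^3), which recovers the classical matmul bound the entry generalizes.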

Dryad: distributed data-parallel programs from sequential building blocks

- Computer Science, EuroSys '07
- 2007

The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

SystemML: Declarative Machine Learning on Spark

- Computer Science, Proc. VLDB Endow.
- 2016

This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics.

Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

- Computer Science, IEEE Transactions on Parallel and Distributed Systems
- 2017

This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms, and proposes a number of novel system-level optimizations.