Avoiding communication in sparse matrix computations

@article{Demmel2008AvoidingCI,
  title={Avoiding communication in sparse matrix computations},
  author={James Demmel and Mark Hoemmen and Marghoob Mohiyuddin and Katherine A. Yelick},
  journal={2008 IEEE International Symposium on Parallel and Distributed Processing},
  year={2008},
  pages={1-12}
}
  • J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick
  • Published 14 April 2008
  • Computer Science
  • 2008 IEEE International Symposium on Parallel and Distributed Processing
The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the "matrix powers kernel" [x, Ax, A^2x, ..., A^kx], and show that by organizing computations around this kernel, we can…
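To make the kernel concrete, the following is a minimal serial sketch (Python/SciPy; the matrix, vector, and k are illustrative placeholders, not the paper's code) of what [x, Ax, A^2x, ..., A^kx] computes. The naive loop below reads all of A once per product, which is exactly the communication the reorganized kernel avoids.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers_kernel(A, x, k):
    """Return the basis vectors [x, A@x, A^2@x, ..., A^k@x].

    Naive version: each of the k products streams all of A through the
    memory hierarchy (or across the network), so communication grows
    linearly with k; the communication-avoiding reorganization reads the
    needed parts of A roughly once for all k products.
    """
    vectors = [x]
    for _ in range(k):
        vectors.append(A @ vectors[-1])
    return vectors

# Illustrative example with a small random sparse matrix.
A = sp.random(1000, 1000, density=1e-3, format="csr", random_state=0)
x = np.ones(1000)
basis = matrix_powers_kernel(A, x, k=8)   # 9 vectors: x, Ax, ..., A^8 x
```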

Citations

Minimizing communication in sparse matrix solvers
TLDR
This work reorganizes the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizes the rest of the algorithm accordingly, resulting in a new variant of GMRES that achieves speedups of up to 4.3x over standard GMRES without sacrificing convergence rate or numerical stability.
Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication
TLDR
This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication and presents new communication-avoiding algorithms based on a 1D decomposition, called 1.5D.
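For intuition, the 1D block-row decomposition that the 1.5D algorithms start from can be emulated serially; the processor count, sizes, and names below are illustrative, not the paper's code. Each hypothetical processor owns a block row of the sparse matrix A and computes the matching block row of C = A·B; the 1.5D variants additionally replicate data across a small factor of processor layers to reduce the communication of the dense operand.

```python
import numpy as np
import scipy.sparse as sp

P = 4                                    # hypothetical processor count
n, d = 1200, 16                          # sparse A is n x n, dense B is n x d
A = sp.random(n, n, density=1e-3, format="csr", random_state=1)
B = np.random.default_rng(1).standard_normal((n, d))

row_blocks = np.array_split(np.arange(n), P)   # 1D block-row ownership of A and C
C_blocks = []
for owned in row_blocks:
    A_local = A[owned, :]                # block row held by this "processor"
    # In a distributed 1D algorithm this product requires fetching the rows of
    # B matching A_local's nonzero columns; 1.5D replication shrinks that cost.
    C_blocks.append(A_local @ B)

C = np.vstack(C_blocks)
assert np.allclose(C, A @ B)             # same result as the unpartitioned product
```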
Communication-optimal iterative methods
TLDR
This work reorganizes the sparse matrix kernel to compute a set of matrix-vector products at once and reorganizes the rest of the algorithm accordingly, reducing both single-node and network communication.
Communication lower bounds and optimal algorithms for numerical linear algebra*†
TLDR
This paper describes lower bounds on communication in linear algebra, including bounds for Strassen-like algorithms and for iterative methods, in particular Krylov subspace methods applied to sparse matrices.
On parallelizing matrix multiplication by the column-row method
TLDR
This work presents a surprisingly simple method for "consistent" parallel processing of sparse outer products (column-row vector products) over several processors, in a communication-avoiding setting where each processor has a copy of the input.
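The column-row method referred to here expresses the product as a sum of outer products, one per column of the left factor; each outer product can be assigned to a different processor, and with a copy of the inputs on every processor no communication is needed until the partial results are combined. A minimal dense sketch (names and sizes illustrative):

```python
import numpy as np

def column_row_product(A, B):
    """Compute A @ B as a sum of outer products: sum over k of A[:, k] * B[k, :]."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for j in range(k):
        # Each outer product is an independent task; partial C's are summed at the end.
        C += np.outer(A[:, j], B[j, :])
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 7)), rng.standard_normal((7, 3))
assert np.allclose(column_row_product(A, B), A @ B)
```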
Enlarged Krylov Subspace Methods and Preconditioners for Avoiding Communication
TLDR
This thesis presents a communication-avoiding ILU0 preconditioner for solving large systems of linear equations with iterative Krylov subspace methods, and introduces a new approach for reducing communication in Krylov subspace methods.
Chapter 9: Communication Avoiding (CA) and Other Innovative Algorithms
TLDR
The lower bound technique is extended to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if the authors can do better.
A Communication-Avoiding, Hybrid-Parallel, Rank-Revealing Orthogonalization Method
  • M. Hoemmen
  • Computer Science
    2011 IEEE International Parallel & Distributed Processing Symposium
  • 2011
TLDR
The Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and less data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization.
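The structure of TSQR can be sketched as a two-level reduction built from ordinary dense QR factorizations: factor each block row locally, stack the small R factors, and factor them again. The block count and sizes below are illustrative; the actual algorithm also assembles the implicit Q factor and generalizes to deeper reduction trees.

```python
import numpy as np

def tsqr_r_factor(A, num_blocks=4):
    """Two-level TSQR sketch: return the R factor of a tall-skinny matrix A.

    Each block row is factored independently (one local QR per 'processor');
    only the small R factors are combined, which is where the savings in
    messages and data movement over a monolithic Householder QR come from.
    """
    blocks = np.array_split(A, num_blocks, axis=0)
    local_Rs = [np.linalg.qr(block, mode="r") for block in blocks]
    return np.linalg.qr(np.vstack(local_Rs), mode="r")

A = np.random.default_rng(2).standard_normal((4000, 10))   # tall and skinny
R = tsqr_r_factor(A)
R_ref = np.linalg.qr(A, mode="r")
# R is unique only up to the signs of its rows, so compare R^T R = A^T A instead.
assert np.allclose(R.T @ R, R_ref.T @ R_ref)
```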
Exploiting dense substructures for fast sparse matrix vector multiplication
TLDR
An effective density measure is proposed that could be used for method selection, adding to the variety of options for an auto-tuned, optimized SpMV kernel that can exploit sparse matrix properties and hardware attributes for high performance.
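One concrete way to exploit dense substructures, sketched here with SciPy's stock formats rather than the paper's kernel, is to store a matrix whose nonzeros cluster into small dense blocks in BSR (block CSR) format, which amortizes index storage and enables register-level blocking; the block size and matrix below are illustrative.

```python
import numpy as np
import scipy.sparse as sp

n, bs = 4096, 4
# Illustrative matrix whose nonzeros occur in dense bs x bs blocks: a sparse
# block pattern expanded by a dense bs x bs block of ones.
pattern = sp.random(n // bs, n // bs, density=4e-3, format="csr", random_state=3)
A_csr = sp.kron(pattern, np.ones((bs, bs)), format="csr")
A_bsr = A_csr.tobsr(blocksize=(bs, bs))   # one stored index per block, not per nonzero

x = np.ones(n)
# Same SpMV, different storage; an auto-tuner can pick BSR (and the block size)
# when a density measure indicates the blocks are mostly full.
assert np.allclose(A_csr @ x, A_bsr @ x)
```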
...

References

SHOWING 1-10 OF 33 REFERENCES
Avoiding Communication in Computing Krylov Subspaces
TLDR
This paper presents just one part of this new formulation of Krylov subspace methods, namely computing a basis of the Krylov subspace spanned by [x, Ax, A^2x, ..., A^kx], and shows how to avoid latency and bandwidth costs for this kernel as well.
Automatic performance tuning of sparse matrix kernels
TLDR
An automated system to generate highly efficient, platform-adapted implementations of sparse matrix kernels is described; it extends SPARSITY to support tuning for a variety of common non-zero patterns arising in practice, and for additional kernels like sparse triangular solve (SpTS) and computation of A^T A·x and A^ρ·x.
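For context, the A^T A·x kernel mentioned here can always be evaluated as two back-to-back sparse matrix-vector products without forming A^T A; the tuned implementations in the cited work fuse the two passes so each row of A is streamed through memory only once. A trivial SciPy baseline (sizes illustrative):

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=1e-3, format="csr", random_state=8)
x = np.ones(3000)

y = A.T @ (A @ x)                     # two SpMVs; a tuned kernel fuses the two passes
assert np.allclose(y, (A.T @ A) @ x)  # same result as explicitly forming A^T A
```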
A PARALLEL VARIANT OF GMRES(m)
TLDR
It is shown, on real-world problems, that the modGMRES(m) method can yield a considerable gain in time per iteration, especially on large processor grids where the time spent in communication in the modified Gram-Schmidt (MGS) process may be significant.
Rescheduling for Locality in Sparse Matrix Computations
TLDR
An algorithm for tiling at runtime, called serial sparse tiling, is presented and used to build a runtime-tiled version of sparse Gauss-Seidel, which is tested on 4 different architectures where it exhibits speedups of up to 2.7x.
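For reference, the computation being tiled is a sequence of Gauss-Seidel sweeps over a sparse matrix; a plain, untiled sketch follows (matrix and sweep count are illustrative). Serial sparse tiling reorders this work so that all sweeps over a cache-sized tile of unknowns finish before the next tile is touched.

```python
import numpy as np
import scipy.sparse as sp

def gauss_seidel_sweeps(A, b, x, num_sweeps):
    """Untiled Gauss-Seidel: every sweep streams all of A through the cache.

    Serial sparse tiling instead partitions the unknowns at runtime and runs
    all sweeps over one tile (plus the halo it depends on) while that tile is
    cache-resident.
    """
    A = sp.csr_matrix(A)
    diag = A.diagonal()
    for _ in range(num_sweeps):
        for i in range(A.shape[0]):
            lo, hi = A.indptr[i], A.indptr[i + 1]
            row_dot = A.data[lo:hi] @ x[A.indices[lo:hi]]
            x[i] += (b[i] - row_dot) / diag[i]   # in-place Gauss-Seidel update
    return x

n = 500
A = sp.random(n, n, density=5e-3, format="csr", random_state=4)
A = A + A.T + 10.0 * sp.eye(n)   # symmetric and diagonally dominant (illustrative)
x = gauss_seidel_sweeps(A, np.ones(n), np.zeros(n), num_sweeps=5)
```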
Implicit and explicit optimizations for stencil computations
TLDR
Several optimizations are examined on both the conventional cache-based memory systems of the Itanium 2, Opteron, and Power5 and the heterogeneous multicore design of the Cell processor, including both an implicit cache-oblivious approach and a cache-aware algorithm blocked to match the cache structure.
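A minimal example of the cache-aware blocking idea, applied to a 2D 5-point Jacobi sweep; the grid and block size are illustrative, and real implementations handle boundaries, vectorization, and the cache-oblivious recursive variant more carefully.

```python
import numpy as np

def blocked_jacobi_step(u, block=64):
    """One 5-point Jacobi sweep over the interior, traversed in cache-sized blocks.

    Visiting the grid block by block keeps each working set cache-resident;
    the cache-oblivious variants achieve the same effect by recursive
    subdivision instead of a fixed block size.
    """
    n, m = u.shape
    out = u.copy()
    for i0 in range(1, n - 1, block):
        for j0 in range(1, m - 1, block):
            i1, j1 = min(i0 + block, n - 1), min(j0 + block, m - 1)
            out[i0:i1, j0:j1] = 0.25 * (u[i0-1:i1-1, j0:j1] + u[i0+1:i1+1, j0:j1]
                                        + u[i0:i1, j0-1:j1-1] + u[i0:i1, j0+1:j1+1])
    return out

u = np.random.default_rng(5).standard_normal((1024, 1024))
# Blocking changes only the traversal order, not the result.
assert np.allclose(blocked_jacobi_step(u), blocked_jacobi_step(u, block=1024))
```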
Practical Use of Polynomial Preconditionings for the Conjugate Gradient Method
TLDR
A version of the conjugate gradient algorithm is formulated that is more suitable for parallel architectures and the advantages of polynomial preconditioning in the context of these architectures are discussed.
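A sketch of the idea using SciPy's CG: the preconditioner applies a low-degree polynomial in A (here a truncated Neumann series built on the Jacobi splitting, only one of the polynomial choices discussed), so it needs nothing but matrix-vector products and therefore parallelizes like the rest of the iteration. The matrix, degree, and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

n = 2000
A = sp.random(n, n, density=2e-3, format="csr", random_state=6)
A = A @ A.T + sp.eye(n)        # symmetric positive definite test matrix (illustrative)
b = np.ones(n)

# Truncated Neumann-series polynomial preconditioner:
#   M^{-1} r = (I + N + N^2) D^{-1} r,  where D = diag(A) and N = I - D^{-1} A.
D_inv = 1.0 / A.diagonal()

def poly_prec(r, degree=2):
    z = D_inv * r
    v = z.copy()
    for _ in range(degree):
        v = z + v - D_inv * (A @ v)   # v <- z + (I - D^{-1} A) v, Horner-style
    return v

M = LinearOperator((n, n), matvec=poly_prec)
x, info = cg(A, b, M=M)        # only SpMVs and vector updates inside the preconditioner
assert info == 0
```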
Cache Optimization for Structured and Unstructured Grid Multigrid
TLDR
In this paper, suitable blocking strategies for both structured and unstructured grids are introduced to improve cache usage without changing the underlying algorithm.
Program partitioning and synchronization on multiprocessor systems
TLDR
A minimum-distance method is introduced that partitions a recurrence loop into independent execution sets, using the minimum dependence distance in each dimension of all dependence cycles to divide the loop's index set into independent partitions.
OSKI: A Library of Automatically Tuned Sparse Matrix Kernels
TLDR
An overview of OSKI is provided, which is based on research on automatically tuned sparse kernels for modern cache-based superscalar machines, and the primary aim of this interface is to hide the complex decision-making process needed to tune the performance of a kernel implementation for a particular user's sparse matrix and machine.
Using time skewing to eliminate idle time due to memory bandwidth and network limitations
  • D. Wonnacott
  • Computer Science
    Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000
  • 2000
TLDR
A generalization of time skewing for multiprocessor architectures is given, along with techniques for using multilevel caches to reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
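The core of time skewing can be illustrated on a 1D 3-point stencil: instead of finishing each time step over the whole array before starting the next, a tile of points is carried forward through all the time steps while it is still cache-resident, with the tile skewed one point per step to respect the data dependences. A heavily simplified sketch (array size, tile width, and step count are illustrative; for clarity it stores every time level, whereas a real implementation keeps only a sliding window):

```python
import numpy as np

def stencil_naive(u0, T):
    """T sweeps of a 3-point averaging stencil, one full array pass per step."""
    u = u0.copy()
    for _ in range(T):
        v = u.copy()
        v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
        u = v
    return u

def stencil_time_skewed(u0, T, tile=256):
    """Same result, but each spatial tile is advanced through all T steps,
    shifted one point left per step so every value it reads is already valid."""
    n = len(u0)
    levels = [u0.copy() for _ in range(T + 1)]   # levels[t] holds time step t
    for start in range(0, n + T, tile):          # extra tiles flush the skew at the end
        for t in range(1, T + 1):
            lo = max(1, start - (t - 1))
            hi = min(n - 1, start - (t - 1) + tile)
            if lo >= hi:
                continue
            prev, cur = levels[t - 1], levels[t]
            cur[lo:hi] = (prev[lo - 1:hi - 1] + prev[lo:hi] + prev[lo + 1:hi + 1]) / 3.0
    return levels[T]

u0 = np.random.default_rng(7).standard_normal(10_000)
# Skewing changes only the traversal order, not the computed values.
assert np.allclose(stencil_naive(u0, T=8), stencil_time_skewed(u0, T=8))
```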
...