# Improving Performance of Hypermatrix Cholesky Factorization

```bibtex
@inproceedings{Herrero2003ImprovingPO,
  title     = {Improving Performance of Hypermatrix Cholesky Factorization},
  author    = {Jos{\'e} R. Herrero and Juan J. Navarro},
  booktitle = {Euro-Par},
  year      = {2003}
}
```

This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code fixing matrix leading dimensions and loop sizes, run the…

## 19 Citations

Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking

- Computer Science, Applicable Algebra in Engineering, Communication and Computing
- 2007

This work presents the way in which an implementation of a sparse Cholesky factorization based on a hypermatrix data structure is constructed and compares its performance with several other codes and analyzes the results.

Optimization of a Statically Partitioned Hypermatrix Sparse Cholesky Factorization

- Computer Science, PARA
- 2004

This paper presents an improvement to the sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure, compares its performance with several codes and analyzes the results.

Reducing Overhead in Sparse Hypermatrix Cholesky Factorization

- Computer Science
- 2005

This paper presents several techniques for reducing operations on zeros in a sparse hypermatrix Cholesky factorization, including associating a bit with each column within a data submatrix that stores non-zeros (a dense window).

Intra-Block Amalgamation in Sparse Hypermatrix Cholesky Factorization

- Computer Science
- 2005

An improvement to the sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure is presented, which allows the inclusion of additional zeros in data submatrices to create larger blocks and uses more efficient routines for matrix multiplication.

Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology

- Computer Science, Software Automatic Tuning: From Concepts to State-of-the-Art Results
- 2010

A compiler optimization approach is presented that combines novel autotuning compiler technology with specialization for the expected data set sizes of key computations, focused on matrix multiplication of small matrices.

A Study on Load Imbalance in Parallel Hypermatrix Multiplication Using OpenMP

- Computer Science, PPAM
- 2005

This work used OpenMP for the parallelization of a matrix multiplication code based on the hypermatrix data structure and experimented with several features available with OpenMP in the Intel Fortran Compiler: scheduling algorithms, chunk sizes and nested parallelism.

Improving High-Performance Sparse Libraries Using Compiler-Assisted Specialization: A PETSc Case Study

- Computer Science, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
- 2012

This work studies the performance gains achieved by specializing the single processor sparse linear algebra functions in PETSc (Portable, Extensible Toolkit for Scientific Computation) in the context of three scalable scientific applications on the Hopper Cray XE6 Supercomputer at NERSC.

Compiler-Optimized Kernels: An Efficient Alternative to Hand-Coded Inner Kernels

- Computer Science, ICCSA
- 2006

This paper presents an alternative way to produce efficient matrix multiplication kernels, based on a set of simple codes which can be parameterized at compilation time, that yields high-performance sparse and dense linear algebra codes on a variety of platforms.

Hypermatrix oriented supernode amalgamation

- Computer Science, The Journal of Supercomputing
- 2008

A supernode amalgamation algorithm which takes into account the characteristics of a hypermatrix data structure is introduced, and the resulting frontal tree is then used to create a variable-sized partitioning of the hypermatrix.

Improving high-performance sparse libraries using compiler assisted specialization: A PETSc (portable, extensible toolkit for scientific computation) case study

- Computer Science
- 2012

This work studies the effects of the execution environment on sparse computations and designs optimization strategies based on these effects, including novel techniques that augment well-known source-to-source transformations to significantly improve the quality of the instructions generated by the back-end compiler.

## References

SHOWING 1-10 OF 27 REFERENCES

Block sparse Cholesky algorithms on advanced uniprocessor computers

- Computer Science
- 1991

Two sparse Cholesky factorization algorithms are examined in a systematic and consistent fashion, both to illustrate the strengths of the blocking techniques in general and to obtain a fair evaluation of the two approaches.

An efficient block-oriented approach to parallel sparse Cholesky factorization

- Computer Science, Supercomputing '93 Proceedings
- 1993

The authors propose and evaluate an approach that is simple to implement, provides slightly higher performance than column (and panel) methods on small parallel machines, and has the potential to provide much higher performance on large parallel machines.

Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

- Computer Science, ICS '97
- 1997

A BLAS GEMM-compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the SPARCstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k.

The influence of relaxed supernode partitions on the multifrontal method

- Computer Science, TOMS
- 1989

An algorithm for partitioning the nodes of a graph into supernodes is presented, which improves the performance of the multifrontal method for the factorization of large, sparse matrices on vector computers, and factorizes the extremely sparse electric power matrices faster than the general sparse algorithm.

Hypermatrix solution of large sets of symmetric positive-definite linear equations

- Computer Science
- 1972

BlockSolve95 users manual: Scalable library software for the parallel solution of sparse linear systems

- Computer Science
- 1995

This report gives detailed instructions on the use of BlockSolve95 and descriptions of a number of program examples that can be used as templates for application programs.

Data prefetching and multilevel blocking for linear algebra operations

- Computer Science, ICS '96
- 1996

This paper analyzes the behavior of matrix multiplication algorithms for large matrices on a superscalar, superpipelined processor with a multilevel memory hierarchy when these techniques are applied together. It also compares two approaches to data prefetching, binding versus non-binding, and finds the latter remarkably more effective than the former, due mainly to its flexibility.

Sparse Matrix Structure for Dynamic Parallelisation Efficiency

- Computer Science, Euro-Par
- 2000

A new approach for blocking that saves storage and decreases the computation critical path is presented and a data distribution step is proposed that drives the dynamic scheduler decisions such that an efficient parallelisation can be achieved even on slow multiprocessor networks.

Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation

- Computer Science, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques
- 2000

This paper addresses the problem of how to select tile sizes and unroll factors simultaneously and compares the levels of optimization obtained by iterative compilation to several well-known static techniques and shows that they outperform each of them on a range of benchmarks across a variety of architectures.

Automatically Tuned Linear Algebra Software

- Computer Science, Proceedings of the IEEE/ACM SC98 Conference
- 1998

An approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, targeting the widely used linear algebra kernels known as the Basic Linear Algebra Subprograms (BLAS).