Improving Performance of Hypermatrix Cholesky Factorization

@inproceedings{Herrero2003ImprovingPO,
  title={Improving Performance of Hypermatrix Cholesky Factorization},
  author={Jos{\'e} R. Herrero and Juan J. Navarro},
  booktitle={Euro-Par},
  year={2003}
}
This paper shows how a sparse hypermatrix Cholesky factorization can be improved. This is accomplished by means of efficient codes which operate on very small dense matrices. Different matrix sizes or target platforms may require different codes to obtain good performance. We write a set of codes for each matrix operation using different loop orders and unroll factors. Then, for each matrix size, we automatically compile each code fixing matrix leading dimensions and loop sizes, run the… 
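The selection scheme described in the abstract can be illustrated with a toy example. The C program below is a minimal sketch, not the authors' code generator: the block size NB, the two loop orders and the unroll factor are arbitrary choices for illustration, and in practice one variant would be generated and compiled per matrix size and target platform.

/*
 * Two variants of a small fixed-size matrix multiplication C += A*B, with the
 * block size and leading dimension fixed at compile time so the compiler can
 * fully unroll.  A tiny driver times both and keeps the faster one, mimicking
 * the "compile each code, run it, select the best" process described above.
 */
#include <stdio.h>
#include <time.h>

#define NB 4            /* block size, fixed at compile time (assumed value) */
#define LD NB           /* leading dimension, also fixed */

/* Variant 1: plain ijk loop order. */
static void mxm_ijk(const double *A, const double *B, double *C)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                C[i * LD + j] += A[i * LD + k] * B[k * LD + j];
}

/* Variant 2: jki loop order with the k loop unrolled by a factor of 2. */
static void mxm_jki_u2(const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j++)
        for (int k = 0; k < NB; k += 2)
            for (int i = 0; i < NB; i++)
                C[i * LD + j] += A[i * LD + k]     * B[k * LD + j]
                               + A[i * LD + k + 1] * B[(k + 1) * LD + j];
}

static double time_variant(void (*f)(const double *, const double *, double *),
                           int reps)
{
    double A[NB * LD], B[NB * LD], C[NB * LD];
    for (int i = 0; i < NB * LD; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        f(A, B, C);
    double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
    volatile double sink = C[0];      /* keep the result live */
    (void)sink;
    return t;
}

int main(void)
{
    const int reps = 1000000;
    double t1 = time_variant(mxm_ijk, reps);
    double t2 = time_variant(mxm_jki_u2, reps);
    printf("ijk: %.3fs  jki+unroll2: %.3fs  -> keep %s\n",
           t1, t2, t1 < t2 ? "ijk" : "jki+unroll2");
    return 0;
}
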
Analysis of a sparse hypermatrix Cholesky with fixed-sized blocking
  • J. Herrero, J. Navarro
  • Computer Science
    Applicable Algebra in Engineering, Communication and Computing
  • 2007
TLDR
This work presents how an implementation of a sparse Cholesky factorization based on a hypermatrix data structure is constructed, compares its performance with several other codes, and analyzes the results.
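In these works the hypermatrix is a hierarchy of pointer matrices whose lowest level holds small dense data submatrices, with null pointers marking blocks that contain only zeros. The C fragment below is a rough two-level sketch of that idea under an assumed block size; the actual implementations use more levels and different storage conventions.

#include <stdlib.h>

#define DB 4                          /* data-submatrix (block) size, assumed */

typedef struct {
    double a[DB][DB];                 /* small dense block, operated on by
                                         specialized fixed-size kernels */
} data_block;

typedef struct {
    int nb;                           /* matrix dimension in blocks */
    data_block **blocks;              /* nb*nb pointers; NULL means zero block */
} hypermatrix;

static hypermatrix *hm_create(int nb)
{
    hypermatrix *h = malloc(sizeof *h);
    h->nb = nb;
    h->blocks = calloc((size_t)nb * nb, sizeof *h->blocks);  /* all blocks zero */
    return h;
}

/* Return (allocating on demand) the dense block holding element (i, j). */
static data_block *hm_block(hypermatrix *h, int i, int j)
{
    data_block **slot = &h->blocks[(i / DB) * h->nb + (j / DB)];
    if (*slot == NULL)
        *slot = calloc(1, sizeof **slot);
    return *slot;
}

int main(void)
{
    hypermatrix *h = hm_create(8);                 /* 8x8 blocks = 32x32 matrix */
    hm_block(h, 5, 5)->a[5 % DB][5 % DB] = 3.0;    /* touch a single element */
    return 0;                                      /* deallocation omitted */
}
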
Optimization of a Statically Partitioned Hypermatrix Sparse Cholesky Factorization
TLDR
This paper presents an improvement to the sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure, compares its performance with several codes and analyzes the results.
Reducing Overhead in Sparse Hypermatrix Cholesky Factorization
TLDR
This paper presents several techniques for reducing the operations on zeros in a sparse hypermatrix Cholesky factorization, including associating a bit with each column within a data submatrix that stores non-zeros (dense window).
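As a hypothetical illustration of the bit-per-column idea, the sketch below attaches a small bitmask to a data submatrix so that a kernel can skip columns that hold no non-zeros; the block size, the mask layout and the matrix-vector kernel are assumptions chosen for brevity, not the paper's actual data layout or operations.

#include <stdint.h>
#include <stdio.h>

#define DB 8                              /* block size, assumed */

typedef struct {
    double   a[DB][DB];                   /* dense storage, may contain zeros */
    uint32_t col_mask;                    /* bit j set => column j has non-zeros */
} data_block;

/* y += block * x, visiting only the columns flagged as non-zero. */
static void block_matvec(const data_block *b, const double *x, double *y)
{
    for (int j = 0; j < DB; j++) {
        if (!(b->col_mask & (1u << j)))   /* all-zero column: skip the work */
            continue;
        for (int i = 0; i < DB; i++)
            y[i] += b->a[i][j] * x[j];
    }
}

int main(void)
{
    data_block b = { .col_mask = 0 };
    b.a[0][3] = 2.0;                      /* only column 3 is populated */
    b.col_mask |= 1u << 3;

    double x[DB] = { [3] = 1.0 }, y[DB] = { 0 };
    block_matvec(&b, x, y);
    printf("y[0] = %g\n", y[0]);          /* prints 2 */
    return 0;
}
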
Intra-Block Amalgamation in Sparse Hypermatrix Cholesky Factorization
TLDR
An improvement to the sequential in-core implementation of a sparse Cholesky factorization based on a hypermatrix storage structure is presented, which allows the inclusion of additional zeros in data submatrices to create larger blocks and uses more efficient routines for matrix multiplication.
Autotuning and Specialization: Speeding up Matrix Multiply for Small Matrices with Compiler Technology
TLDR
A compiler optimization approach that combines novel autotuning compiler technology with specialization for expected data set sizes of key computations, focused on matrix multiplication of small matrices is presented.
A Study on Load Imbalance in Parallel Hypermatrix Multiplication Using OpenMP
TLDR
This work used OpenMP for the parallelization of a matrix multiplication code based on the hypermatrix data structure and experimented with several features available with OpenMP in the Intel Fortran Compiler: scheduling algorithms, chunk sizes and nested parallelism.
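A minimal OpenMP sketch of this kind of scheduling experiment follows, simplified to a plain blocked matrix product in C (rather than the hypermatrix code, and usable with any OpenMP compiler rather than the Intel Fortran Compiler specifically); the block counts and sizes are arbitrary assumptions.

#include <omp.h>
#include <stdio.h>

#define NB  64          /* matrix size in blocks (assumed) */
#define DB  16          /* block size (assumed)            */
#define N   (NB * DB)

static double A[N][N], B[N][N], C[N][N];

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    /* The schedule and chunk size are chosen at run time via OMP_SCHEDULE,
       so different scheduling algorithms can be compared without recompiling. */
    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(runtime)
    for (int bi = 0; bi < NB; bi++)             /* block row of C */
        for (int bj = 0; bj < NB; bj++)         /* block column of C */
            for (int bk = 0; bk < NB; bk++)     /* inner block product */
                for (int i = bi * DB; i < (bi + 1) * DB; i++)
                    for (int j = bj * DB; j < (bj + 1) * DB; j++)
                        for (int k = bk * DB; k < (bk + 1) * DB; k++)
                            C[i][j] += A[i][k] * B[k][j];
    printf("%d threads, %.2fs, C[0][0]=%g\n",
           omp_get_max_threads(), omp_get_wtime() - t0, C[0][0]);
    return 0;
}

Compiling with -fopenmp and running with, for example, OMP_NUM_THREADS=8 OMP_SCHEDULE="dynamic,4" allows static, dynamic and guided schedules and various chunk sizes to be compared without rebuilding the code.
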
Improving High-Performance Sparse Libraries Using Compiler-Assisted Specialization: A PETSc Case Study
TLDR
This work studies the performance gains achieved by specializing the single processor sparse linear algebra functions in PETSc (Portable, Extensible Toolkit for Scientific Computation) in the context of three scalable scientific applications on the Hopper Cray XE6 Supercomputer at NERSC.
Compiler-Optimized Kernels: An Efficient Alternative to Hand-Coded Inner Kernels
TLDR
This paper presents an alternative way to produce efficient matrix multiplication kernels, based on a set of simple codes that can be parameterized at compilation time and that yield high-performance sparse and dense linear algebra codes on a variety of platforms.
Hypermatrix oriented supernode amalgamation
TLDR
A supernode amalgamation algorithm which takes into account the characteristics of a hypermatrix data structure is introduced and the resulting frontal tree is then used to create a variable-sized partitioning of the hypermatrix.
Improving high-performance sparse libraries using compiler assisted specialization: A PETSc (portable, extensible toolkit for scientific computation) case study
TLDR
This work studies the effects of the execution environment on sparse computations and designs optimization strategies based on these effects, including novel techniques that augment well-known source-to-source transformations to significantly improve the quality of the instructions generated by the back-end compiler.
...
...

References

Block sparse Cholesky algorithms on advanced uniprocessor computers
TLDR
Two sparse Cholesky factorization algorithms are examined in a systematic and consistent fashion, both to illustrate the strengths of the blocking techniques in general and to obtain a fair evaluation of the two approaches.
An efficient block-oriented approach to parallel sparse Cholesky factorization
TLDR
The authors propose and evaluate an approach that is simple to implement, provides slightly higher performance than column (and panel) methods on small parallel machines, and has the potential to provide much higher performance on large parallel machines.
Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology
TLDR
A BLAS GEMM-compatible multi-level cache-blocked matrix multiply generator which produces code that achieves around 90% of peak on the Sparcstation-20/61, IBM RS/6000-590, HP 712/80i, SGI Power Challenge R8k, and SGI Octane R10k, and over 80% of peak on the SGI Indigo R4k.
The influence of relaxed supernode partitions on the multifrontal method
TLDR
An algorithm for partitioning the nodes of a graph into supernodes is presented, which improves the performance of the multifrontal method for the factorization of large, sparse matrices on vector computers, and factorizes the extremely sparse electric power matrices faster than the general sparse algorithm.
BlockSolve95 users manual: Scalable library software for the parallel solution of sparse linear systems
TLDR
This report gives detailed instructions on the use of BlockSolve95 and descriptions of a number of program examples that can be used as templates for application programs.
Data prefetching and multilevel blocking for linear algebra operations
TLDR
This paper analyzes the behavior of matrix multiplication algorithms for large matrices on a superscalar and superpipelined processor with a multilevel memory hierarchy when these techniques are applied together, and compares two different approaches to data prefetching, binding versus non-binding, and finds the latter remarkably more effective than the former due mainly to its flexibility.
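Non-binding prefetching can be sketched with GCC's __builtin_prefetch, a hint the processor is free to ignore (which is what makes it non-binding, unlike a prefetch realized as an ordinary load into a register). The one-block prefetch distance used below is an arbitrary choice for illustration, not the tuning studied in the paper.

#include <stdio.h>

#define N  4096
#define BS 64                             /* block size, assumed */

static double x[N];

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;
    for (int b = 0; b < N; b += BS) {
        if (b + BS < N)                   /* hint: fetch the next block early */
            __builtin_prefetch(&x[b + BS], 0, 1);   /* read, low temporal locality */
        for (int i = b; i < b + BS; i++)
            sum += x[i] * x[i];
    }
    printf("%g\n", sum);                  /* prints 4096 */
    return 0;
}
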
Sparse Matrix Structure for Dynamic Parallelisation Efficiency
TLDR
A new approach for blocking that saves storage and decreases the computation critical path is presented and a data distribution step is proposed that drives the dynamic scheduler decisions such that an efficient parallelisation can be achieved even on slow multiprocessor networks.
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
TLDR
This paper addresses the problem of selecting tile sizes and unroll factors simultaneously, compares the levels of optimization obtained by iterative compilation with several well-known static techniques, and shows that iterative compilation outperforms each of them on a range of benchmarks across a variety of architectures.
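The iterative-compilation search can be sketched as a driver that compiles and times every candidate configuration and keeps the best one. The program below is a toy illustration only: it assumes a POSIX system with gcc on the PATH and a hypothetical kernel.c that honours -DTILE and -DUNROLL and prints its own run time (in seconds) to stdout.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int tiles[]   = { 16, 32, 64 };     /* candidate tile sizes (assumed) */
    const int unrolls[] = { 1, 2, 4 };        /* candidate unroll factors       */
    double best_t = 1e30;
    int best_tile = 0, best_unroll = 0;
    char cmd[256];

    for (size_t i = 0; i < sizeof tiles / sizeof *tiles; i++) {
        for (size_t j = 0; j < sizeof unrolls / sizeof *unrolls; j++) {
            /* Compile this (tile, unroll) configuration and run it. */
            snprintf(cmd, sizeof cmd,
                     "gcc -O3 -DTILE=%d -DUNROLL=%d kernel.c -o kernel && ./kernel",
                     tiles[i], unrolls[j]);
            FILE *p = popen(cmd, "r");
            if (!p) { perror("popen"); return 1; }
            double t;
            if (fscanf(p, "%lf", &t) == 1 && t < best_t) {
                best_t = t;
                best_tile = tiles[i];
                best_unroll = unrolls[j];
            }
            pclose(p);
        }
    }
    printf("best: TILE=%d UNROLL=%d (%.3fs)\n", best_tile, best_unroll, best_t);
    return 0;
}
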
Automatically Tuned Linear Algebra Software
TLDR
An approach is presented for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units, targeting the widely used linear algebra kernels known as the Basic Linear Algebra Subroutines (BLAS).
...
...