Corpus ID: 18111389

General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform

@inproceedings{Christen2007GeneralPurposeSM,
  title={General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform},
  author={Matthias Christen and Olaf Schenk and Helmar Burkhart},
  year={2007}
}
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify e.g. the matrix-matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We…
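The entry point the abstract describes can be sketched: in a sparse direct (multifrontal) factorization, the dominant cost is a dense rank-k Schur-complement update, which is exactly a matrix-matrix multiply and can be offloaded wholesale. A minimal illustrative sketch, assuming NumPy as a stand-in for the GPU BLAS call the paper actually uses:

```python
import numpy as np

def partial_factor(front, k):
    """Factor the leading k x k block of a dense symmetric positive
    definite frontal matrix and form the Schur complement of the
    trailing block. The rank-k update (a GEMM) dominates the cost
    and is the natural candidate for GPU offload."""
    F = front.copy()
    L11 = np.linalg.cholesky(F[:k, :k])        # small dense Cholesky
    L21 = np.linalg.solve(L11, F[:k, k:]).T    # triangular solve
    # Schur complement update: this GEMM is the GPU entry point
    schur = F[k:, k:] - L21 @ L21.T
    return L11, L21, schur
```

In a minimally invasive integration, only the `L21 @ L21.T` product (and possibly the triangular solve) would be routed to the GPU, leaving the rest of the factorization code untouched.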
Singular value decomposition on GPU using CUDA
TLDR
This paper presents the implementation of singular value decomposition (SVD) of a dense matrix on the GPU using the CUDA programming model and shows a speedup of up to 60 over the MATLAB implementation and up to 8 over the Intel MKL implementation on an Intel Dual Core 2.66 GHz PC for large matrices.
Multifrontal Factorization of Sparse SPD Matrices on GPUs
TLDR
This paper presents an adaptive hybrid approach for accelerating sparse multifrontal factorization based on a judicious exploitation of the processing power of the host CPU and GPU, and proposes a mechanism for a runtime selection of the appropriate policy for each step of sparse Cholesky factorization.
Effective Sparse Matrix Representation for the GPU Architectures
TLDR
A new format for sparse matrix representation tailored to graphics processor architectures is given that can yield a 2x to 5x performance improvement over CSR (compressed sparse row) format and a 3x to 10x improvement over the CSR vector format for the class of applications that fit the proposed format.
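For context, the CSR baseline that such formats are measured against stores a sparse matrix in three flat arrays, and sparse matrix-vector multiply walks them row by row. A minimal sketch of standard CSR SpMV (the generic baseline, not the paper's proposed format):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix in CSR form: `values` holds the nonzeros
    row by row, `col_idx` their column indices, and `row_ptr[i]` the
    offset in `values` where row i begins (row_ptr has n+1 entries)."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[j] * x[col_idx[j]]
    return y
```

On a GPU, the irregular per-row work and the indirect access `x[col_idx[j]]` are what alternative formats try to regularize so that memory accesses coalesce.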
Sparse LU factorization for parallel circuit simulation on GPU
TLDR
A GPU-based sparse LU solver for circuit simulation is proposed that optimizes the work partitioning, the number of active thread groups, and the memory access pattern based on the GPU architecture, and the scalability of parallel sparse LU factorization is analyzed.
GPU-Accelerated Sparse LU Factorization for Circuit Simulation with Performance Modeling
TLDR
This paper develops a hybrid parallel LU factorization approach combining task-level and data-level parallelism on GPUs and investigates bottlenecks of the proposed approach by a parametric performance model.
Automatic transformation and optimization of applications on gpus and gpu clusters
TLDR
An auto-tuning framework which selects algorithms and parameters according to some cost model and thresholds extracted from simple micro-benchmarks is developed, and a loop transformation system in the environment of multi-level memory hierarchy is developed.
Model-driven autotuning of sparse matrix-vector multiply on GPUs
TLDR
A performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU) and shows that the model can identify the implementations that achieve within 15% of those found through exhaustive search.
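The autotuning pattern this entry describes, benchmarking candidate implementations and selecting the best, can be sketched generically. A hypothetical, simplified sketch: an exhaustive micro-benchmark tuner, whereas the paper's framework uses a performance model to prune the search and is GPU-specific:

```python
import timeit

def autotune(candidates, sample_args, repeats=3):
    """Return the candidate callable with the lowest measured runtime
    on the sample inputs. This is exhaustive search; a model-driven
    tuner would rank candidates analytically and time only a few."""
    best, best_t = None, float("inf")
    for fn in candidates:
        t = min(timeit.repeat(lambda: fn(*sample_args),
                              number=10, repeat=repeats))
        if t < best_t:
            best, best_t = fn, t
    return best
```

The model-driven approach matters because on GPUs each candidate kernel must be compiled and launched, so timing every variant exhaustively is far more expensive than on a CPU.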
Generating Approximate Inverse Preconditioners for Sparse Matrices Using CUDA and GPGPU
TLDR
The techniques associated with applying GPGPU and CUDA to the generation of right-looking approximate inverse preconditioners (AINV) and preconditioned GMRES based on them are discussed; the techniques can be applied to other Krylov solvers and preconditioners.
Nonzero pattern analysis and memory access optimization in GPU-based sparse LU factorization for circuit simulation
TLDR
This work investigates the nonzero patterns and memory access patterns in sparse LU factorization, and explores the common features to give guidelines on the improvements of the GPU solvers.
Accelerating frequency-domain diffuse optical tomographic image reconstruction using graphics processing units.
TLDR
These studies indicate that single-precision computations are sufficient for diffuse optical tomographic image reconstruction and that GPUs offer substantial speedups over traditional CPUs for three-dimensional reconstruction, making GPUs more attractive in clinical settings.

References

SHOWING 1-10 OF 31 REFERENCES
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
TLDR
A novel algorithm to solve dense linear systems using graphics processors (GPUs) by reducing matrix decomposition and row operations to a series of rasterization problems on the GPU and demonstrating that the commodity GPU is a useful co-processor for many scientific applications.
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
TLDR
This work examines sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs, and presents several optimization strategies especially effective for the multicore environment.
Linear algebra operators for GPU implementation of numerical algorithms
TLDR
This work proposes a stream model for arithmetic operations on vectors and matrices that exploits the intrinsic parallelism and efficient communication on modern GPUs and introduces a framework for the implementation of linear algebra operators on programmable graphics processors (GPUs), thus providing the building blocks for the design of more complex numerical algorithms.
Sparse matrix solvers on the GPU: conjugate gradients and multigrid
TLDR
This work implemented two basic, broadly useful, computational kernels: a sparse matrix conjugate gradient solver and a regular-grid multigrid solver for high-intensity numerical simulation of geometric flow and fluid simulation on the GPU.
Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations
TLDR
A parallel factorization code is described which utilizes the supernodal structure of the matrix to substantially reduce the number of memory references, resulting in greatly increased factorization performance.
Block sparse Cholesky algorithms on advanced uniprocessor computers
TLDR
Two sparse Cholesky factorization algorithms are examined in a systematic and consistent fashion, both to illustrate the strengths of the blocking techniques in general and to obtain a fair evaluation of the two approaches.
A Survey of General-Purpose Computation on Graphics Hardware
TLDR
The techniques used in mapping general-purpose computation to graphics hardware will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques.
Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations
TLDR
This survey paper compares native double precision solvers with emulated- and mixed-precision solvers of linear systems of equations as they typically arise in finite element discretisations and concludes that the mixed precision approach works very well with the parallel co-processors gaining speedup factors and area savings, while maintaining the same accuracy as a reference solver executing everything in double precision.
Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO
TLDR
Experiments demonstrate that a wide set of unsymmetric linear systems can be solved and high performance is consistently achieved for large sparse unsymmetric matrices from real world applications.