The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques

@inproceedings{Haidar2018TheDO,
  title={The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques},
  author={Azzam Haidar and Ahmad Abdelfattah and Mawussi Zounon and Panruo Wu and Srikara Pranesh and Stanimire Tomov and Jack J. Dongarra},
  booktitle={ICCS},
  year={2018}
}
As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. Key Method: the proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors.
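As a rough illustration of component (1), here is a minimal mixed-precision iterative refinement sketch in Python. This is not the authors' code: the hypothetical helper `ir_solve` uses float32 as a stand-in for the paper's reduced precision, and `np.linalg.solve` refactorizes on every call where a real solver would reuse the low-precision LU factors.

```python
import numpy as np

def ir_solve(A, b, tol=1e-12, max_iters=50):
    """Solve Ax = b by low-precision solves plus high-precision refinement."""
    A32 = A.astype(np.float32)            # low-precision copy (stand-in for FP16)
    # Initial solve entirely in low precision, promoted back to float64.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x                     # residual in full (float64) precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break                         # converged to the requested tolerance
        # Correction solve in low precision; the update is applied in float64.
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x
```

Each iteration contracts the error by roughly the product of the condition number and the low-precision unit roundoff, so a well-conditioned system reaches float64 accuracy in a handful of cheap low-precision solves.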
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
TLDR
This work presents an investigation showing that other high-performance computing (HPC) applications can also harness the power of low-precision floating-point arithmetic, and shows how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to a 4× speedup.
Numerical algorithms for high-performance computational science
TLDR
This article discusses some approaches that can be taken to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers.
Fast Stencil-Code Computation on a Wafer-Scale Processor
TLDR
Large, sparse, and often structured systems of linear equations are solved on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well.
Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing
TLDR
It is shown how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability.
White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing
TLDR
An overview of various technologies related to minimal- and mixed-precision computing is provided, to outline the future direction of the project and to discuss current challenges together with the project members and guest speakers at the LSPANC 2020 workshop.
Performance impact of precision reduction in sparse linear systems solvers
TLDR
This work evaluates the benefits of using single precision arithmetic in solving a double precision sparse linear system using multiple cores, and finds that for the majority of the matrices, computing or applying incomplete factorization preconditioners in single precision provides at best modest performance benefits compared with the use of double precision.
Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems
TLDR
It is shown how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability.
Fast linear programming through transprecision computing on small and sparse data
TLDR
A simplex solver targeted at compilers is designed that reduces memory traffic, uses low-precision arithmetic units effectively, and exploits the wide SIMD instructions of modern microarchitectures.
Simulating Low Precision Floating-Point Arithmetic (Higham, Nicholas J. and Pranesh, Srikara, 2019)
TLDR
This work provides a MATLAB function chop that can be used to efficiently simulate fp16, bfloat16, and other low precision arithmetics, with or without the representation of subnormal numbers and with the options of round to nearest, directed rounding, stochastic rounding, and random bit flips in the significand.
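The rounding simulation described in that TLDR can be sketched in NumPy. This is a hypothetical, simplified port, not the actual MATLAB `chop`: it truncates only the significand to t bits and ignores chop's exponent-range limiting, subnormal handling, and bit-flip options.

```python
import numpy as np

def chop(x, t=11, rounding="nearest", rng=None):
    """Round the significand of x to t bits (t=11 mimics IEEE fp16,
    t=8 mimics bfloat16). The exponent range is NOT limited, unlike
    the real chop function."""
    x = np.asarray(x, dtype=np.float64)
    m, e = np.frexp(x)                  # x = m * 2**e with 0.5 <= |m| < 1
    y = m * 2.0 ** t                    # shift t significand bits left of the point
    if rounding == "nearest":
        y_r = np.round(y)               # np.round ties to even, matching IEEE
    elif rounding == "stochastic":
        rng = rng or np.random.default_rng()
        frac = y - np.floor(y)          # round up with probability = fractional part
        y_r = np.floor(y) + (rng.random(y.shape) < frac)
    else:
        raise ValueError(f"unknown rounding mode: {rounding}")
    return np.ldexp(y_r / 2.0 ** t, e)  # reassemble rounded significand and exponent
```

With t=11 and round-to-nearest, the result agrees with NumPy's native `float16` conversion whenever the value is within fp16's exponent range.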

References

SHOWING 1-10 OF 21 REFERENCES
Investigating half precision arithmetic to accelerate dense linear system solvers
TLDR
This work shows for a first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32 or FP64-precision Ax = b solvers.
Dense linear algebra solvers for multicore with GPU accelerators
TLDR
This work describes how to design dense linear algebra (DLA) solvers that effectively use the high computing power available in new and emerging hybrid architectures combining multicore CPUs with GPU accelerators, and presents the newly developed DLA solvers.
PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications
TLDR
A framework is extended to isolate the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and to correlate these measurements to application functions, revealing conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance.
Adagio: making DVS practical for complex HPC applications
TLDR
Adagio is presented, a novel runtime system that makes DVS practical for complex, real-world scientific applications by incurring only negligible delay while achieving significant energy savings.
Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing
TLDR
Evaluating the High-Productivity Reconfigurable Computer (HPRC) approach to FPGA programming, where a commodity CPU instruction set architecture is augmented with instructions which execute on a specialised FPGA co-processor, shows that high-productivity reconfigurable computing systems outperform GPUs in applications with poor locality characteristics and low memory bandwidth requirements.
Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi
TLDR
A detailed study is presented of controlling power usage and of how different power caps affect the performance of numerical algorithms with different computational intensities, determining their impact on, and correlation with, the performance of scientific applications.
Towards dense linear algebra for hybrid GPU accelerated manycore systems
MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing
TLDR
The design and implementation of embedded-system-aware algorithms that target these challenges in the area of dense linear algebra, illustrated using the LU, QR, and Cholesky factorizations, together with performance optimizations developed for both small and large problems.
Understanding the future of energy-performance trade-off via DVFS in HPC environments
Power Aware Computing on GPUs
TLDR
The Activity-based Model for GPUs (AMG) is introduced, from which activity factors and power for GPU microarchitectures are identified, helping to analyze power tradeoffs of one component versus another using microbenchmarks.