# The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques

@inproceedings{Haidar2018TheDO,
  title     = {The Design of Fast and Energy-Efficient Linear Solvers: On the Potential of Half-Precision Arithmetic and Iterative Refinement Techniques},
  author    = {Azzam Haidar and Ahmad Abdelfattah and Mawussi Zounon and Panruo Wu and Srikara Pranesh and Stanimire Tomov and Jack J. Dongarra},
  booktitle = {ICCS},
  year      = {2018}
}

As parallel computers approach exascale, power efficiency in high-performance computing (HPC) systems is of increasing concern. [...] The proposed energy-efficient linear system solvers are based on two main components: (1) iterative refinement techniques, and (2) reduced-precision computing features in modern accelerators and coprocessors.
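The interplay of the two components can be illustrated with a minimal NumPy sketch of mixed-precision iterative refinement. This is not the paper's GPU implementation: float32 stands in for the half-precision factorization (LAPACK-backed `numpy.linalg.solve` does not accept FP16), residuals are accumulated in float64, and all names are illustrative.

```python
import numpy as np

def mixed_precision_refine(A, b, iters=10, tol=1e-12):
    """Solve Ax = b by iterative refinement: the inner solves run on a
    reduced-precision copy of A (float32 as a stand-in for FP16), while
    residuals are computed and accumulated in float64."""
    A32 = A.astype(np.float32)          # reduced-precision working copy
    # Initial solve in low precision (a real solver would factorize A32
    # once with LU and reuse the factors for every inner solve).
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                   # residual in full (float64) precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # Correction solve in low precision, update in full precision.
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_refine(A, b)
```

For a well-conditioned matrix like this one, a couple of refinement sweeps recover a residual near float64 round-off even though every solve ran at reduced precision, which is the effect the paper exploits for both speed and energy.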

## 44 Citations

Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

- Computer Science
- SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2018

This work presents an investigation showing that other high-performance computing (HPC) applications can also harness this floating-point power, and shows how using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to a 4× speedup.

Numerical algorithms for high-performance computational science

- Computer Science
- Philosophical Transactions of the Royal Society A
- 2020

This article discusses some approaches that can be taken to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers.

Fast Stencil-Code Computation on a Wafer-Scale Processor

- Computer Science
- SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2020

Large, sparse, and often structured systems of linear equations are solved on the Cerebras Systems CS-1, a wafer-scale processor with the memory bandwidth and communication latency to perform well.

Mixed-Precision Solution of Linear Systems Using Accelerator-Based Computing

- Computer Science
- 2020

It is shown how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability.

White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing

- Computer Science
- ArXiv
- 2020

An overview of various technologies related to minimal- and mixed-precision computing is provided, to outline the future direction of the project and to discuss current challenges together with the project members and guest speakers at the LSPANC 2020 workshop.

Performance impact of precision reduction in sparse linear systems solvers

- Computer Science
- PeerJ Comput. Sci.
- 2022

This work evaluates the benefits of using single-precision arithmetic to solve a double-precision sparse linear system on multiple cores, and finds that for the majority of the matrices, computing or applying incomplete-factorization preconditioners in single precision provides at best modest performance benefits over double precision.

Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems

- Computer Science
- Proceedings of the Royal Society A
- 2020

It is shown how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability.

Fast linear programming through transprecision computing on small and sparse data

- Computer Science
- Proc. ACM Program. Lang.
- 2020

A simplex solver targeted at compilers is designed that reduces memory traffic, uses low-precision arithmetic units effectively, and exploits the wide SIMD instructions of modern microarchitectures.

Modeling the effect of application-specific program transformations on energy and performance improvements of parallel ODE solvers

- Computer Science
- J. Comput. Sci.
- 2021

Simulating Low Precision Floating-Point Arithmetic, by Nicholas J. Higham and Srikara Pranesh

- Computer Science
- 2019

This work provides a MATLAB function chop that can be used to efficiently simulate fp16, bfloat16, and other low precision arithmetics, with or without the representation of subnormal numbers and with the options of round to nearest, directed rounding, stochastic rounding, and random bit flips in the significand.
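The idea behind such a simulator can be sketched outside MATLAB as well. The snippet below is a hypothetical NumPy analogue (the function name and the restriction to one format and one rounding mode are my own) that rounds float32 values to bfloat16 precision with round-to-nearest-even, by bit manipulation of the float32 representation:

```python
import numpy as np

def chop_to_bfloat16(x):
    """Round float32 values to bfloat16 precision (8 significand bits)
    with round-to-nearest-even, returning float32 values that are exactly
    representable in bfloat16. Illustrative sketch only; it ignores the
    overflow-to-infinity edge case near the float32 maximum."""
    x32 = np.asarray(x, dtype=np.float32)
    bits = np.atleast_1d(x32).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit (ties-to-even), then truncate
    # the low 16 bits; this is the standard bfloat16 rounding trick.
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return rounded.astype(np.uint32).view(np.float32).reshape(x32.shape)

vals = chop_to_bfloat16(np.array([1.0, 1.001], dtype=np.float32))
```

Near 1.0 the bfloat16 spacing is 2⁻⁷ ≈ 0.0078, so 1.001 rounds back to exactly 1.0, which makes the precision loss easy to observe.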

## References


Investigating half precision arithmetic to accelerate dense linear system solvers

- Computer Science
- ScalA@SC
- 2017

This work shows for the first time how the use of FP16 arithmetic can significantly accelerate, as well as make more energy efficient, FP32- or FP64-precision Ax = b solvers.

Dense linear algebra solvers for multicore with GPU accelerators

- Computer Science
- 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
- 2010

This work describes how to design and develop dense linear algebra (DLA) solvers that effectively use the high computing power available in new and emerging hybrid multicore-plus-GPU architectures, and presents the newly developed DLA solvers.

PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications

- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2010

A framework that isolates the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and correlates these measurements to application functions is extended, and it is shown conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance.

Adagio: making DVS practical for complex HPC applications

- Computer Science
- ICS '09
- 2009

Adagio is presented, a novel runtime system that makes DVS practical for complex, real-world scientific applications by incurring only negligible delay while achieving significant energy savings.

Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing

- Computer Science
- 2010 International Conference on Field-Programmable Technology
- 2010

An evaluation of the High-Productivity Reconfigurable Computer (HPRC) approach to FPGA programming, in which a commodity CPU instruction set architecture is augmented with instructions that execute on a specialised FPGA co-processor, shows that high-productivity reconfigurable computing systems outperform GPUs in applications with poor locality characteristics and low memory bandwidth requirements.

Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi

- Computer Science
- 2017 IEEE High Performance Extreme Computing Conference (HPEC)
- 2017

A detailed study of controlling power usage is presented, exploring how different power caps affect the performance of numerical algorithms with different computational intensities, and determining their impact on, and correlation with, the performance of scientific applications.

Towards dense linear algebra for hybrid GPU accelerated manycore systems

- Computer Science
- Parallel Comput.
- 2010

MAGMA embedded: Towards a dense linear algebra library for energy efficient extreme computing

- Computer Science
- 2015 IEEE High Performance Extreme Computing Conference (HPEC)
- 2015

The design and implementation of embedded-system-aware algorithms that target these challenges in the area of dense linear algebra is presented, using the LU, QR, and Cholesky factorizations, along with performance optimizations for both small and large problems.

Understanding the future of energy-performance trade-off via DVFS in HPC environments

- Computer Science
- J. Parallel Distributed Comput.
- 2012

Power Aware Computing on GPUs

- Computer Science
- 2012 Symposium on Application Accelerators in High Performance Computing
- 2012

The Activity-based Model for GPUs (AMG) is introduced, which identifies activity factors and power for GPU microarchitecture components, helping to analyze power tradeoffs of one component versus another using microbenchmarks.