A Fast GEMM Implementation on the Cypress GPU

Naohito Nakasato • SIGMETRICS Performance Evaluation Review
We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ~2 Tflop/s and ~470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of the theoretical peak performance of the GPU, respectively. To our knowledge, our SGEMM and DGEMM kernels are currently the fastest on a single GPU chip. Furthermore, the…
Highly Cited
This paper has 35 citations.


Publications citing this paper.
Showing 1-10 of 22 extracted citations

A Trip to Tahiti: Approaching a 5 TFlop SGEMM Using 3 AMD GPUs

Symposium on Application Accelerators in High Performance Computing • 2012
Highly Influenced

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

46th International Conference on Parallel Processing (ICPP) • 2017
Highly Influenced

Prediction of SGEMM GPU Kernel Performance Using Supervised and Unsupervised Machine Learning Techniques

9th International Conference on Computing, Communication and Networking Technologies (ICCCNT) • 2018

FatMan vs. LittleBoy: Scaling Up Linear Algebraic Operations in Scale-Out Data Platforms

1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS) • 2016

Optimizing GPGPU Kernel Summation for Performance and Energy Efficiency

45th International Conference on Parallel Processing Workshops (ICPPW) • 2016

Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes

IEEE Transactions on Parallel and Distributed Systems • 2015


Publications referenced by this paper.
Showing 1-5 of 5 references

Benchmarking GPUs to tune dense linear algebra

SC - International Conference for High Performance Computing, Networking, Storage and Analysis • 2008
Highly Influenced

The MPACK (MBLAS/MLAPACK): A Multiple Precision Arithmetic Version of BLAS and LAPACK. http://mplapack.sourceforge.net

M. Nakata
Highly Influenced

Anatomy of high-performance matrix multiplication

ACM Trans. Math. Softw. • 2008
Highly Influenced
