This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA …
This paper presents an implementation of the Least Squares Monte Carlo (LSMC) method by Longstaff and Schwartz [1] to price American options on a GPU using CUDA. We focused our attention on the calibration phase and performed several experiments to assess the quality of the results. The implementation can price a put option with 200,000 paths and 50 time …
In this paper we present our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance modeling based …
Researchers have recently used the new programmable capabilities of the Graphics Processing Unit (GPU) to increase the performance of scientific codes. We investigate the use of a cluster of GPUs for large-scale CFD problems and show order-of-magnitude increases in performance and performance-to-price ratio. We implement two separate compressible flow …
The High Performance Conjugate Gradient (HPCG) benchmark has recently been proposed as a complement to the High Performance Linpack (HPL) benchmark currently used to rank supercomputers in the Top500 list. This new benchmark solves a large sparse linear system using a multigrid preconditioned conjugate gradient (PCG) algorithm. The PCG algorithm contains …
This paper presents the details of Synthetic Aperture Radar (SAR) imaging on the smallest CUDA-capable platform available, the Jetson TK1. The results indicate that GPU-accelerated embedded platforms have considerable potential for this type of workload and, in conjunction with low power consumption, light weight and standard programming tools, could open …
This paper presents the details of a CUDA implementation of the PageRank pipeline benchmark [1], a newly proposed benchmark that aims to compare and measure the capabilities of big data systems. The reference implementation is only serial at the moment, but our CUDA implementation is parallel. The results indicate that GPU-accelerated systems have considerable …