The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators.
Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. We touch upon such issues as balanced cluster…
We present an inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters and the software stack for integrating the power usage data streamed in real-time by the power monitoring hardware with the cluster management software tools. We introduce a measure for quantifying the overall improvement…
NVIDIA GPUs are becoming increasingly popular in scientific computation as a way to accelerate the execution of computationally demanding codes. The graphics memory used in GPUs is not protected against soft errors that may be caused by cosmic radiation and thus is a source of concern for the scientific computing community. In this short paper we report on…
We present results of porting an important kernel of a production molecular dynamics simulation program, NAMD, to the Cell/B.E. processor. The non-bonded force-field kernel, as implemented in the NAMD SPEC 2006 CPU benchmark, has been ported. Both single-precision and double-precision floating-point kernel variations are considered, and performance…
The Cell Broadband Engine is a heterogeneous chip multiprocessor that combines a PowerPC processor core with eight single-instruction multiple-data accelerator cores and delivers high performance on many computationally intensive codes.
We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision floating-point GPU implementation of…
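The time-dimension decomposition mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `split_time_slices`, the near-even split, and the half-open index ranges are all assumptions made for illustration, and the actual implementation may divide the lattice differently.

```python
def split_time_slices(nt, n_nodes):
    """Divide nt time slices among n_nodes nodes as evenly as possible.

    Hypothetical helper for illustration only. Returns a list of
    half-open (start, stop) ranges, one per node, covering 0..nt.
    """
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    if nt < n_nodes:
        raise ValueError("need at least one time slice per node")

    base, extra = divmod(nt, n_nodes)
    ranges = []
    start = 0
    for rank in range(n_nodes):
        # The first `extra` nodes take one additional slice each.
        length = base + (1 if rank < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges
```

For example, `split_time_slices(64, 4)` assigns 16 consecutive time slices to each of four nodes. In a real lattice code each node would also exchange boundary time slices with its neighbors, which this sketch leaves out.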
We present results of the implementation of one MILC lattice QCD application, a simulation with dynamical clover fermions using the hybrid molecular dynamics R algorithm, on the Cell Broadband Engine processor. Fifty-four individual computational kernels responsible for 98.8% of the overall execution time were ported to the Cell's Synergistic Processing…
We present a CUDA C implementation of the Conjugate Gradient (CG) and multi-mass CG solver from the MILC quantum chromodynamics package to speed up improved staggered quarks computations on NVIDIA GPUs. The implementation is built on the QUDA package from Boston University.
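For readers unfamiliar with the Conjugate Gradient method named in the abstract above, the following is a minimal textbook sketch in plain Python. It is not the MILC or QUDA implementation: it runs on the CPU, solves a single small dense symmetric positive-definite system, and does not cover the multi-mass variant. The names `conjugate_gradient`, `dot`, and `matvec` are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A (textbook CG)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0 initially
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break            # residual norm is small enough
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Calling `conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` returns a vector close to `[1/11, 7/11]`. In a lattice QCD solver the dense `matvec` would be replaced by the application of the sparse fermion operator, which is where GPU acceleration pays off.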