Learn More
Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. We touch upon such issues as balanced cluster(More)
We present an inexpensive hardware system for monitoring power usage of individual CPU hosts and externally attached GPUs in HPC clusters and the software stack for integrating the power usage data streamed in real-time by the power monitoring hardware with the cluster management software tools. We introduce a measure for quantifying the overall improvement(More)
We present results of porting an important kernel of a production molecular dynamics simulation program, NAMD, to the Cell/B.E. processor. The non-bonded force-field kernel, as implemented in the NAMD SPEC 2006 CPU benchmark, has been implemented. Both single-precision and double-precision floating-point kernel variations are considered, and performance(More)
We report the results of the bottom-up implementation of one MILC lattice quantum chromodynamics (QCD) application on the Cell Broadband Engine™ processor. In our implementation, we preserve MILC’s framework for scaling the application to run on a large number of compute nodes and accelerate computationally intensive kernels on the Cell’s synergistic(More)
We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision floating-point GPU implementation of(More)
We present results of the implementation of one MILC lattice QCD application—simulation with dynamical clover fermions using the hybrid-molecular dynamics R algorithm—on the Cell Broadband Engine processor. Fifty-four individual computational kernels responsible for 98.8% of the overall execution time were ported to the Cell’s Synergistic Processing(More)