A New Open Resource Management Architecture in the Sun HPC ClusterTools™ Environment


This article presents a new architecture for the integration of the Sun HPC ClusterTools™ parallel computing environment with distributed resource management systems such as the Sun™ Grid Engine system. This new architecture achieves a tight integration with multiple distributed resource management systems in a uniform and extensible framework, which means that any of the popular management systems may be used to launch and monitor Sun™ MPI parallel jobs. Unlike previously available loose integrations, tight integrations allow a resource manager (RM) to accurately measure resources used by the parallel processes, terminate jobs that exceed resource limits, and generate accurate accounting information for multiprocess jobs. We have implemented tight integrations with Sun Grid Engine software, PBS from Veridian Systems, and LSF from Platform Computing. We provide examples showing correct resource accounting, ease of use to launch and debug Sun MPI jobs under these systems, and the improvements in behavior that result from the tight integration.


