Supada Laosooksathit

  • Citations Per Year
Learn More
Due to the fact that the reliability and availability of a large scaled system inverse to the number of computing elements, fault tolerance has become a major concern in high performance computing (HPC) including a very large system with GPGPU. In this paper, we propose a checkpoint/restart mechanism model which employs two-phase protocol and a latency(More)
Given that the reliability of a very large-scaled system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing including the most recent deployments with graphic processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to(More)
  • 1