Learn More
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often(More)
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and(More)
The virtualization gives the power of partitioning the physical host in to multiple virtual machines. We can manage the number of active host and their power consumption by migrating the virtual machines according to their resource requirement and current status on that particular host. Service level agreement is the main thing and essential one for giving(More)
  • 1