Learn More
SUMMARY With the evolution of high-performance computing towards heterogeneous, massively parallel systems, parallel applications have developed new checkpoint and restart necessities. Whether due to a failure in the execution or to a migration of the application processes to different machines, checkpointing tools must be able to operate in heterogeneous(More)
(NCSER) supports a comprehensive research program to promote the highest quality and rigor in research on special education and related services, and to address the full range of issues facing children with disabilities, parents of children with disabilities, school personnel, and others. We strive to make our products available in a variety of formats and(More)
A 1 trillion node internet of things (IoT) will require sensing platforms that support numerous applications using power harvesting to avoid the cost and scalability challenge of battery replacement in such large numbers. Previous SoCs achieve good integration and even energy harvesting [1][2][3], but they limit supported applications, need higher(More)
SUMMARY This paper presents CPPC (Controller/Pre-compiler for Portable Checkpointing), a checkpointing tool designed for heterogeneous clusters and Grid infrastructures through the use of portable protocols, portable checkpoint files and portable code. It works at variable level being user-directed, thus generating small checkpoint files. It allows parallel(More)
Checkpointing tools may be typically implemented at two different abstraction levels: at the system level or at the application level. The latter has become a more popular alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert(More)
The execution times of large-scale parallel applications on nowadays multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to(More)
The running times of large-scale computational science and engineering parallel applications, executed on clusters or grid platforms, are usually longer than the mean-time-between-failures (MTBF). Hardware failures must be tolerated by the parallel applications to ensure that no all computation done is lost on machine failures. Checkpointing and rollback(More)