Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed
TLDR
This paper presents Grid'5000, a 5000-CPU nation-wide infrastructure for research in Grid computing, designed to provide a scientific tool for computer scientists similar to the large-scale instruments used by physicists, astronomers, and biologists.
XtremWeb: a generic global computing system
TLDR
The paper presents the design of XtremWeb. Two essential features of this design are multi-application support and high performance, which are ensured by scalability, fault tolerance, efficient scheduling and a large base of volunteer PCs.
Grid'5000: a large scale and highly reconfigurable grid experimental testbed
TLDR
The motivations, design, architecture, and configuration examples of Grid'5000, a 5000-CPU nation-wide infrastructure for research in Grid computing, are described, and performance results for the reconfiguration subsystem are presented.
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
TLDR
This work presents MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging, and gives a detailed performance evaluation of every component and of its global performance for non-trivial parallel applications.
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
TLDR
Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while dramatically reducing the number of reliable nodes compared to MPICH-V1.
Toward Exascale Resilience
TLDR
This white paper synthesizes the motivations, observations and research issues considered decisive by several HPC experts with complementary backgrounds in applications, programming models, distributed systems and system management.
FTI: High performance Fault Tolerance Interface for hybrid systems
TLDR
This work proposes a low-overhead, high-frequency multi-level checkpoint technique in which a highly reliable, topology-aware Reed-Solomon encoding is integrated into a three-level checkpoint scheme in the Fault Tolerance Interface (FTI).
Fast Error-Bounded Lossy HPC Data Compression with SZ
TLDR
This paper proposes a novel HPC data compression method that works very effectively on large-scale HPC data sets, evaluates it using 13 real-world HPC applications across different scientific domains, and compares it to many other state-of-the-art compression methods.
The International Exascale Software Project roadmap
TLDR
The work of the community to prepare for the challenges of exascale computing is described, ultimately combining their efforts in a coordinated International Exascale Software Project.