A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers

@inproceedings{Sato2014AUI,
  title={A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers},
  author={K. Sato and K. Mohror and A. Moody and T. Gamblin and B. Supinski and N. Maruyama and S. Matsuoka},
  booktitle={2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing},
  year={2014},
  pages={21--30}
}
  • K. Sato, K. Mohror, A. Moody, T. Gamblin, B. Supinski, N. Maruyama, S. Matsuoka
  • Published 2014
  • Computer Science
  • 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
  • Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage resources positioned between the compute nodes and the parallel file system, and this new tier…
    An Ephemeral Burst-Buffer File System for Scientific Applications
    BurstFS: A Distributed Burst Buffer File System for Scientific Applications
    Scalable I/O-Aware Job Scheduling for Burst Buffer Enabled HPC Clusters
    ECHOFS: A Scheduler-Guided Temporary Filesystem to Leverage Node-Local NVMs
    MPI-IO In-Memory Storage with the Kove XPD
    Contention-Aware Resource Scheduling for Burst Buffer Systems
    How Much SSD Is Useful for Resilience in Supercomputers
