A 1 PB/s file system to checkpoint three million MPI tasks

@inproceedings{Rajachandrasekar2013A1P,
  title={A 1 PB/s file system to checkpoint three million MPI tasks},
  author={Raghunath Rajachandrasekar and Adam Moody and Kathryn Mohror and Dhabaleswar K. Panda},
  booktitle={HPDC '13},
  year={2013}
}
  • Raghunath Rajachandrasekar, Adam Moody, Kathryn Mohror, Dhabaleswar K. Panda
  • Published in HPDC '13, 2013
  • Computer Science
  • With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like the Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to…


    Citations

    Publications citing this paper.
    SHOWING 1-10 OF 41 CITATIONS

    Blue Gene/Q defragmentation for energy waste minimisation

    CITES METHODS
    HIGHLY INFLUENCED

    Ad Hoc File Systems for High-Performance Computing

    CITES BACKGROUND

    Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales

    CITES BACKGROUND & RESULTS
    HIGHLY INFLUENCED

    Gfarm/BB — Gfarm File System for Node-Local Burst Buffer

    CITES METHODS

    A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers

    CITES BACKGROUND

    Optimizing the SSD Burst Buffer by Traffic Detection

    CITES METHODS

    TAZeR: Hiding the Cost of Remote I/O in Distributed Scientific Workflows

    CITES BACKGROUND

    VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale

    CITES BACKGROUND