Towards Scalable Application Checkpointing with Parallel File System Delegation

@article{Arteaga2011TowardsSA,
  title={Towards Scalable Application Checkpointing with Parallel File System Delegation},
  author={Dulcardo Arteaga and Ming Zhao},
  journal={2011 IEEE Sixth International Conference on Networking, Architecture, and Storage},
  year={2011},
  pages={130--139}
}
The ever-increasing scale of modern high-performance computing (HPC) systems presents a variety of challenges to the parallel file system (PFS) based storage in these systems. The scalability of application checkpointing is a particularly important challenge because it is critical to the reliability of computing and it often dominates the I/Os in an HPC system. When a large number of parallel processes simultaneously perform checkpointing, the PFS metadata servers can become a serious…

Key Quantitative Results

  • Experiments with up to 128 parallel processes show that the PFS-delegation based checkpointing is significantly faster than the traditional shared-file and file-per-process based checkpointing methods (7% and 10% speedup when the underlying PVFS2 uses a centralized metadata server; 22% and 31% speedup when using distributed metadata servers).

Citations

Publications citing this paper.

High-Performance Serverless Data Transfer over Wide-Area Networks

  • 2015 IEEE International Parallel and Distributed Processing Symposium Workshop
  • 2015
