NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA

@article{Nukada2011NVCRAT,
  title={NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA},
  author={Akira Nukada and Hiroyuki Takizawa and Satoshi Matsuoka},
  journal={2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum},
  year={2011},
  pages={104--113}
}
Today, CUDA is the de facto standard programming framework to exploit the computational power of graphics processing units (GPUs) to accelerate various kinds of applications. For efficient use of a large GPU-accelerated system, one important mechanism is checkpoint-restart that can be used not only to improve fault tolerance but also to optimize node/slot allocation by suspending a job on one node and migrating the job to another node. Although several checkpoint-restart implementations have…


