Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

@article{Awan2018OptimizedBF,
  title={Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?},
  author={Ammar Ahmad Awan and Ching-Hsiang Chu and Hari Subramoni and Dhabaleswar K. Panda},
  journal={CoRR},
  year={2018},
  volume={abs/1707.09414}
}
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This, coupled with the new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK, poses additional design constraints due to the very large GPU-buffer messages communicated during the training phase. In this context, special…