JUWELS Booster - A Supercomputer for Large-Scale AI Research

Authors: Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, M. Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, Gabriele Cavallaro, Rocco Sedona, Alexander Schug, Alexandre Otto Strube, Roshni Kamath, Martin G. Schultz, Morris Riedel, Thomas Lippert

In this article, we present JUWELS Booster, a recently commissioned high-performance computing system at the Jülich Supercomputing Centre. Its system architecture, most importantly its large number of powerful Graphics Processing Units (GPUs) and its fast InfiniBand interconnect, makes it an ideal machine for large-scale Artificial Intelligence (AI) research and applications. We detail its system architecture, parallel, distributed model training, and benchmarks indicating its outstanding…

JUWELS: Modular Tier-0/1 Supercomputer at Jülich Supercomputing Centre
JUWELS is a multi-petaflop modular supercomputer operated by Jülich Supercomputing Centre at Forschungszentrum Jülich as a European and national supercomputing resource for the Gauss Centre for
Exascale Deep Learning for Scientific Inverse Problems
We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient
Exascale Deep Learning for Climate Analytics
Improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems are described.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling the training of visual recognition models on internet-scale data with high efficiency.
Horovod: fast and easy distributed deep learning in TensorFlow
Horovod is an open source library that addresses two obstacles to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
Measuring the Effects of Data Parallelism on Neural Network Training
This work experimentally characterizes the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error, studies how this relationship varies with the training algorithm, model, and data set, and finds extremely large variation between workloads.
8-Bit Approximations for Parallelism in Deep Learning
8-bit approximation is an efficient method to parallelize convolutional networks on very large systems of GPUs and achieves state-of-the-art speedups for model parallelism.
cuDNN: Efficient Primitives for Deep Learning
A library similar in intent to BLAS, with optimized routines for deep learning workloads; it contains routines for GPUs and, like BLAS, could be implemented for other platforms.
ZeRO-Offload: Democratizing Billion-Scale Model Training
ZeRO-Offload democratizes large-scale model training, making it accessible even to data scientists with access to just a single GPU, and combines compute and memory efficiency with ease of use.
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
A new low-rank gradient compressor based on power iteration that can compress gradients rapidly, efficiently aggregate the compressed gradients using all-reduce, and achieve test performance on par with SGD is proposed.