Evaluating On-Node GPU Interconnects for Deep Learning Workloads

  Nathan R. Tallent, Nitin Gawande, Charles Martin Siegel, Abhinav Vishnu, Adolfy Hoisie
Scaling deep learning workloads across multiple GPUs on a single node has become increasingly important in data analytics. A key question is how well a PCIe-based GPU interconnect can perform relative to a custom high-performance interconnect such as NVIDIA’s NVLink. This paper evaluates two such on-node interconnects for eight NVIDIA Pascal P100 GPUs: (a) the NVIDIA DGX-1’s NVLink 1.0 ‘hybrid cube mesh’; and (b) the Cirrascale GX8’s two-level PCIe tree using dual SR3615 switch risers. To show… 

Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite

Evaluation results show that, unless the current CPU-GPU master-slave programming model is replaced, it is difficult for scale-up multi-GPU applications to truly benefit from faster intra-node interconnects such as NVLink. For inter-node scale-out applications, although the interconnect is more crucial to overall performance, GPUDirect-RDMA is not always the optimal choice.

Evaluation of On-Node GPU Interconnects for Training Deep Neural Networks

This thesis evaluates the performance of different on-node GPU interconnects: PCIe and NVLink for basic operations involved in training deep neural networks.

Profiling DNN Workloads on a Volta-based DGX-1 System

This work profiles and analyzes the training of five popular DNNs using 1, 2, 4, and 8 GPUs, and shows the breakdown of training time across the forward/backward propagation (FP+BP) stage and the weight-update (WU) stage to provide insights into the limiting factors of the training algorithm and to identify bottlenecks in the multi-GPU system architecture.

Performance Analysis of Deep Learning Workloads on Leading-edge Systems

  • Yihui Ren, Shinjae Yoo, A. Hoisie
  • 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019
This work examines the performance of leading-edge systems designed for machine learning computing, including the NVIDIA DGX-2, Amazon Web Services (AWS) P3, IBM Power System Accelerated Compute

Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects

An in-depth analysis of NVLink 2.0 is performed, and it is shown how to scale a no-partitioning hash join beyond the limits of GPU memory, achieving speed-ups of up to 18x over PCIe 3.0 and up to 7.3x over an optimized CPU implementation.
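The out-of-core pattern this entry describes can be sketched on the host: build one shared hash table over the build relation, then stream the probe relation through in fixed-size chunks, mimicking an accelerator with limited memory pulling probe data over a fast interconnect rather than partitioning both inputs first. This is a minimal illustrative sketch, not the paper's GPU implementation; the function name and chunking parameter are assumptions.

```python
def no_partitioning_hash_join(build_rows, probe_rows, chunk_size=1024):
    """Sketch of a no-partitioning hash join with a chunked probe phase.

    build_rows / probe_rows: lists of (key, payload) tuples.
    A single hash table is built over the build side; the probe side is
    then streamed in fixed-size chunks, standing in for data pulled over
    a fast interconnect into limited accelerator memory.
    """
    table = {}
    for key, payload in build_rows:  # build phase: one shared table
        table.setdefault(key, []).append(payload)
    matches = []
    # Chunked probe phase: each chunk models one transfer that fits memory.
    for start in range(0, len(probe_rows), chunk_size):
        for key, payload in probe_rows[start:start + chunk_size]:
            for build_payload in table.get(key, []):
                matches.append((key, build_payload, payload))
    return matches

result = no_partitioning_hash_join([(1, "a"), (2, "b")],
                                   [(1, "x"), (3, "y"), (2, "z")])
```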

Quantifying the NUMA Behavior of Partitioned GPGPU Applications

A framework for analyzing the internal communication behavior of GPGPU applications is introduced, consisting of an open-source memory-tracing plugin for Clang/LLVM and a simple communication model based on summaries of a kernel's memory accesses; the model allows reasoning about virtual bandwidth-limited communication paths between NUMA nodes under different partitioning strategies.
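The kind of model this entry describes can be sketched very simply: given a summary of which thread blocks touch which addresses, and a partitioning of blocks onto NUMA nodes, any cache line touched from more than one node contributes traffic to the path between those nodes. This is a minimal illustrative sketch under assumed names and a 64-byte line size, not the paper's actual model.

```python
from collections import defaultdict

def inter_partition_traffic(accesses, partition_of, line_size=64):
    """Estimate inter-node communication from a memory-access summary.

    accesses: list of (thread_block, address) pairs from a kernel trace.
    partition_of: maps a thread block to the NUMA node it is assigned to.
    A cache line touched by blocks on more than one node counts as one
    line of traffic per pair of nodes that share it.
    """
    nodes_touching = defaultdict(set)
    for block, addr in accesses:
        nodes_touching[addr // line_size].add(partition_of[block])
    traffic = defaultdict(int)
    for nodes in nodes_touching.values():
        ns = sorted(nodes)
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                traffic[(ns[i], ns[j])] += line_size
    return dict(traffic)

# Blocks 0 and 1 share cache line 0 but sit on different nodes.
traffic = inter_partition_traffic([(0, 0), (1, 8), (2, 200)],
                                  {0: 0, 1: 1, 2: 1})
```

Changing `partition_of` lets one compare partitioning strategies by the traffic they induce, which is the core idea of the paper's model.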

Performance Evaluation of MPI Libraries on GPU-Enabled OpenPOWER Architectures: Early Experiences

A detailed performance evaluation and analysis of point-to-point communication using various GPU-aware MPI libraries, including SpectrumMPI, OpenMPI+UCX, and MVAPICH2-GDR, on OpenPOWER GPU-enabled systems is presented, to determine which MPI library can provide the highest performance enhancement.

Efficient MPI-AllReduce for large-scale deep learning on GPU-clusters

Two hierarchical distributed-memory multi-leader AllReduce algorithms optimized for GPU-accelerated clusters are proposed, in which GPUs inside a computing node perform an intra-node communication phase to gather and store locally reduced values on designated GPUs (known as node leaders).

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Two hierarchical distributed-memory multi-leader allreduce algorithms optimized for GPU-accelerated clusters, named lr_lr and lr_rab, are proposed; they can cut down the execution time of an allreduce microbenchmark that uses the logical ring (lr) algorithm.
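The two-level scheme described in these two entries (intra-node reduction to a leader GPU, inter-node allreduce among leaders, then intra-node broadcast) can be sketched as a host-side simulation. This is a minimal illustrative sketch using NumPy arrays in place of GPU buffers; the function name and data layout are assumptions, not the papers' API.

```python
import numpy as np

def hierarchical_allreduce(node_buffers):
    """Simulate a two-level multi-leader allreduce in host memory.

    node_buffers[n][g] is the gradient array held by GPU g of node n.
    Phase 1: GPUs within a node reduce onto a designated leader.
    Phase 2: node leaders perform an inter-node allreduce (here, a sum).
    Phase 3: each leader broadcasts the global result inside its node.
    """
    # Phase 1: intra-node reduction to each node's leader
    leader_vals = [np.sum(np.stack(gpus), axis=0) for gpus in node_buffers]
    # Phase 2: inter-node allreduce across the leaders
    global_sum = np.sum(np.stack(leader_vals), axis=0)
    # Phase 3: intra-node broadcast of the result to every GPU
    return [[global_sum.copy() for _ in gpus] for gpus in node_buffers]

# Two nodes with two GPUs each, gradients 1..4 -> every GPU ends with 10
result = hierarchical_allreduce([[np.array([1.0]), np.array([2.0])],
                                 [np.array([3.0]), np.array([4.0])]])
```

The point of the hierarchy is that only the leaders touch the (slower) inter-node network, while the bulk of the reduction traffic stays on fast intra-node links.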

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects

Comm|Scope is presented, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios and can serve to update insights about the relative performance of data transfer methods on current systems.

Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

This paper proposes a pipelined chain (ring) design for the MPI_Bcast collective operation, along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication.
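The pipelined chain idea this entry refers to can be sketched in host code: the broadcast message is split into chunks, and each rank forwards a chunk to its successor as soon as it arrives, so transfers overlap along the chain. This is a minimal sequential simulation, not the MVAPICH2-GDR design itself; the function name and chunk count are assumptions.

```python
def ring_broadcast(buffers, root=0, chunks=4):
    """Simulate a pipelined chain (ring) broadcast.

    buffers[r] is rank r's bytearray; only the root's contents are valid
    on entry. The message is split into `chunks` pieces; each rank passes
    chunk i to its successor as soon as it holds it, so in a real runtime
    the per-chunk transfers overlap along the chain.
    """
    n = len(buffers)
    size = len(buffers[root])
    bounds = [(i * size // chunks, (i + 1) * size // chunks)
              for i in range(chunks)]
    # Walk the chain; the sends below would proceed in parallel in MPI.
    for step in range(n - 1):
        src = (root + step) % n
        dst = (root + step + 1) % n
        for lo, hi in bounds:
            buffers[dst][lo:hi] = buffers[src][lo:hi]
    return buffers

# Rank 0 holds the message; three other ranks start with empty buffers.
buffers = [bytearray(b"gradient")] + [bytearray(8) for _ in range(3)]
out = ring_broadcast(buffers)
```

With chunking, the last rank starts receiving after one chunk latency instead of waiting for the whole message, which is where the pipeline's benefit for large deep learning broadcasts comes from.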

Evaluation of Deep Learning Frameworks Over Different HPC Architectures

This work investigates the performance characteristics of NVIDIA's state-of-the-art interconnect technology, NVLink, and of Intel's Knights Landing, Intel's most advanced product for deep learning at the time, with respect to training time and utilization, and provides an analysis of the frameworks' performance over different hardware environments in terms of speed and scaling.

Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning

This paper investigates the performance bottlenecks in existing CUDA-Aware MPI runtimes like MVAPICH2-GDR, and proposes hierarchical collective designs to improve communication latency of the MPI_Bcast primitive by exploiting a new communication library called NCCL.

Fathom: reference workloads for modern deep learning methods

This paper assembles Fathom: a collection of eight archetypal deep learning workloads, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.

Performance Analysis of the Multi-GPU System with ExpEther

A novel multi-GPU system is presented based on ExpEther, a virtualization technique that extends a host CPU's PCIe bus over Ethernet; evaluation revealed that the proposed system with four GPUs achieved 3.88x and 3.29x performance improvements, respectively, compared with a single-GPU system.

Knights Landing: Second-Generation Intel Xeon Phi Product

This article describes the architecture of Knights Landing, the second-generation Intel Xeon Phi product family, which targets high-performance computing and other highly parallel workloads.

Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations

It is demonstrated that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.

Ultra-Performance Pascal GPU and NVLink Interconnect

This article introduces NVIDIA's high-performance Pascal GPU. GP100 features in-package high-bandwidth memory, support for efficient FP16 operations, unified memory, and instruction preemption.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

Topology-aware image compositing using NVLink

A variety of algorithms have been explored to perform “sort-last” rendering efficiently in order to achieve interactive rendering on massive systems.