swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight

  • Liandeng Li, Jiarui Fang, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, Guangwen Yang
  • Published 1 September 2018
  • Computer Science
  • 2018 IEEE International Conference on Cluster Computing (CLUSTER)
This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, one of the fastest supercomputers in the world, which adopts a unique heterogeneous many-core architecture. First, we point out some insightful principles to fully exploit the performance of the innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on…
swFLOW: A Dataflow Deep Learning Framework on Sunway TaihuLight Supercomputer
  • Han Lin, Zeng Lin, J. M. Diaz, Mingfan Li, Hong An, G. Gao
  • Computer Science
  • 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2019
Deep learning technology is widely used in many modern fields, and a number of deep learning models and software frameworks have been proposed. However, it is still very difficult to process deep…
Optimization of Parallel Stochastic Gradient Descent on Sunway TaihuLight
It is well known that deep neural networks (DNNs) have shown outstanding expressive learning ability in machine learning and other tasks. The stochastic gradient descent (SGD) algorithm is widely used to…
swATOP: Automatically Optimizing Deep Learning Operators on SW26010 Many-Core Processor
An end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization; it brings significant performance improvement on DL operators in over 88% of cases, compared with the best handcrafted optimization.
Efficient Processing of Convolutional Neural Networks on SW26010
A Weight-Stationary convolutional neural network optimization method for the SW26010 processor that achieves double-precision convolution performance of over 2.4 Tflops, 81% of the processor's peak performance.
swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture
This work is the first attempt from the compiler perspective to bridge the gap between deep learning and high-performance architectures, with productivity and efficiency in mind; it proposes swTVM, which extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation, such as Sunway.
swGBDT: Efficient Gradient Boosted Decision Tree on Sunway Many-Core Processor
Gradient Boosted Decision Trees (GBDT) is a practical machine learning method that has been widely used in various application fields, such as recommendation systems. Optimizing the performance of…
Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system
  • Jia Wei, Xingjun Zhang, Zeyu Ji, Jingbo Li, Zheng Wei
  • Medicine
  • Scientific reports
  • 2021
This work implements and extends LeNet, AlexNet, VGG, and ResNet model training for single MT-2000+ and FT-2000+ compute nodes, as well as extended multi-node clusters, and proposes an improved gradient synchronization process for the Dynamic Allreduce communication optimization strategy.
Distributed deep learning system for cancerous region detection on Sunway TaihuLight
To explore the potential of distributed training of deep neural networks, several distributed algorithms based on swFlow are implemented on the world-leading supercomputer Sunway TaihuLight, revealing the great opportunity in combining deep learning with HPC systems.
SWMapper: Scalable Read Mapper on SunWay TaihuLight
A vectorized version of the banded Myers algorithm for pairwise alignment with 256-bit vector registers is presented to fully exploit the computational power of the SW26010 processor, and SWMapper, a scalable and efficient read mapper for the Sunway TaihuLight supercomputer, is presented.
Accelerating Sparse Cholesky Factorization on Sunway Manycore Architecture
This article proposes swCholesky, a highly optimized implementation of sparse Cholesky factorization on the Sunway processor, and designs three kernel task queues and a dense matrix library to dynamically adapt to the kernel characteristics and architecture features.


swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight
To explore the potential of training complex deep neural networks (DNNs) on commercial chips other than GPUs, we report our work on swDNN, which is a highly efficient library for accelerating…
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters
S-Caffe, a scalable and distributed Caffe adaptation for modern multi-GPU clusters, is proposed: a co-design of the Caffe framework and the MVAPICH2-GDR MPI runtime that scales up to 160 GPUs.
Deep learning with COTS HPC systems
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with InfiniBand interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
cuDNN: Efficient Primitives for Deep Learning
A library similar in intent to BLAS, with optimized routines for deep learning workloads; the implementation contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
The API design and the system implementation of MXNet are described, and it is explained how the embedding of both symbolic expressions and tensor operations is handled in a unified fashion.
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
The Sunway TaihuLight supercomputer: system and applications
Preliminary efforts on developing and optimizing applications on the TaihuLight system are reported, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.
CNTK: Microsoft's Open-Source Deep-Learning Toolkit
This tutorial will introduce the Computational Network Toolkit, or CNTK, Microsoft's cutting-edge open-source deep-learning toolkit for Windows and Linux, and show what typical uses look like for relevant tasks such as image recognition, sequence-to-sequence modeling, and speech recognition.