• Corpus ID: 67789934

Gradient Scheduling with Global Momentum for Non-IID Data Distributed Asynchronous Training

Chengjie Li, Ruixuan Li, Pan Zhou, Haozhao Wang, Yuhua Li, Song Guo, Keqin Li
Distributed asynchronous offline training has received widespread attention in recent years because of its high performance on large-scale data and complex models. As data are processed from cloud-centric positions to edge locations, a big challenge for distributed systems is how to handle native and natural non-independent and identically distributed (non-IID) data for training. Previous asynchronous training methods do not have a satisfying performance on non-IID data because it would result… 
Local Gradient Aggregation for Decentralized Learning from Non-IID data
This work proposes Local Gradient Aggregation (LGA), a decentralized learning algorithm in which each agent collects gradient information from its neighboring agents and updates its model with a projected gradient, and demonstrates the efficacy of LGA on non-IID data distributions on benchmark datasets.
Towards Efficient and Stable K-Asynchronous Federated Learning with Unbounded Stale Gradients on Non-IID Data
This paper proposes two-stage weighted K-asynchronous FL with an adaptive learning rate (WKAFL), which utilizes stale gradients and mitigates the impact of non-IID data, and which can improve training speed, prediction accuracy, and training stability.
Two-Dimensional Learning Rate Decay: Towards Accurate Federated Learning with Non-IID Data
Two-Dimensional Learning Rate Decay (2D-LRD) is proposed, which aims to improve model performance by adaptively tuning the learning rate along two dimensions during training: the round dimension and the iteration dimension.
Semisupervised Distributed Learning With Non-IID Data for AIoT Service Platform
An edge learning system based on semisupervised learning and federated learning that, by fully utilizing unlabeled data, can achieve up to 5.9% higher object detection accuracy in video analysis applications than when only labeled data are used.
A Unified Federated Learning Framework for Wireless Communications: towards Privacy, Efficiency, and Security
A two-step federated learning framework, robust federated augmentation and distillation (RFA-RFD), is proposed to enable privacy-preserving, communication-efficient, and Byzantine-tolerant on-device machine learning in wireless communications.
Aggregation Delayed Federated Learning
This work proposes a new aggregation framework for federated learning by introducing redistribution rounds that delay the aggregation and shows that the proposed framework significantly improves the performance on non-IID data.
Cross-Gradient Aggregation for Decentralized Learning from Non-IID data
This work proposes Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm where each agent aggregates cross-gradient information and updates its model using a projected gradient based on quadratic programming (QP), and theoretically analyzes the convergence characteristics of CGA.


Asynchronous Distributed Semi-Stochastic Gradient Optimization
This paper proposes a fast distributed asynchronous SGD-based algorithm with variance reduction that outperforms state-of-the-art distributed asynchronous algorithms in terms of both wall clock time and solution quality.
Petuum: A Framework for Iterative-Convergent Distributed ML
This architecture specifically exploits the fact that many ML programs are fundamentally loss function minimization problems, and that their iterative-convergent nature presents many unique opportunities to minimize loss, such as via dynamic variable scheduling and error-bounded consistency models for synchronization.
Federated Optimization: Distributed Optimization Beyond the Datacenter
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are distributed (unevenly) over an extremely large
Asynchronous Stochastic Gradient Descent with Delay Compensation
The proposed algorithm is evaluated on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.
Deep learning with Elastic Averaging SGD
Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.
Federated Optimization: Distributed Machine Learning for On-Device Intelligence
We introduce a new and increasingly relevant setting for distributed optimization in machine learning, where the data defining the optimization are unevenly distributed over an extremely large number
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
Slow and Stale Gradients Can Win the Race
This work presents a novel theoretical characterization of the speed-up offered by asynchronous SGD methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time).
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Scalable distributed DNN training using commodity GPU cloud computing
  • N. Strom
  • Computer Science
  • 2015
It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and enables efficient scaling to more parallel GPU nodes than any other method that the author is aware of.