Corpus ID: 2550289

ParMAC: distributed optimisation of nested functions, with application to learning binary autoencoders

@article{CarreiraPerpin2019ParMACDO,
  title={ParMAC: distributed optimisation of nested functions, with application to learning binary autoencoders},
  author={Miguel {\'A}. Carreira-Perpi{\~n}{\'a}n and Mehdi Alizadeh},
  journal={ArXiv},
  year={2019},
  volume={abs/1605.09114}
}
Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such “nested” functions is the method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2014). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate…
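The decoupling can be made concrete with a small example. Below is a minimal, hypothetical sketch (not the authors' implementation) of MAC-style alternating optimisation for a toy binary autoencoder with linear encoder and decoder and a quadratic penalty of weight mu; the code length, data and penalty value are assumptions chosen so that the Z-step can enumerate all binary codes. The point to notice is that once the auxiliary codes Z are fixed, the encoder and decoder are trained independently, and once the weights are fixed, each code z_n is optimised independently of all others, which is what ParMAC distributes across machines.

```python
# Toy MAC sketch for a tiny binary autoencoder (illustrative assumptions only).
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 200, 8, 4             # points, input dim, code length (kept tiny)
X = rng.standard_normal((N, D))
mu = 1.0                        # quadratic-penalty weight on the constraints

Z = rng.integers(0, 2, size=(N, L)).astype(float)          # auxiliary codes
all_codes = np.array(list(itertools.product([0.0, 1.0], repeat=L)))  # 2^L codes

for it in range(10):
    # W-step: with Z fixed, decoder and encoder are independent
    # least-squares problems (one linear map each, solved in closed form).
    Zb = np.hstack([Z, np.ones((N, 1))])                    # add bias column
    W_dec = np.linalg.lstsq(Zb, X, rcond=None)[0]           # decoder: codes -> inputs
    Xb = np.hstack([X, np.ones((N, 1))])
    W_enc = np.linalg.lstsq(Xb, Z, rcond=None)[0]           # encoder: inputs -> codes

    # Z-step: with the weights fixed, each code z_n is optimised
    # independently of all others (here by enumerating all 2^L binary codes).
    dec_all = np.hstack([all_codes, np.ones((len(all_codes), 1))]) @ W_dec
    enc_out = Xb @ W_enc                                    # real-valued encoder output
    for n in range(N):
        err = ((X[n] - dec_all) ** 2).sum(1) + mu * ((all_codes - enc_out[n]) ** 2).sum(1)
        Z[n] = all_codes[np.argmin(err)]

recon = np.hstack([Z, np.ones((N, 1))]) @ W_dec
print("reconstruction MSE:", float(((X - recon) ** 2).mean()))
```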
Citations
LocoProp: Enhancing BackProp via Local Loss Optimization
A local loss construction approach for optimizing neural networks is studied and it is shown that the construction consistently improves convergence, reducing the gap between first-order and second-order methods.
Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training
This model represents activation functions as equivalent biconvex constraints and uses Lagrange multipliers to arrive at a rigorous lower bound on the traditional neural network training problem.
Improving CTC Using Stimulated Learning for Sequence Modeling
  • Jahn Heymann, K. Sim, Bo Li
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
Connectionist temporal classification (CTC) is a sequence-level loss that has been successfully applied to train recurrent neural network (RNN) models for automatic speech recognition. However, one…
Training Deep Architectures Without End-to-End Backpropagation: A Brief Survey
This tutorial paper surveys training alternatives to end-to-end backpropagation (E2EBP), the de facto standard for training deep architectures. These alternatives allow for greater modularity and transparency in deep learning workflows, aligning deep learning with mainstream computer science engineering, which heavily exploits modularization for scalability.

References

Showing 1-10 of 72 references
Distributed optimization of deeply nested systems
This work describes a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC), which replaces the original problem involving a deeply nested function with a constrained problem in an augmented space without nesting.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs
This work shows empirically that in SGD training of deep neural networks one can quantize the gradients aggressively, to a single bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback); combining this finding with AdaGrad yields data-parallel, deterministically distributed SGD.
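For context on the error-feedback idea mentioned in the summary above, here is a hedged sketch (illustrative only; the class name, the shared scale factor, and the toy gradient are assumptions, not the paper's exact scheme): the quantisation error of each step is stored and added back to the next gradient before quantising, so nothing is permanently discarded.

```python
# Illustrative 1-bit gradient quantisation with error feedback (assumptions only).
import numpy as np

class OneBitQuantizer:
    def __init__(self, shape):
        self.residual = np.zeros(shape)   # quantisation error carried forward

    def encode(self, grad):
        g = grad + self.residual          # error feedback: fold in previous error
        sign = np.where(g >= 0, 1.0, -1.0)
        scale = np.abs(g).mean()          # one shared magnitude (an assumption)
        quantised = sign * scale          # what a worker would transmit: 1 bit/value + scale
        self.residual = g - quantised     # remember what was lost this step
        return sign, scale

# Toy usage: the decoded gradient differs from the true one, but the error
# is retained and applied to the next minibatch rather than thrown away.
q = OneBitQuantizer((4,))
g1 = np.array([0.3, -0.7, 0.1, -0.2])
sign, scale = q.encode(g1)
print("decoded:", sign * scale, "carried error:", q.residual)
```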
A fast, universal algorithm to learn parametric nonlinear embeddings
Using the method of auxiliary coordinates, a training algorithm is derived that works by alternating steps that train an auxiliary embedding with steps that train the mapping; it can reuse N-body methods developed for nonlinear embeddings, yielding linear-time iterations.
Petuum: A New Platform for Distributed Machine Learning on Big Data
This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.
Learning both Weights and Connections for Efficient Neural Network
A method that reduces the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections and pruning redundant connections with a three-step method.
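As a rough illustration of the train, prune, retrain idea summarised above (the linear model, threshold rule and data below are toy assumptions, not the paper's setup):

```python
# Toy magnitude-based pruning with retraining (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
w_true = np.zeros(10); w_true[:3] = [2.0, -1.5, 0.5]     # only 3 useful inputs
y = X @ w_true + 0.01 * rng.standard_normal(500)

def train(X, y, mask, steps=500, lr=0.05):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        w *= mask                      # pruned connections stay at zero
    return w

mask = np.ones(10)
w = train(X, y, mask)                                  # step 1: train densely
mask = (np.abs(w) > 0.1).astype(float)                 # step 2: prune small weights
w = train(X, y, mask)                                  # step 3: retrain the survivors
print("kept connections:", int(mask.sum()), "of", len(mask))
```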
Distributed Coordinate Descent Method for Learning with Big Data
This paper develops and analyzes Hydra, a HYbriD cooRdinAte descent method for solving loss minimization problems with big data, and gives bounds on the number of iterations sufficient to approximately solve the problem with high probability.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
This work aims to show, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work.
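A hedged sketch of the lock-free update pattern described in the summary above, using Python threads and a shared NumPy array as a stand-in for shared-memory processors (the sparse least-squares problem is a made-up toy, and Python's GIL serializes much of the work, so this only illustrates the access pattern, not the performance claim):

```python
# Illustrative HOGWILD!-style lock-free SGD on a toy sparse problem.
import threading
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 20
A = (rng.random((N, D)) < 0.2) * rng.standard_normal((N, D))   # sparse-ish data
w_true = rng.standard_normal(D)
y = A @ w_true

w = np.zeros(D)            # shared parameter vector, updated WITHOUT any lock
lr = 0.01

def worker(seed, steps=2000):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(N)
        grad = (A[i] @ w - y[i]) * A[i]       # gradient of one squared-error term
        nz = np.nonzero(A[i])[0]
        w[nz] -= lr * grad[nz]                # touch only the nonzero coordinates

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("parameter error:", float(np.linalg.norm(w - w_true)))
```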
Asynchronous stochastic gradient descent for DNN training
This paper describes an effective approach to approximate backpropagation (BP) training with asynchronous stochastic gradient descent (ASGD), used to parallelize computation across multiple GPUs; it achieves a 3.2 times speed-up on 4 GPUs over a single GPU, without any loss of recognition performance.