Corpus ID: 232290664

Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More

@article{Daghaghi2021AcceleratingSD,
  title={Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More},
  author={Shabnam Daghaghi and Nicholas Meisburger and Mengnan Zhao and Yong Wu and Sameh Gobriel and Charlie Tai and Anshumali Shrivastava},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.10891}
}
Deep learning implementations on CPUs (Central Processing Units) are gaining traction. Enhanced AI capabilities on commodity x86 architectures are commercially appealing because they reuse existing hardware and virtualize easily. A notable work in this direction is the SLIDE system. SLIDE is a C++ implementation of sparse, hash-table-based back-propagation, which was shown to be significantly faster than GPUs at training neural models with hundreds of millions of parameters. In this paper, we…
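The core idea behind SLIDE-style training is to use locality-sensitive hashing (LSH) to select, per input, a small set of likely-high-activation neurons and compute only those. Below is a minimal Python/NumPy sketch of that selection step, assuming signed random projections (SimHash) and a single hash table; the names and the bucketing scheme are illustrative, not SLIDE's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N, BITS = 64, 4096, 12                 # input dim, layer width, hash bits
W = rng.standard_normal((N, D))           # neuron weight vectors (one per row)
b = np.zeros(N)
planes = rng.standard_normal((BITS, D))   # SimHash projection hyperplanes

def simhash(v: np.ndarray) -> int:
    """Signed-random-projection hash: one bit per hyperplane."""
    bits = (planes @ v > 0).astype(np.uint32)
    return int(bits @ (1 << np.arange(BITS, dtype=np.uint32)))

# Preprocessing: bucket every neuron by the hash of its weight vector.
buckets: dict[int, list[int]] = {}
for n in range(N):
    buckets.setdefault(simhash(W[n]), []).append(n)

def sparse_forward(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Compute activations only for neurons whose hash collides with x."""
    active = np.array(buckets.get(simhash(x), []), dtype=np.int64)
    if active.size == 0:                  # empty bucket: fall back to a sample
        active = rng.choice(N, size=16, replace=False)
    acts = np.maximum(W[active] @ x + b[active], 0.0)   # ReLU on active set only
    return active, acts

active, acts = sparse_forward(rng.standard_normal(D))
print(f"{active.size}/{N} neurons touched")
```

SLIDE proper maintains several hash tables, rebuilds them periodically as the weights drift, and back-propagates only through the active set; the optimizations this paper's title names (vectorization, quantization, memory layout) accelerate exactly this sparse kernel.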

Citations

TOD: Tensor-Based Outlier Detection, a General GPU-Accelerated Framework
TLDR
This work proposes TOD, a novel system that abstracts outlier detection (OD) algorithms into basic tensor operations for efficient GPU acceleration, and introduces automatic batching, which decomposes OD computations into small batches that can be executed on multiple GPUs in parallel.
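The batching idea is easy to illustrate: rather than materializing one huge intermediate tensor, the computation is split into independent slices that fit in device memory, so a scheduler can run them in parallel across GPUs. A hedged NumPy sketch of batched pairwise distances, a core kernel of many OD algorithms, follows; the function name and chunk size are illustrative assumptions, not TOD's API.

```python
import numpy as np

def knn_distances_batched(X: np.ndarray, k: int, batch: int = 1024) -> np.ndarray:
    """k-th-nearest-neighbor distance per point, computed in bounded batches.

    A full (n, n) distance matrix may not fit in accelerator memory; each
    batch of rows needs only a (batch, n) slice. Batches are independent,
    so a multi-GPU system can schedule them in parallel.
    """
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)                       # precomputed squared norms
    out = np.empty(n)
    for start in range(0, n, batch):
        rows = X[start:start + batch]
        # Squared Euclidean distances for this slice only.
        d2 = sq[start:start + batch, None] - 2.0 * rows @ X.T + sq[None, :]
        d2 = np.maximum(d2, 0.0)                    # clamp tiny negatives
        # k-th smallest excluding self (self-distance 0 occupies index 0).
        out[start:start + batch] = np.sqrt(np.partition(d2, k, axis=1)[:, k])
    return out

scores = knn_distances_batched(
    np.random.default_rng(1).standard_normal((5000, 16)), k=5)
```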
Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
TLDR
This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth, and demonstrates model-parallel training several orders of magnitude faster than Horovod, the main engine behind most commercial software.
Does Preprocessing Help Training Over-parameterized Neural Networks?
TLDR
A combination of tools from several fields is proposed: greedy-type convergence analysis from optimization, sparsity observations from practice, high-dimensional geometric search from data structures, and concentration and anti-concentration bounds from probability.
Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures
TLDR
This work provides a formal framework for combining locality-sensitive-hashing-type approximate MaxIP data structures with CGM algorithms, and obtains the first variants whose cost per iteration is sublinear in the number of parameters for many fundamental optimization algorithms, e.g., Frank-Wolfe, the Herding algorithm, and policy gradient.
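The connection is direct: Frank-Wolfe's linear minimization oracle over a finite atom set is exactly a maximum-inner-product search, so an approximate MaxIP index can stand in for the linear scan. A hedged sketch follows; the brute-force `argmax` below marks where an LSH-based MaxIP index would be substituted to make each iteration sublinear in the number of atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 10_000
A = rng.standard_normal((m, d))     # atoms; feasible set is conv(A)
y = rng.standard_normal(d)          # target for f(x) = 0.5 * ||x - y||^2

def max_ip_oracle(q: np.ndarray) -> int:
    """Linear minimization oracle, phrased as MaxIP:
    argmin_a <grad, a> == argmax_a <-grad, a>.
    A linear scan here; an approximate MaxIP structure (e.g., LSH)
    would answer the same query in time sublinear in m."""
    return int(np.argmax(A @ q))

x = A[0].copy()                     # start at a vertex of the hull
for t in range(200):
    grad = x - y                    # gradient of 0.5 * ||x - y||^2
    s = A[max_ip_oracle(-grad)]     # Frank-Wolfe direction vertex
    gamma = 2.0 / (t + 2)           # classic step-size schedule
    x = (1 - gamma) * x + gamma * s # convex update stays inside conv(A)

print("objective:", 0.5 * np.sum((x - y) ** 2))
```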
Sparse Spiking Gradient Descent
TLDR
This work presents the first sparse SNN back-propagation algorithm, which matches or exceeds the accuracy of current state-of-the-art methods while being significantly faster and more memory-efficient.
Reduced-Precision Acceleration of Radio-Astronomical Imaging on Reconfigurable Hardware
TLDR
A reduced-precision implementation of the gridding component of the widely used WSClean imaging application is presented, and the first custom floating-point accelerator on a Xilinx Alveo U50 FPGA, built using High-Level Synthesis, is proposed.
Scalable algorithms for physics-informed neural and graph networks
TLDR
Some of the prevailing trends in embedding physics into machine learning are reviewed: physics-informed neural networks (PINNs), based primarily on feed-forward neural networks and automatic differentiation, and graph neural networks that use graph exterior calculus to construct differential operators.
Sublinear Least-Squares Value Iteration via Locality Sensitive Hashing
TLDR
This work builds connections between the theory of approximate maximum inner product search and the regret analysis of reinforcement learning, and presents the first provable Least-Squares Value Iteration algorithms that achieve runtime complexity sublinear in the number of actions.
…

References

Showing 1-10 of 24 references
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems
TLDR
This paper proposes SLIDE (Sub-LInear Deep learning Engine), which uniquely blends smart randomized algorithms with multi-core parallelism and workload optimization using just a CPU, outperforming an optimized implementation of TensorFlow (TF) on the best available GPU.
Scalable and Sustainable Deep Learning via Randomized Hashing
TLDR
This work presents a novel hashing-based technique to drastically reduce the amount of computation needed to train and test neural networks, and demonstrates the scalability and sustainability (energy efficiency) of the proposed algorithm via rigorous experimental evaluations on several datasets.
A Study of BFLOAT16 for Deep Learning Training
TLDR
The results show that deep learning training using BFLOAT16 tensors achieves the same state-of-the-art (SOTA) results across domains as FP32 tensors in the same number of iterations and with no changes to hyper-parameters.
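BFLOAT16 keeps float32's sign bit and 8 exponent bits but only 7 stored mantissa bits, so conversion amounts to a rounded truncation to the upper 16 bits of the float32 encoding. A minimal NumPy sketch of round-to-nearest-even conversion follows; the function name is illustrative, and NaN/Inf handling plus carry overflow at the exponent boundary are omitted for brevity.

```python
import numpy as np

def to_bfloat16_precision(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16 precision (round-to-nearest-even),
    returned as float32 carrying only bf16's 7 mantissa bits.
    NaN/Inf and overflow of the rounding carry are not handled here."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)    # LSB of the kept half
    rounded = bits + np.uint32(0x7FFF) + lsb        # ties-to-even rounding bias
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32([1.0, 1.0 + 2**-8, 3.14159265])
print(to_bfloat16_precision(x))    # values snapped onto the bf16 grid
```

Training setups typically keep a float32 master copy of the weights and apply this rounding to tensors on the fly, which is one reason no hyper-parameter changes are needed.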
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
TLDR
This work introduces a Sparsely-Gated Mixture-of-Experts (MoE) layer, consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
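The sparsity lives in the gate: for each input only the top-k experts by gating score execute, and their outputs are mixed with renormalized gate weights. A hedged NumPy sketch of top-k gating follows; the paper's noise term and load-balancing loss are omitted, and the toy linear experts are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

Wg = rng.standard_normal((n_experts, d))    # gating network weights
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy experts

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Run only the top-k experts, mixed by softmax over their gate scores."""
    logits = Wg @ x                          # one gate score per expert
    top = np.argsort(logits)[-k:]            # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                             # softmax restricted to the top-k
    # Only k of n_experts sub-networks actually execute for this input.
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, top))

y = moe_forward(rng.standard_normal(d))
```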
A comparison of software and hardware techniques for x86 virtualization
TLDR
It is found that the hardware support for Virtual Machine Monitors on x86 fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MMU virtualization.
Densified Winner Take All (WTA) Hashing for Sparse Datasets
TLDR
This paper identifies a subtle issue with WTA that grows with the sparsity of the dataset, and proposes a solution based on the idea of densification, which makes use of 2-universal hash functions in a novel way.
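Plain WTA hashing applies a random permutation to the input and records the argmax among the first K permuted coordinates; on very sparse vectors that window is often all zeros, which is exactly the degeneracy densification repairs. A hedged sketch of the plain (un-densified) scheme follows; the densified variant's 2-universal rehashing step is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, HASHES = 128, 8, 4

# One random permutation of the coordinates per hash function.
perms = [rng.permutation(D) for _ in range(HASHES)]

def wta_hash(x: np.ndarray) -> list[int]:
    """Winner-Take-All hash: for each permutation, the index of the max
    among the first K permuted coordinates. On sparse inputs the window
    may be all zeros, giving uninformative ties -- the failure mode that
    densified WTA fixes by borrowing winners via 2-universal hashing."""
    return [int(np.argmax(x[p[:K]])) for p in perms]

print(wta_hash(rng.standard_normal(D)))
```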
Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Adaptive dropout for training deep neural networks
TLDR
A method called 'standout' is described, in which a binary belief network is overlaid on a neural network and used to regularize its hidden units by selectively setting activities to zero; it achieves lower classification error rates than other feature-learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines.
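Unlike standard dropout's fixed rate, standout computes each unit's keep probability from the input itself via the overlaid belief network, whose weights the paper ties to a scaled copy of the main network's. A hedged NumPy sketch under that tied-weights assumption follows; `alpha` and `beta` are the scale and shift hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 32, 64

W = rng.standard_normal((d_h, d_in)) * 0.1   # main layer weights
alpha, beta = 1.0, 0.0                        # belief net = alpha * W plus shift

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def standout_layer(x: np.ndarray, train: bool = True) -> np.ndarray:
    """Hidden layer with a data-dependent dropout mask: the overlaid
    belief network emits a keep probability per hidden unit."""
    pre = W @ x
    h = np.maximum(pre, 0.0)                  # main activations (ReLU)
    keep_p = sigmoid(alpha * pre + beta)      # belief net shares scaled weights
    if train:
        mask = (rng.random(d_h) < keep_p).astype(h.dtype)
        return h * mask                       # stochastic gating in training
    return h * keep_p                         # expected mask at test time

y = standout_layer(rng.standard_normal(d_in))
```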
Adaptive Sampled Softmax with Kernel Based Sampling
TLDR
This work proposes a new class of kernel-based sampling methods and develops an efficient sampling algorithm that adapts to the model as it is trained, resulting in low bias; the trade-off between bias, sampling distribution, and sample size is studied empirically.
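Sampled softmax evaluates the loss over the true class plus a small set of sampled negatives; to keep the estimator close to the full softmax, each sampled logit is corrected by subtracting the log of its proposal probability. A hedged NumPy sketch with a generic proposal `q` follows; the paper's kernel-based construction of `q` from the model itself is not reproduced here, and a uniform `q` stands in for it.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, S = 50_000, 64, 128                 # num classes, feature dim, sample size

W = rng.standard_normal((V, d)) * 0.01    # output class embeddings
q = np.full(V, 1.0 / V)                   # proposal distribution (uniform here)

def sampled_softmax_loss(h: np.ndarray, target: int) -> float:
    """Cross-entropy over {target} + S sampled negatives, with each logit
    corrected by -log q(class). Accidental draws of the target among the
    negatives are ignored here for brevity."""
    neg = rng.choice(V, size=S, replace=False, p=q)
    classes = np.concatenate(([target], neg))
    logits = W[classes] @ h - np.log(q[classes])    # importance correction
    logits -= logits.max()                           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])                      # target sits at index 0

loss = sampled_softmax_loss(rng.standard_normal(d), target=42)
```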
Efficient Estimation of Word Representations in Vector Space
TLDR
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on a test set measuring syntactic and semantic word similarities.
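The skip-gram variant with negative sampling reduces each (center, context) pair to a handful of logistic regressions, which is what makes training on very large corpora cheap. A hedged NumPy sketch of a single update step follows; hyper-parameter values are illustrative, and negatives are drawn uniformly here rather than from word2vec's unigram^0.75 distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr, K = 10_000, 100, 0.025, 5     # vocab size, dim, learning rate, negatives

W_in = rng.standard_normal((V, d)) * 0.01   # center-word vectors
W_out = np.zeros((V, d))                     # context-word vectors

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center: int, context: int) -> None:
    """One skip-gram-with-negative-sampling SGD step: pull the true context
    toward the center word, push K random words away."""
    v = W_in[center]
    targets = np.concatenate(([context], rng.integers(0, V, size=K)))
    labels = np.zeros(K + 1)
    labels[0] = 1.0                              # 1 = real pair, 0 = noise pair
    err = sigmoid(W_out[targets] @ v) - labels   # logistic-loss residuals
    grad_v = err @ W_out[targets]                # gradient w.r.t. center vector
    W_out[targets] -= lr * np.outer(err, v)
    W_in[center] -= lr * grad_v

sgns_step(center=3, context=17)
```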
…