Enabling On-Device Smartphone GPU based Training: Lessons Learned

  • Anish Das, Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo
  • Published 21 February 2022
  • Computer Science
  • 2022 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing works have focused on reducing the computational and resource overheads of running Deep Neural Network (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operations, i.e., training (forward and backward passes) on smartphone GPUs, has received little attention thus far. To this end, we conduct an initial analysis to examine the feasibility of on-device training on…
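To make the distinction between inference and training concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation and does not use a GPU; the single linear unit, the squared-error loss, and the names `forward` and `train_step` are assumptions chosen only for illustration. Inference needs the forward pass alone, while training adds a backward pass (gradient computation) and a parameter update.

```python
def forward(w, b, x):
    """Forward pass of one linear unit: the only step inference needs."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_step(w, b, x, y, lr=0.1):
    """One training step: forward pass, backward pass, parameter update."""
    pred = forward(w, b, x)
    err = pred - y                    # dL/dpred for L = 0.5 * (pred - y)^2
    grad_w = [err * xi for xi in x]   # backward pass: dL/dw
    grad_b = err                      # backward pass: dL/db
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b, 0.5 * err ** 2


# Hypothetical toy data: fit a single (x, y) pair.
w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1.0
losses = []
for _ in range(20):
    w, b, loss = train_step(w, b, x, y)
    losses.append(loss)
```

Even in this toy case, each training step does more arithmetic than inference (two gradient computations and an update on top of the forward pass) and has to keep the input `x` available for the backward pass.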

RSTensorFlow: GPU Enabled TensorFlow for Deep Learning on Commodity Android Devices
The results show that GPUs on phones can offer a substantial performance gain in matrix multiplication, so models that involve multiplication of large matrices can run much faster (approx. 3 times faster in the experiments) with GPU support.
Performance Analysis and Characterization of Training Deep Learning Models on Mobile Devices
A benchmark suite and tools are introduced to study the performance of training deep learning models on mobile devices from the perspectives of memory consumption, hardware utilization, and power consumption; the study reveals interesting performance problems and opportunities.
MNN: A Universal and Efficient Inference Engine
The contributions of MNN include presenting a mechanism called pre-inference that conducts runtime optimization, delivering thorough kernel optimization on operators to achieve optimal computation performance, and introducing a backend abstraction module that enables hybrid scheduling and keeps the engine lightweight.
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
A quantization scheme is proposed that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware.
Low-Memory Neural Network Training: A Technical Report
This paper profiles the overall memory usage of training on two representative deep learning benchmarks and comprehensively evaluates four standard techniques for reducing the training memory requirements: imposing sparsity on the model, using low precision, microbatching, and gradient checkpointing.
Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding
This work introduces "deep compression", a three-stage pipeline (pruning, trained quantization, and Huffman coding) whose stages work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy.
FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications
This work develops an end-to-end, on-device incremental learning (IL) framework, FastICARL, that incorporates exemplar-based IL and quantization in the context of audio-based applications and enables complete on-device IL, ensuring user privacy as the user data does not need to leave the device.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Exploring System Performance of Continual Learning for Mobile and Embedded Sensing Applications
This first comprehensive empirical study quantifies the performance of three predominant continual learning schemes on six datasets from three mobile and embedded sensing applications in a range of scenarios having different learning complexities. It suggests that exemplar-based replay schemes such as iCaRL have the best performance trade-offs, even in complex scenarios.
MobileNetV2: Inverted Residuals and Linear Bottlenecks
A new mobile architecture, MobileNetV2, is described that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes and allows decoupling of the input/output domains from the expressiveness of the transformation.
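
Several of the entries above rely on quantization. As a rough illustration of the integer-arithmetic-only idea, the sketch below maps real values to 8-bit integers with an affine scheme, r ≈ S(q − Z), and computes a dot product using integer operations only, applying the floating-point scales once at the end. It is a simplified sketch, not the cited paper's full scheme: the helper names, the min/max range selection, and the example vectors are assumptions, and it assumes each vector has a non-zero value range.

```python
def choose_params(values, qmin=0, qmax=255):
    """Pick scale S and zero point Z so that r ~= S * (q - Z)."""
    rmin = min(min(values), 0.0)   # range must contain 0 so that 0 is exact
    rmax = max(max(values), 0.0)
    scale = (rmax - rmin) / (qmax - qmin)   # assumes rmax > rmin
    zero_point = int(round(qmin - rmin / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map real values to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q, scale, zero_point):
    """Map integers back to approximate real values."""
    return [scale * (qi - zero_point) for qi in q]


def int_dot(qa, za, qb, zb):
    """Dot product accumulated entirely in integer arithmetic."""
    return sum((a - za) * (b - zb) for a, b in zip(qa, qb))


# Hypothetical example vectors; their exact dot product is 1.0.
a = [-1.0, 0.0, 0.5, 1.0]
b = [0.25, -0.5, 1.0, 0.75]
sa, za = choose_params(a)
sb, zb = choose_params(b)
qa, qb = quantize(a, sa, za), quantize(b, sb, zb)

# The two scales are applied once, after the integer accumulation.
approx = sa * sb * int_dot(qa, za, qb, zb)
exact = sum(x * y for x, y in zip(a, b))
```

Here `approx` differs from `exact` only by a small rounding error, which is the trade-off such schemes accept in exchange for integer-only inner loops.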