Corpus ID: 691340

CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs

Liangzhen Lai, Naveen Suda, Vikas Chandra
Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices performing data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves 4.6X…
[Figure: overview of the CMSIS-NN kernels. Neural network application code calls NNFunctions (convolution, pooling, fully-connected, activations) and NNSupportFunctions (data type conversion, activation tables).]
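As an illustration of the kind of kernel listed above, the following is a minimal sketch of a fully-connected layer on 8-bit fixed-point (q7) data. It is not the CMSIS-NN API: the function names (`saturate_q7`, `fully_connected_q7`) and the parameters `bias_shift` and `out_shift` are hypothetical, and the sketch assumes power-of-two scaling, a wide integer accumulator, and saturation of the result to the signed 8-bit range.

```python
def saturate_q7(x):
    """Clamp an integer to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, x))


def fully_connected_q7(vec, weights, bias, bias_shift, out_shift):
    """Hypothetical q7 fully-connected layer (illustrative only).

    vec        : list of int8 input activations
    weights    : list of rows, one row of int8 weights per output
    bias       : list of int8 biases, one per output
    bias_shift : left shift that aligns the bias with the accumulator
    out_shift  : right shift that rescales the accumulator to q7
    """
    out = []
    for row, b in zip(weights, bias):
        # Start from the aligned bias plus a rounding term (half an output LSB).
        acc = (b << bias_shift) + ((1 << out_shift) >> 1)
        # Multiply-accumulate in a wide integer accumulator.
        for v, w in zip(vec, row):
            acc += v * w
        # Rescale with an arithmetic right shift, then saturate to 8 bits.
        out.append(saturate_q7(acc >> out_shift))
    return out


result = fully_connected_q7(
    vec=[10, -20, 30],
    weights=[[1, 2, 3], [100, 100, 100]],
    bias=[4, 0],
    bias_shift=1,
    out_shift=2,
)
print(result)
```

In this example the second output overflows the 8-bit range before saturation, so it is clamped to 127. Keeping activations and weights in 8 bits is what reduces the memory footprint, while the wide accumulator avoids overflow during the multiply-accumulate loop.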
Machine learning (ML) algorithms are moving to the IoT edge due to various considerations such as latency, power consumption, cost, network bandwidth, reliability, privacy and security. Hence, there…
AIoT Solution Survey and Comparison in Machine Learning on Low-cost Microcontroller
This paper compares CMSIS-NN, a collection of efficient kernels developed to maximize the performance and minimize the memory footprint of neural network applications on Arm Cortex-M processors for intelligent IoT edge devices, and uTensor on low-energy microcontrollers.
XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes
This work introduces lightweight extensions to the RISC-V ISA, namely nibble (4-bit) and crumb (2-bit) SIMD instructions, to boost the efficiency of heavily quantized neural network (QNN) inference on microcontroller-class cores, and proposes a custom execution paradigm for SIMD sum-of-dot-product operations.
ARM Embedded Low Cost Solution for Implementing Deep Learning Paradigms
Deep neural networks have become a topic of great interest to scientific research groups due to their applicability in a wide range of fields. Advanced technologies of both software and integrated…
Efficient Neural Network Deployment for Microcontroller
This paper explores and generalizes convolutional neural network deployment for microcontrollers, with two novel optimization proposals offering memory savings and compute efficiency in 2D convolutions as well as fully connected layers.
Enabling mixed-precision quantized neural networks in extreme-edge devices
This work presents an extension to the PULP-NN library targeting the acceleration of mixed-precision Deep Neural Networks, an emerging paradigm able to significantly shrink the memory footprint of deep neural networks with negligible accuracy loss.
CMix-NN: Mixed Low-Precision CNN Library for Memory-Constrained Edge Devices
This brief presents CMix-NN, a flexible open-source mixed low-precision inference library for low-bitwidth quantized networks, with independent quantization of weight and activation tensors at 8, 4, or 2 bits.
Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent…
Arbitrary-Precision Convolutional Neural Networks on Low-Power IoT Processors
Virtual Quantization (VQ) is introduced, a hardware-friendly compression method that allows equivalent n-ary fixed-point quantized CNNs to be implemented on general-purpose instruction-set architectures.
Soft Error Reliability Assessment of Neural Networks on Resource-constrained IoT Devices
Results show that the soft error reliability of a convolutional neural network developed based on the Arm CMSIS-NN library varies depending on the instruction set architecture and the layer where the faults are injected.


Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs
This work studies the memory efficiency of various CNN layers and reveals the performance implication from both data layouts and memory access patterns, which shows the universal effect of the proposed optimizations on both single layers and various networks.
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
This work presents a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA's resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth.
Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations
It is shown that using floating-point numbers for weights is more efficient than fixed-point representation for the same bit-width and enables compact hardware multiply-and-accumulate (MAC) unit design.
Hello Edge: Keyword Spotting on Microcontrollers
It is shown that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy, and the depthwise separable convolutional neural network (DS-CNN) is explored and compared against other neural network architectures.
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grain classification, face attributes and large-scale geo-localization.
Fixed Point Quantization of Deep Convolutional Networks
This paper proposes a quantizer design for fixed point implementation of DCNs, formulates and solves an optimization problem to identify optimal fixed point bit-width allocation across DCN layers, and demonstrates that fine-tuning can further enhance the accuracy of fixed point DCNs beyond that of the original floating point model.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Edge Computing: Vision and Challenges
The definition of edge computing is introduced, followed by several case studies, ranging from cloud offloading to smart home and city, as well as collaborative edge to materialize the concept of edge computing.
The route to a trillion devices