Ternary MobileNets via Per-Layer Hybrid Filter Banks

Dibakar Gope, Jesse G. Beu, Urmish Thakker and Matthew Mattina. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
The MobileNets family of computer vision neural networks has fueled tremendous progress in the design and organization of resource-efficient architectures in recent years. New applications with stringent real-time requirements on highly constrained devices require further compression of MobileNets-like compute-efficient networks. Model quantization is a widely used technique to compress and accelerate neural network inference, and prior works have quantized MobileNets to 4-6 bits, albeit with a…
Aggressive Compression of MobileNets Using Hybrid Ternary Layers
Problem to be solved: in a neural network with binary (-1, 1) or ternary (-1, 0, 1) weights, multiplications are replaced by additions. Multipliers consume significantly more area and energy than adders.
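To illustrate why binary/ternary weights eliminate multipliers, here is a minimal sketch (function and variable names are illustrative, not from the paper): with weights restricted to {-1, 0, +1}, every product w*x degenerates to adding x, subtracting x, or skipping it.

```python
def ternary_dot(weights, activations):
    """Dot product with ternary weights in {-1, 0, +1}.

    Every multiply is replaced by an add, a subtract, or a skip,
    which is why ternary layers need no hardware multipliers.
    """
    acc = 0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: contributes nothing
    return acc
```

For example, `ternary_dot([1, -1, 0, 1], [3, 5, 7, 2])` computes 3 - 5 + 2 = 0 without a single multiplication.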
Symmetric $k$-Means for Deep Neural Network Compression and Hardware Acceleration on FPGAs
A novel symmetric k-means-based compression algorithm that is specifically designed to support a new FPGA-based hardware acceleration scheme, reducing the number of inference-time multiply-accumulate (MAC) operations by up to 98%.
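One way to picture a symmetric weight codebook (a toy sketch under my own assumptions, not the paper's algorithm): cluster the weight magnitudes with k-means, then quantize each weight to sign(w) times its nearest magnitude centroid, so the effective codebook {±c_1, …, ±c_k} is symmetric around zero and only half of it needs to be stored.

```python
def symmetric_kmeans(weights, k_half, iters=20):
    """Toy symmetric k-means: centroids come in +/- pairs.

    Magnitudes |w| are clustered into k_half centroids; each weight is
    quantized to sign(w) * nearest_magnitude_centroid, so the codebook
    is symmetric around zero and only k_half values are stored.
    """
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    # Spread initial centroids evenly over the magnitude range.
    cents = [lo + (hi - lo) * (i + 0.5) / k_half for i in range(k_half)]
    for _ in range(iters):
        buckets = [[] for _ in range(k_half)]
        for m in mags:
            j = min(range(k_half), key=lambda i: abs(m - cents[i]))
            buckets[j].append(m)
        cents = [sum(b) / len(b) if b else cents[i]
                 for i, b in enumerate(buckets)]

    def quantize(w):
        c = min(cents, key=lambda c: abs(abs(w) - c))
        return c if w >= 0 else -c

    return cents, [quantize(w) for w in weights]
```

Because centroids are shared between positive and negative weights, hardware can reuse one magnitude lookup and apply the sign separately.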
Run-Time Efficient RNN Compression for Inference on Edge Devices
A new compressed RNN cell implementation called Hybrid Matrix Decomposition (HMD) is explored that results in faster inference runtime than pruning and better accuracy than matrix factorization for compression factors of 2-4x.
High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands
A new SIMD matrix multiplication instruction that uses mixed precision on its inputs and accumulates product values into narrower 16-bit output accumulators, allowing the SIMD operation at 128-bit vector width to process a greater number of data elements per instruction and thereby improve processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs.
Doping: A technique for efficient compression of LSTM models using sparse structured additive matrices
Empirical evidence is provided to show that doping, CMA and CMR are concepts generally applicable to multiple structured matrices (Kronecker product, LMF, Hybrid Matrix Decomposition), and results with doped Kronecker product matrices demonstrate state-of-the-art accuracy at large compression factors across 4 natural language processing applications with minor loss in accuracy.
Rank and run-time aware compression of NLP Applications
A new compression technique called Hybrid Matrix Factorization (HMF) is proposed that improves low-rank matrix factorization techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF and faster inference runtime than pruning or structured-matrix-based compression techniques.
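One way to picture such a hybrid structure (an illustrative sketch under my own assumptions, not the paper's exact formulation; all names are invented): store the first output rows of a weight matrix densely and represent the remaining rows as a low-rank product, so a matrix-vector product costs one dense block plus two thin multiplies.

```python
def matvec(M, x):
    """Plain matrix-vector product over lists of lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def hybrid_apply(D, U, V, x):
    """y = W_hat @ x where the first rows of W are stored dense (D)
    and the remaining rows are a low-rank product U @ V."""
    Vx = matvec(V, x)               # rank-sized intermediate
    return matvec(D, x) + matvec(U, Vx)

def hybrid_params(D, U, V):
    """Parameters stored by the hybrid layout (dense block + factors)."""
    count = lambda M: sum(len(row) for row in M)
    return count(D) + count(U) + count(V)
```

For a tall weight matrix with a small rank, the U and V factors hold far fewer parameters than the dense rows they replace, which is where the compression comes from.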
Compressing Language Models using Doped Kronecker Products
A way to recover accuracy otherwise lost when applying KP to large NLP tasks, by allowing additional degrees of freedom in the KP matrix through doping, a process of adding an extremely sparse overlay matrix on top of the predefined KP structure.
MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers
This paper employs differentiable NAS (DNAS) to search for models with low memory usage and low op count, where op count is treated as a viable proxy for latency, and obtains state-of-the-art results for all three TinyMLperf industry-standard benchmark tasks.
Pushing the Envelope of Dynamic Spatial Gating Technologies
This paper focuses on one such technology that targets unimportant features in the spatial domain of the OFM, called Precision Gating (PG), and shows that PG leads to a loss in accuracy when the MAC reduction achieved by a PG network is pushed further.
Understanding the Impact of Dynamic Channel Pruning on Conditionally Parameterized Convolutions
This paper analyzes a recent method, Feature Boosting and Suppression (FBS), which dynamically assesses which channels contain the most important input-dependent features and prunes the others based on a runtime threshold-gating mechanism, and discovers that substituting standard convolutional filters with input-specific filters, as described in CondConv, enables FBS to address this accuracy loss.


Multi-Precision Quantized Neural Networks via Encoding Decomposition of -1 and +1
A novel encoding scheme using {-1,+1} to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (xnor and bitcount) to achieve model compression, computational acceleration and resource saving.
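A minimal sketch of the xnor/bitcount trick (my own illustrative encoding, not the paper's exact scheme): pack {-1,+1} vectors into integers with bit 1 standing for +1 and bit 0 for -1. Each elementwise product is +1 exactly when the bits agree, i.e. XNOR, so the dot product is 2*popcount(xnor) - n.

```python
def binary_dot_xnor(a_bits, b_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed as integers.

    Bit i = 1 encodes +1, bit i = 0 encodes -1. A product term is +1
    exactly when the two bits agree (XNOR), so
    dot = (#agreeing) - (#disagreeing) = 2*popcount(xnor) - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    pop = bin(xnor).count("1")
    return 2 * pop - n
```

For example, a = (+1, -1, +1, +1) packs to 0b1101 and b = (+1, +1, -1, +1) to 0b1011; their dot product 1 - 1 - 1 + 1 = 0 falls out of a single xnor plus a popcount.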
HAQ: Hardware-Aware Automated Quantization With Mixed Precision
The Hardware-Aware Automated Quantization (HAQ) framework is introduced, which leverages reinforcement learning to automatically determine the quantization policy and takes the hardware accelerator's feedback in the design loop to generate direct feedback signals to the RL agent.
Trained Ternary Quantization
This work proposes Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values while improving the accuracy of some models (32-, 44- and 56-layer ResNet) on CIFAR-10 and AlexNet on ImageNet.
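A rough sketch of TTQ-style ternarization (the threshold factor and scales below are illustrative constants; in TTQ the positive and negative scales are learned during training): weights below a threshold go to zero, and the rest map to an asymmetric pair of scales +wp and -wn.

```python
def ternarize_ttq(weights, wp, wn, thresh_factor=0.05):
    """Sketch of TTQ-style ternarization.

    Weights within a threshold band (a fraction of the max magnitude)
    become 0; the rest map to a positive scale +wp or negative scale
    -wn. In TTQ these scales are trained; here they are given.
    """
    t = thresh_factor * max(abs(w) for w in weights)
    out = []
    for w in weights:
        if w > t:
            out.append(wp)
        elif w < -t:
            out.append(-wn)
        else:
            out.append(0.0)
    return out
```

Allowing wp and wn to differ (rather than forcing a single symmetric scale) is what gives the trained ternary codebook its extra accuracy headroom.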
Low-bit Quantization of Neural Networks for Efficient Inference
This paper formalizes the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit precision inference without the need for full network retraining. Expand
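A minimal sketch of what an MMSE formulation of linear quantization looks like (a brute-force grid search of my own, standing in for the paper's analytic solution; all names are invented): pick the scale s that minimizes the mean squared error between w and clip(round(w / s)) * s over a symmetric signed range.

```python
def mmse_scale(weights, num_bits, candidates=200):
    """Sketch of MMSE linear quantization: grid-search the scale s
    minimizing mean((w - q(w))^2) where
    q(w) = clip(round(w / s), qmin, qmax) * s for a symmetric range.
    """
    qmax = 2 ** (num_bits - 1) - 1
    qmin = -qmax
    wmax = max(abs(w) for w in weights)
    best_s, best_err = None, float("inf")
    for i in range(1, candidates + 1):
        s = wmax * i / (candidates * qmax)
        err = 0.0
        for w in weights:
            q = max(qmin, min(qmax, round(w / s)))
            err += (w - q * s) ** 2
        err /= len(weights)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err
```

Because the optimal scale is found per tensor from the data alone, this kind of post-training calibration needs no full network retraining, which is the point the paper makes.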
Quantization Networks
Jiwei Yang, Xu Shen, +5 authors Xiansheng Hua. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
This paper provides a simple and uniform way to quantize weights and activations by formulating quantization as a differentiable non-linear function, shedding new light on the interpretation of neural network quantization.
Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks
Differentiable Soft Quantization (DSQ) is proposed to bridge the gap between full-precision and low-bit networks; it can help pursue accurate gradients in backward propagation and reduce the quantization loss in the forward process with an appropriate clipping range.
Ternary neural networks for resource-efficient AI applications
This paper proposes ternary neural networks (TNNs) to make deep learning more resource-efficient, and designs a purpose-built hardware architecture for TNNs, implementing it on FPGA and ASIC.
YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration
This paper presents an accelerator optimized for binary-weight CNNs that significantly outperforms the state of the art in energy and area efficiency, removing the need for expensive multiplications as well as reducing I/O bandwidth and storage.
BinaryConnect: Training Deep Neural Networks with binary weights during propagations
BinaryConnect is introduced, a method that consists of training a DNN with binary weights during the forward and backward propagations, while retaining the precision of the stored weights in which gradients are accumulated; near state-of-the-art results with BinaryConnect are obtained on permutation-invariant MNIST, CIFAR-10 and SVHN.