Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices: Design Considerations

Tayfun Gökmen and Yurii A. Vlasov
Frontiers in Neuroscience
In recent years, deep neural networks (DNNs) have demonstrated significant business impact in large-scale analysis and classification tasks such as speech recognition, visual object detection, and pattern extraction. [...] A system consisting of a cluster of RPU accelerators will be able to tackle Big Data problems with trillions of parameters that are impossible to address today, such as natural speech recognition and translation between all world languages, real-time analytics on large…

Zero-shifting Technique for Deep Neural Network Training on Resistive Cross-point Arrays
The concept of a symmetry point is introduced, and a zero-shifting technique is proposed that compensates for imbalance by programming the reference device and thereby changing the zero value point of the weight. Network performance is shown to improve dramatically for imbalanced synapse devices.
TxSim: Modeling Training of Deep Neural Networks on Resistive Crossbar Systems
TxSim is a fast and customizable modeling framework for functionally evaluating DNN training on crossbar-based hardware while accounting for the impact of nonidealities. It achieves computational efficiency by mapping crossbar evaluations to well-optimized Basic Linear Algebra Subprograms (BLAS) routines and incorporates speedup techniques that further reduce simulation time with minimal impact on accuracy.
Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices
This work shows how to map the convolutional layers to fully connected RPU arrays such that the parallelism of the hardware can be fully utilized in all three cycles of the backpropagation algorithm.
Perspective on training fully connected networks with resistive memories: Device requirements for multiple conductances of varying significance
Simulations evaluating the final generalization accuracy of a trained four-neuron-layer fully connected network quantify the required dynamic range, the tolerable device-to-device variability in both maximum conductance and maximum conductance change, the tolerable pulse-to-pulse variability in conductance changes, and the tolerable device yield.
Algorithm for Training Neural Networks on Resistive Device Arrays
A new training algorithm, the so-called “Tiki-Taka” algorithm, is presented that eliminates this stringent symmetry requirement for resistive crossbar arrays and maintains the aforementioned power and speed benefits.
Analog CMOS-based resistive processing unit for deep neural network training
An analog CMOS-based RPU design (CMOS RPU) is proposed that can store and process data locally and can be operated in a massively parallel manner; its functionality and feasibility for accelerating DNN training are evaluated.
Design and characterization of superconducting nanowire-based processors for acceleration of deep neural network training
The superconducting nanowire-based processing element as a crosspoint device has many programmable non-volatile states that can be used to perform analog multiplication, and these states are intrinsically discrete due to quantization of flux, which provides symmetric switching characteristics.
Vesti: Energy-Efficient In-Memory Computing Accelerator for Deep Neural Networks
A new DNN accelerator is designed to support configurable multibit activations and large-scale DNNs seamlessly while substantially improving the chip-level energy-efficiency with favorable accuracy tradeoff compared to conventional digital ASIC.
Training LSTM Networks With Resistive Cross-Point Devices
This work further extends the RPU concept to training recurrent neural networks (RNNs), namely LSTMs, and finds that RPU device variations and hardware noise are enough to mitigate overfitting, so that there is less need for dropout.


On-Chip Sparse Learning Acceleration With CMOS and Resistive Synaptic Devices
This paper co-optimizes algorithm, architecture, circuit, and device for real-time, energy-efficient on-chip hardware acceleration of sparse coding and shows that a 65 nm implementation of the CMOS ASIC and PARCA scheme accelerates sparse coding computation by 394× and 2140×, respectively, compared to software running on an eight-core CPU.
A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks
The nn-X system is presented, a scalable, low-power coprocessor for enabling real-time execution of deep neural networks, able to achieve a peak performance of 227 G-ops/s, which translates to a performance per power improvement of 10 to 100 times that of conventional mobile and desktop processors.
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, O. Temam
  • Computer Science
    2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
  • 2014
This article introduces a custom multi-chip machine-learning architecture, showing that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system.
Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power)
It is shown that NVM-based systems could potentially offer faster and lower-power ML training than GPU-based hardware, despite the inherent random and deterministic imperfections of such devices.
Scaling-up resistive synaptic arrays for neuro-inspired architecture: Challenges and prospect
A circuit-level macro simulator is developed to explore the design trade-offs and evaluate the overhead of the proposed mitigation strategies as well as project the scaling trend of the neuro-inspired architecture.
Memristor-Based Multilayer Neural Networks With Online Gradient Descent Training
The proposed memristor-based circuit can compactly implement hardware multilayer neural networks (MNNs) trainable by scalable algorithms based on online gradient descent (e.g., backpropagation), and its utility and robustness are demonstrated.
Training and operation of an integrated neuromorphic network based on metal-oxide memristors
This work reports the experimental implementation of transistor-free metal-oxide memristor crossbars, with device variability low enough to allow operation of integrated neural networks, in a simple network: a single-layer perceptron (an algorithm for linear classification).
Deep learning with COTS HPC systems
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology, a cluster of GPU servers with InfiniBand interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
Experimental Demonstration and Tolerancing of a Large-Scale Neural Network (165 000 Synapses) Using Phase-Change Memory as the Synaptic Weight Element
It is shown that a bidirectional NVM with a symmetric, linear conductance response of high dynamic range is capable of delivering the same high classification accuracies on this problem as a conventional, software-based implementation of this same network.
A generic systolic array building block for neural networks with on-chip learning
The two-dimensional systolic array system presented is an attempt to define a novel computer architecture inspired by neurobiology that is composed of generic building blocks for basic operations rather than predefined neural models.