In-datacenter performance analysis of a tensor processing unit

@article{Jouppi2017IndatacenterPA,
  title={In-datacenter performance analysis of a tensor processing unit},
  author={Norman P. Jouppi and Cliff Young and Nishant Patil and David A. Patterson and Gaurav Agrawal and Raminder Singh Bajwa and Sarah Bates and Suresh Bhatia and Nanette J. Boden and Al Borchers and Rick Boyle and Pierre-luc Cantin and Clifford Chao and Chris Clark and Jeremy Coriell and Mike Daley and Matt Dau and Jeffrey Dean and Ben Gelb and Tara Vazir Ghaemmaghami and Rajendra Gottipati and William Gulland and Robert B. Hagmann and C. Richard Ho and Doug Hogberg and John Hu and Robert Hundt and Daniel Hurt and Julian Ibarz and Aaron Jaffey and Alek Jaworski and Alexander Kaplan and Harshit Khaitan and Daniel Killebrew and Andy Koch and Naveen Kumar and Steve Lacy and James Laudon and James Law and Diemthu Le and Chris Leary and Zhuyuan Liu and Kyle A. Lucke and Alan Lundin and Gordon MacKean and Adriana Maggiore and Maire Mahony and Kieran Miller and Rahul Nagarajan and Ravi Narayanaswami and Ray Ni and Kathy Nix and Thomas Norrie and Mark Omernick and Narayana Penukonda and Andy Phelps and Jonathan Ross and Matt Ross and Amir Salek and Emad Samadiani and Chris Severn and Gregory Sizikov and Matthew Snelham and J. W. Souter and Dan Steinberg and Andy Swing and Mercedes Tan and Gregory Thorson and Bo Tian and Horia Toma and Erick Tuttle and Vijay Vasudevan and Richard Walter and Walter Wang and Eric Wilcox and Doe Hyun Yoon},
  journal={2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)},
  year={2017},
  pages={1-12}
}
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution…
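The headline 92 TOPS figure follows from the size and clock rate of the matrix multiply unit. The sanity check below assumes the 700 MHz clock rate reported in the body of the paper; the MAC count and TOPS figure are taken from the abstract above.

```python
# Peak-throughput sanity check for the TPU's matrix multiply unit. The
# 256x256 = 65,536 MAC array and the 92 TOPS figure are from the abstract;
# the 700 MHz clock rate is taken from the full paper.
macs = 256 * 256                 # 8-bit multiply-accumulate units
clock_hz = 700e6                 # TPU clock frequency
ops_per_mac = 2                  # each MAC counts as a multiply plus an add

peak_tops = macs * clock_hz * ops_per_mac / 1e12
print(f"peak throughput ≈ {peak_tops:.1f} TOPS")   # ≈ 91.8, i.e. ~92 TOPS
```

Citations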
GreenTPU: Improving Timing Error Resilience of a Near-Threshold Tensor Processing Unit
This work proposes GreenTPU, a low-power near-threshold computing (NTC) TPU design paradigm that enables 2×-3× higher performance in an NTC TPU with minimal loss in prediction accuracy.
GreenTPU: Predictive Design Paradigm for Improving Timing Error Resilience of a Near-Threshold Tensor Processing Unit
GreenTPU is a low-power near-threshold computing (NTC) TPU design paradigm that identifies patterns in the error-causing activation sequences in the systolic array and prevents further timing errors from similar patterns by intermittently boosting the operating voltage of the specific multiplier-and-accumulator units in the TPU.
High performance Monte Carlo simulation of ising model on TPU clusters
A novel approach using TensorFlow on Cloud TPUs to simulate the two-dimensional Ising model, demonstrating that low-precision bfloat16 arithmetic does not compromise the correctness of the simulation results.
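For readers unfamiliar with the simulation being accelerated, the following is a minimal NumPy sketch of a checkerboard Metropolis sweep for the 2D Ising model. It illustrates only the update rule, not the cited paper's TensorFlow/TPU implementation or its bfloat16 arithmetic; lattice size, temperature, and sweep count are illustrative.

```python
# Illustrative NumPy sketch of a checkerboard Metropolis sweep for the 2D
# Ising model. The cited paper implements this with TensorFlow on Cloud TPUs
# in bfloat16; this version only demonstrates the update rule itself.
import numpy as np

def metropolis_sweep(spins, beta, rng):
    """One full sweep over an L x L lattice of +/-1 spins at inverse temperature beta."""
    L = spins.shape[0]
    rows, cols = np.indices((L, L))
    # Update the two checkerboard sub-lattices in turn so that no two
    # simultaneously updated spins are neighbours.
    for parity in (0, 1):
        mask = (rows + cols) % 2 == parity
        # Sum of the four nearest neighbours with periodic boundaries.
        neighbours = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0)
                      + np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        # Energy change if a spin is flipped: dE = 2 * s_i * sum(neighbours).
        dE = 2.0 * spins * neighbours
        accept = rng.random((L, L)) < np.exp(-beta * dE)
        spins = np.where(mask & accept, -spins, spins)
    return spins

rng = np.random.default_rng(0)
lattice = rng.choice([-1, 1], size=(64, 64))
for _ in range(100):
    lattice = metropolis_sweep(lattice, beta=0.44, rng=rng)
print("magnetisation per spin:", lattice.mean())
```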
Beyond Peak Performance: Comparing the Real Performance of AI-Optimized FPGAs and GPUs
Presents the first performance evaluation of Intel's AI-optimized FPGA, the Stratix 10 NX, against the latest accessible AI-optimized GPUs, the NVIDIA T4 and V100, on a large suite of real-time DL inference workloads.
Accelerating reduction and scan using tensor core units
This paper is the first to try to broaden the class of algorithms expressible as tensor core unit (TCU) operations, and the first to show the benefits of this mapping in terms of program simplicity, efficiency, and performance.
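The algebraic idea behind the mapping is that reductions and scans can be written as products with constant matrices, which is exactly what matrix-multiply hardware consumes. The NumPy sketch below illustrates only that idea; the cited work targets NVIDIA tensor cores via CUDA, which this sketch does not model.

```python
# Reduction and scan expressed as matrix products, the mapping exploited by
# tensor-core-based implementations. Plain NumPy for illustration only.
import numpy as np

x = np.arange(1, 17, dtype=np.float32)          # input vector, length 16
ones = np.ones((16, 1), dtype=np.float32)

# Reduction: sum(x) = x (1x16) @ ones (16x1).
total = (x[None, :] @ ones)[0, 0]
assert total == x.sum()

# Inclusive scan: prefix sums = L @ x, where L is a lower-triangular ones matrix.
L = np.tril(np.ones((16, 16), dtype=np.float32))
prefix = L @ x
assert np.allclose(prefix, np.cumsum(x))
print(total, prefix)
```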
Proximu: Efficiently Scaling DNN Inference in Multi-core CPUs through Near-Cache Compute
Proximu enables large CPU efficiency gains while achieving performance similar to state-of-the-art domain-specific accelerators (DSAs) for DNN inference.
EFFORT: Enhancing Energy Efficiency and Error Resilience of a Near-Threshold Tensor Processing Unit
Proposes EFFORT, an energy-optimized yet high-performance TPU architecture operating in the near-threshold computing (NTC) region that enables up to 2.5× better performance at NTC with only a 2% average accuracy drop across 3 of 4 DNN datasets.
TPUPoint: Automatic Characterization of Hardware-Accelerated Machine-Learning Behavior for Cloud Computing
TPUPoint's advantages significantly increase the potential for discovering optimal parameters that quickly balance the complex workload pipeline of feeding data into the system, reformatting the data, and computing results.
Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads
Introduces the TSP architecture, a functionally sliced microarchitecture in which memory units are interleaved with vector and matrix deep learning functional units to exploit the dataflow locality of deep learning operations.
TRIP: An Ultra-Low Latency, TeraOps/s Reconfigurable Inference Processor for Real-Time Multi-Layer Perceptrons
The Multi-Layer Perceptron (MLP) is one of the most commonly deployed deep neural networks, representing 61% of the workload in Google data centers [1]. MLPs have low arithmetic intensity, which results in…
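To make the low-arithmetic-intensity point concrete, here is a back-of-the-envelope estimate for a single fully connected layer at batch size 1; the layer dimensions and 8-bit weights are illustrative assumptions rather than figures from the cited paper.

```python
# Back-of-the-envelope arithmetic intensity for one fully connected (MLP)
# layer at batch size 1. Layer size and 8-bit weights are illustrative
# assumptions, not figures from the cited paper.
n_in, n_out = 2048, 2048        # hypothetical layer dimensions
bytes_per_weight = 1            # 8-bit quantized weights

ops = 2 * n_in * n_out          # one multiply and one add per weight
weight_bytes = n_in * n_out * bytes_per_weight

# At batch size 1 every weight byte supports only ~2 operations, so the layer
# is limited by memory bandwidth rather than by peak MAC throughput; larger
# batches raise the ratio because each weight is reused across inputs.
print(f"ops/byte ≈ {ops / weight_bytes:.1f}")   # ≈ 2.0
```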

References

Showing 1-10 of 125 references
In-Datacenter Performance Analysis of a Tensor Processing Unit™
Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU)…
DaDianNao: A Machine-Learning Supercomputer
  • Yunji Chen, Tao Luo, et al., and O. Temam
  • 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014
Introduces a custom multi-chip machine-learning architecture and shows that, on a subset of the largest known neural network layers, a 64-chip system achieves a speedup of 450.65x over a GPU and reduces energy by 150.31x on average.
EIE: Efficient Inference Engine on Compressed Deep Neural Network
  • Song Han, Xingyu Liu, et al., and W. Dally
  • 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016
Presents an energy-efficient inference engine (EIE) that performs inference directly on a compressed network model, accelerating the resulting sparse matrix-vector multiplication with weight sharing; it is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression.
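As an illustration of the sparse, weight-shared computation EIE accelerates, the sketch below performs a sparse matrix-vector product in which each nonzero weight is stored as a small index into a codebook of shared values. It is a plain NumPy rendering of the general idea, not EIE's storage format or hardware datapath; all values are made up.

```python
# Sparse matrix-vector multiplication with weight sharing: nonzero weights are
# stored as small indices into a codebook of shared values, as in
# deep-compression-style models. Illustrative only, not EIE's datapath.
import numpy as np

codebook = np.array([0.0, -0.5, 0.25, 1.0], dtype=np.float32)  # shared weights

# CSR-style layout of a 3x4 sparse weight matrix: per-row nonzero ranges,
# column indices, and per-nonzero codebook indices instead of raw floats.
row_ptr  = np.array([0, 2, 3, 5])     # row i owns nonzeros [row_ptr[i], row_ptr[i+1])
col_idx  = np.array([0, 3, 1, 0, 2])  # column of each nonzero
code_idx = np.array([1, 3, 2, 3, 1])  # 2-bit index into the codebook

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)

y = np.zeros(3, dtype=np.float32)
for i in range(3):
    for k in range(row_ptr[i], row_ptr[i + 1]):
        # Decode the shared weight, then multiply-accumulate.
        y[i] += codebook[code_idx[k]] * x[col_idx[k]]

print(y)   # equals W @ x with W reconstructed from the codebook
```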
Fathom: reference workloads for modern deep learning methods
Assembles Fathom, a collection of eight archetypal deep learning workloads ranging from the familiar deep convolutional neural network of Krizhevsky et al. to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an…
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
Designs an accelerator for large-scale CNNs and DNNs, with special emphasis on the impact of memory on accelerator design, performance, and energy, and shows that it is possible to build a high-throughput accelerator capable of 452 GOP/s in a small footprint.
Memory-centric accelerator design for Convolutional Neural Networks
Shows that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads while minimizing on-chip memory size, which reduces area and energy usage.
Accelerating Deep Convolutional Neural Networks Using Specialized Hardware
Hardware specialization in the form of GPGPUs, FPGAs, and ASICs offers a promising path toward major leaps in processing capability at high energy efficiency, and combining multiple FPGAs over a low-latency communication fabric offers a further opportunity to train and evaluate models of unprecedented size and quality.
Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing
Presents Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most ineffectual operations (those involving zero-valued neurons), improving performance and energy over a state-of-the-art accelerator with no accuracy loss.
Origami: A Convolutional Network Accelerator
Presents the first convolutional network accelerator that is scalable to network sizes currently handled only by workstation GPUs while remaining within the power envelope of embedded systems.