• Publications
In-datacenter performance analysis of a tensor processing unit
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), that has been deployed in datacenters since 2015 and accelerates the inference phase of neural networks (NN). It compares the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures
Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA2P and EDAP.
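The metrics named in this summary can be sketched in a few lines. This assumes the usual reading of the acronyms: EDP as energy times delay, EDAP as energy times delay times area, and EDA2P as energy times delay times area squared, with area standing in for die cost. The function names and the two configurations below are hypothetical, and the numbers are made up purely to exercise the functions; they are not results from the paper.

```python
def edp(energy, delay):
    """Energy-delay product: ignores die area (cost)."""
    return energy * delay


def edap(energy, delay, area):
    """Energy-delay-area product: area acts as a proxy for die cost."""
    return energy * delay * area


def eda2p(energy, delay, area):
    """Energy-delay-area^2 product: weights area (cost) more heavily."""
    return energy * delay * area ** 2


def best_config(configs, metric):
    """Return the name of the configuration that minimizes the metric."""
    return min(configs, key=lambda name: metric(*configs[name]))


# Hypothetical (energy, delay, area) triples in arbitrary units.
# These are illustrative placeholders, NOT McPAT output.
configs = {
    "4-core clusters": (10, 5, 2),
    "8-core clusters": (8, 5, 3),
}

best_by_edp = best_config(configs, lambda e, d, a: edp(e, d))
best_by_edap = best_config(configs, edap)
best_by_eda2p = best_config(configs, eda2p)
```

The point of the sketch is only that the ranking of two designs can flip once area enters the metric: a design that wins on energy and delay alone can lose when its larger area is charged against it.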
NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory
NVSim is a circuit-level model for NVM performance, energy, and area estimation. It supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash, and is expected to help boost architecture-level NVM-related studies.
Corona: System Implications of Emerging Nanophotonic Technology
This work argues that, compared with an electrically connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory-intensive workloads while simultaneously reducing power.
CACTI 6.0: A Tool to Model Large Caches
This report details the analytical model assumed for the newly added modules of CACTI 6.0, along with their validation analysis. CACTI 6.0 is a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches.
Cacti: An enhanced cache access and cycle time model
In a clock having a synchronous motor and a chime that is struck once every half hour by a biased hammer assembly that is prevented from striking the chime except when a tab on the hammer falls into a…
Cacti 3.0: an integrated cache timing, power, and area model
tion in 1982. We focus on information technology that is relevant to the technical strategy of the Corporation, and that has the potential to open new business opportunities. Research at WRL includes…
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
  • N. Jouppi
  • Computer Science
  • Proceedings. The 17th Annual International…
  • 1990
Hardware techniques for improving the performance of caches are presented. Stream buffers prefetch cache lines starting at a cache miss address and are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses.
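The stream buffer idea in this summary can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's design: it models a single stream buffer in front of a tiny direct-mapped cache, treats prefetched lines as available immediately (no prefetch latency), and uses made-up sizes. The constants and the `simulate` function are hypothetical names chosen for illustration.

```python
from collections import deque

LINE_SIZE = 16     # bytes per cache line (assumed)
NUM_SETS = 4       # lines in the direct-mapped cache (assumed, tiny on purpose)
BUFFER_DEPTH = 4   # entries in the stream buffer (assumed)


def simulate(addresses, use_stream_buffer):
    """Count accesses that miss both the cache and the stream buffer."""
    cache = [None] * NUM_SETS   # line number held by each cache set
    stream = deque()            # FIFO of prefetched line numbers
    memory_misses = 0

    for addr in addresses:
        line = addr // LINE_SIZE
        index = line % NUM_SETS
        if cache[index] == line:
            continue            # cache hit, nothing to do

        if use_stream_buffer and stream and stream[0] == line:
            # Hit at the head of the stream buffer: the line moves into
            # the cache and the next sequential line is prefetched.
            stream.popleft()
            next_line = stream[-1] + 1 if stream else line + 1
            stream.append(next_line)
        else:
            # Miss everywhere: fetch from memory and restart the stream
            # at the lines that follow the miss address.
            memory_misses += 1
            if use_stream_buffer:
                stream = deque(line + 1 + i for i in range(BUFFER_DEPTH))

        cache[index] = line
    return memory_misses


# Sequential walk over 32 cache lines, 4 accesses per line.
trace = list(range(0, LINE_SIZE * 32, 4))
without_buffer = simulate(trace, use_stream_buffer=False)
with_buffer = simulate(trace, use_stream_buffer=True)
```

On this purely sequential trace the sketch counts 32 memory misses without the buffer (one compulsory miss per line) and 1 with it, because every line after the first is found at the head of the stream buffer. Real hardware would see less benefit, since prefetches take time to return.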
Complexity-Effective Superscalar Processors
A microarchitecture that simplifies wakeup and selection logic is proposed and discussed, which will help minimize performance degradation due to slow bypasses in future wide-issue machines.
Single-ISA heterogeneous multi-core architectures for multithreaded workload performance
This paper examines two single-ISA heterogeneous multi-core architectures in detail, demonstrating dynamic core assignment policies that provide significant performance gains over naive assignment, and even outperform the best static assignment.