Toward Dark Silicon in Servers

Nikolaos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. IEEE Micro.
Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that we cannot afford to power. Specialized multicore processors, however, can leverage the underutilized die area to overcome the initial power barrier, delivering significantly higher performance for the same bandwidth and power envelopes. 
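The power barrier behind this claim can be illustrated with a toy post-Dennard scaling model (the numbers below are illustrative assumptions, not figures from the paper): transistor density keeps roughly doubling per generation, but per-transistor switching power no longer halves, so a fixed chip power budget can keep a shrinking fraction of the die active.

```python
# Toy post-Dennard model (illustrative assumptions, not from the paper):
# density doubles per generation, but per-transistor power shrinks only by
# an assumed factor of 0.7 instead of the ideal 0.5, so a fixed power
# budget powers an ever-smaller fraction of the die.
def powered_fraction(generations, density_growth=2.0, power_scale=0.7):
    """Fraction of the die that can be active after `generations` shrinks."""
    transistors = density_growth ** generations
    per_transistor_power = power_scale ** generations
    # Power drawn if the entire die switched, relative to generation 0:
    full_die_power = transistors * per_transistor_power
    # A fixed budget (1.0) lights up only this fraction of the die;
    # the remainder is dark silicon.
    return min(1.0, 1.0 / full_die_power)

for g in range(5):
    print(g, round(powered_fraction(g), 3))
```

Under these assumed constants, roughly half the die is already dark after two generations, which is the underutilized area that specialized cores could reclaim.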


A Reconfigurable NoC Topology for the Dark Silicon Era

This paper proposes a reconfigurable network-on-chip that leverages routers in the dark portion of the chip to customize the NoC topology for whichever cores are active at any time, showing considerable reductions in the energy consumption and latency of on-chip communication compared to state-of-the-art NoCs.

Making the On-Chip World Smaller with Low-Latency On-Chip Networks

This thesis proposes to make the on-chip world appear smaller by providing extremely low-latency networks that make faraway resources appear much closer, leveraging specially engineered electrical wires that transport data across the chip at both high data rates and low latencies.

Hardware-software collaboration for dark silicon heterogeneous many-core systems

Multi-Gigabyte On-Chip DRAM Caches for Servers

It is demonstrated that if the cache is organized in pages, then page footprints are highly predictable using well-established code-correlation techniques, and therefore predicting access patterns within a page can eliminate most of the bandwidth overhead and capacity waste that page-based caches suffer from.

Memory Systems and Interconnects for Scale-Out Servers

This thesis seeks to architect on-chip interconnects and memory systems tuned to the requirements of memory-centric scale-out servers, and proposes specialized on-chip interconnects that leverage common traffic characteristics to improve server throughput and energy efficiency.

System and architecture level characterization of big data applications on big and little core server architectures

The characterization results across a wide range of real-world big data applications and various software stacks demonstrate how the choice between big and little core-based servers for energy efficiency is significantly influenced by the size of the data, performance constraints, and the presence of accelerators.

Data sharing in multi-threaded applications and its impact on chip design

This work describes why data-sharing behavior is hard to capture in an analytical model, studies why, and by how much, past attempts have fallen short, and proposes a new methodology that quantifies the reduction in on-chip cache miss rates attributable solely to the presence of data sharing.

Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache

This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors that eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency.
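The core mechanism can be sketched in a few lines (a hypothetical simplification, not the paper's actual design): record which blocks of a page are touched during its residency, keyed by the instruction and offset that first missed on it, and on a later miss from the same code, fetch only that predicted footprint instead of the whole page.

```python
# Minimal footprint-prediction sketch (hypothetical simplification of the
# Footprint Cache idea; the table key, geometry, and training rule are
# illustrative assumptions). A page's observed footprint is recorded on
# eviction, keyed by (PC, first offset); a later miss from the same code
# predicts that footprint and fetches only those blocks.
BLOCKS_PER_PAGE = 32  # assumed geometry: 2 KB pages of 64 B blocks

class FootprintPredictor:
    def __init__(self):
        self.table = {}     # (pc, first_offset) -> set of block offsets
        self.training = {}  # resident page -> (key, blocks touched so far)

    def on_page_miss(self, pc, page, offset):
        """Return the set of blocks to fetch for this newly allocated page."""
        key = (pc, offset)
        self.training[page] = (key, {offset})
        # Predict the recorded footprint, or just the requested block.
        return self.table.get(key, {offset})

    def on_access(self, page, offset):
        if page in self.training:
            self.training[page][1].add(offset)

    def on_page_evict(self, page):
        if page in self.training:
            key, touched = self.training.pop(page)
            self.table[key] = touched  # train on the observed footprint
```

For example, if code at PC 0x40 touches blocks {0, 1, 2} of a page before eviction, the next miss from that PC fetches exactly those three blocks, avoiding the off-chip traffic of a full-page fill while keeping page-granularity tags.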

Learning to Design Accurate Deep Learning Accelerators with Inaccurate Multipliers

This work is the first to demonstrate the effectiveness of custom-tailored approximate circuits in delivering significant chip-level energy savings with zero accuracy loss on a large-scale dataset such as ImageNet.

Power Scaling: the Ultimate Obstacle to 1K-Core Chips

This work explores the design space of physically-constrained multicore chips across technologies and shows that, even with conservative estimates, chips will not scale beyond a few tens of cores due to physical power and off-chip bandwidth constraints, potentially leaving the die real-estate underutilized in future technology generations.

A Power-Efficient High-Throughput 32-Thread SPARC Processor

The first generation of Niagara SPARC processors implements a power-efficient multithreading architecture to achieve high throughput with minimal hardware complexity. The design combines eight cores, each supporting four threads, on a single die.

An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends them with logical policies that allow important data to migrate toward the processor within the same level of the cache.
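A gradual-migration policy of this kind can be sketched as follows (an illustrative D-NUCA-style toy, with assumed latencies and no modeling of the displaced block): banks form a chain ordered by wire distance from the core, and each hit promotes the block one bank closer, so hot data drifts toward the processor.

```python
# Illustrative D-NUCA-style gradual migration (a toy simplification of the
# paper's logical policies; bank and hop latencies are assumed numbers).
# Bank 0 is closest to the core; each hit moves the block one hop closer.
class NUCAChain:
    def __init__(self, n_banks, bank_latency=3, hop_latency=2):
        self.banks = [set() for _ in range(n_banks)]  # bank 0 = closest
        self.bank_latency = bank_latency
        self.hop_latency = hop_latency

    def insert(self, addr):
        self.banks[-1].add(addr)  # new blocks enter the farthest bank

    def access(self, addr):
        """Return access latency, promoting the block one bank on a hit."""
        for i, bank in enumerate(self.banks):
            if addr in bank:
                latency = self.bank_latency + i * self.hop_latency
                if i > 0:  # migrate one hop toward the core
                    bank.remove(addr)
                    self.banks[i - 1].add(addr)
                return latency
        return None  # miss

cache = NUCAChain(n_banks=4)
cache.insert("A")                              # enters bank 3
print([cache.access("A") for _ in range(4)])   # latency falls as "A" migrates
```

Each repeated hit lands one bank closer, so the observed latency decreases monotonically until the block reaches the bank nearest the core.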

3D-Stacked Memory Architectures for Multi-core Processors

Gabriel H. Loh, 2008 International Symposium on Computer Architecture
This work explores more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count, to achieve a 1.75x speedup over previously proposed 3D-DRAM approaches on memory-intensive multi-programmed workloads on a quad-core processor.

Process Variation Tolerant 3T1D-Based Cache Architectures

A range of cache refresh and placement schemes that are sensitive to retention time are proposed, and it is shown that most of the retention time variations can be masked by the microarchitecture when using these schemes.

Chip multiprocessors for server workloads

This thesis proposes Reactive NUCA (R-NUCA), a distributed cache design that reacts to the class of each access to place blocks close to the requesting cores, and finds that heterogeneous multicores hold great promise for improving designs even further.

WiDGET: Wisconsin decoupled grid execution tiles

WiDGET's decoupled design provides flexibility to alter resource allocation for a particular power-performance target while turning off unallocated resources, and enables dynamic customization of different combinations of small and/or powerful cores on a single chip, consuming power in proportion to the delivered performance.

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).

Dark silicon and the end of multicore scaling

The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.

Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

The model considers the less-obvious relationships between conventional processors and a diverse set of U-cores to understand the relative merits of different approaches in the face of technology constraints, and supports speculation about future designs from scaling trends predicted by the ITRS roadmap.