Dark silicon and the end of multicore scaling

@article{Esmaeilzadeh2011DarkSA,
  title={Dark silicon and the end of multicore scaling},
  author={Hadi Esmaeilzadeh and Emily R. Blem and Ren{\'e}e St. Amant and Karthikeyan Sankaralingam and Doug Burger},
  journal={2011 38th Annual International Symposium on Computer Architecture (ISCA)},
  year={2011},
  pages={365-376}
}
Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. [...] For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance.
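The Pareto-frontier step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' actual methodology: it assumes each measured processor has already been reduced to a (performance, power) pair, and it keeps only the designs for which no other design delivers at least the same performance at lower power (the area/performance frontier follows the same recipe with area in place of power). The example numbers are invented.

```python
def pareto_frontier(points):
    """Return the (performance, power) points that are Pareto-optimal:
    no other point offers the same or higher performance at lower power."""
    # Sort by performance descending, breaking ties by power ascending.
    ordered = sorted(points, key=lambda p: (-p[0], p[1]))
    frontier = []
    best_power = float("inf")
    for perf, power in ordered:
        if power < best_power:  # strictly cheaper than every faster design
            frontier.append((perf, power))
            best_power = power
    return sorted(frontier)  # ascending performance, for readability


# Invented measurements: (relative performance, power in watts).
chips = [(10, 20), (12, 35), (8, 15), (12, 30), (15, 90), (9, 40)]
print(pareto_frontier(chips))  # [(8, 15), (10, 20), (12, 30), (15, 90)]
```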
Citations

Voltage scaling and dark silicon in symmetric multicore processors
TLDR
This paper proposes high-performance and energy-efficient multicore architectures for a variety of parallelism levels and memory intensities in workloads, and uses dynamic voltage and frequency scaling in Amdahl's law to decrease the amount of dark silicon and improve performance and performance per watt/joule.
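The entry above, like several follow-ups to the dark-silicon paper, folds DVFS into Amdahl's law. The sketch below is a rough, hypothetical model rather than that paper's exact formulation: it assumes a symmetric multicore under a fixed power budget, with every active core slowed to the same relative frequency s, per-core dynamic power scaling roughly as s^3 (frequency and voltage lowered together), and serial work slowed by 1/s.

```python
def amdahl_dvfs_speedup(f_parallel, n_cores, freq_scale):
    """Amdahl's-law speedup for a symmetric multicore whose active cores
    all run at `freq_scale` times the nominal frequency.

    f_parallel : fraction of the work that parallelizes perfectly
    n_cores    : number of active cores
    freq_scale : per-core frequency relative to nominal (1.0 = nominal)
    """
    serial = (1.0 - f_parallel) / freq_scale
    parallel = f_parallel / (n_cores * freq_scale)
    return 1.0 / (serial + parallel)


def cores_within_budget(power_budget_w, core_power_nominal_w, freq_scale):
    """Cores that fit in the budget if dynamic power scales roughly as
    freq_scale**3 (frequency and voltage lowered together). Crude model."""
    return int(power_budget_w // (core_power_nominal_w * freq_scale ** 3))


# Example: 100 W budget, 10 W per core at nominal frequency, 90% parallel code.
for s in (1.0, 0.8, 0.6):
    n = cores_within_budget(100, 10, s)
    print(f"freq x{s}: {n:2d} cores, speedup {amdahl_dvfs_speedup(0.9, n, s):.2f}")
```

Lowering voltage and frequency lets more cores fit under the budget (less dark silicon), but the serial fraction eventually erases the gain, which is the trade-off these papers explore.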
Effect of voltage scaling on symmetric multicore's speed-up
TLDR
This paper proposes the best symmetric multicore model for a variety of parallelism levels in workloads under different power budget constraints and, by using dynamic voltage and frequency scaling in Amdahl's law, attempts to decrease the amount of dark silicon and improve total performance.
The TURBO Diaries: Application-controlled Frequency Scaling Explained
TLDR
A general-purpose library that allows selective control of DVFS from user space to accelerate multi-threaded applications and expose the potential of heterogeneous frequencies is proposed.
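The paper above exposes per-core frequency control to applications. Independently of that library's actual API (which is not shown here), one common Linux mechanism for user-space DVFS is the cpufreq sysfs interface; the sketch below assumes root privileges and a driver that exposes the `userspace` governor (e.g. acpi-cpufreq; intel_pstate typically does not).

```python
import pathlib

CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{knob}"


def available_frequencies_khz(cpu):
    """List the frequencies (kHz) the cpufreq driver exposes for this core."""
    path = pathlib.Path(CPUFREQ.format(cpu=cpu, knob="scaling_available_frequencies"))
    return [int(f) for f in path.read_text().split()]


def set_frequency_khz(cpu, khz):
    """Pin one core to a fixed frequency via cpufreq (requires root and
    the 'userspace' governor)."""
    pathlib.Path(CPUFREQ.format(cpu=cpu, knob="scaling_governor")).write_text("userspace")
    pathlib.Path(CPUFREQ.format(cpu=cpu, knob="scaling_setspeed")).write_text(str(khz))


# Example: boost core 0 for a sequential phase, then drop it for a parallel one.
# freqs = available_frequencies_khz(0)
# set_frequency_khz(0, max(freqs))
# ... run the sequential bottleneck ...
# set_frequency_khz(0, min(freqs))
```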
PARALLELISM-ENERGY PERFORMANCE ANALYSIS OF MULTICORE SYSTEMS
TLDR
It is shown that balancing system resources is key to reducing the energy usage of an application, and this is achieved by improving the hardware performance, rather than by lowering the power usage.
Looking back and looking forward: power, performance, and upheaval
TLDR
The findings, including diverse application power profiles, suggest that future applications and system software will need to participate in power optimization and management, and that software and hardware researchers need access to real measurements to optimize for power and energy.
Metrics for Early-Stage Modeling of Many-Accelerator Architectures
TLDR
It is found that the architecture selected by the communication-aware metric shows improved performance over architectures selected based on execution time and Pollack's rule, as the latter do not account for speedup being limited by communication.
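Pollack's rule, referenced above, says single-core performance grows roughly with the square root of core area. The sketch below pairs it with a toy, Amdahl-style communication term (a stand-in, not the paper's communication-aware metric) to show why area- and time-based selection alone can mislead: the communicating fraction of the run does not shrink as more, smaller cores are added.

```python
import math


def pollack_core_perf(core_area, base_area=1.0):
    """Pollack's rule: relative single-core performance ~ sqrt(relative area)."""
    return math.sqrt(core_area / base_area)


def comm_aware_speedup(core_area, comm_fraction, total_area):
    """Hypothetical estimate (not the paper's metric): the communicating
    fraction of the run does not speed up, the rest scales with cores."""
    n_cores = int(total_area // core_area)  # area-limited core count
    compute = (1.0 - comm_fraction) / (n_cores * pollack_core_perf(core_area))
    return 1.0 / (comm_fraction + compute)


# A 64-unit die split into big, medium, or small cores; 10% communication time.
for area in (16.0, 4.0, 1.0):
    print(f"core area {area:4.1f}: speedup {comm_aware_speedup(area, 0.1, 64.0):.1f}")
```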
Dark vs. Dim Silicon and Near-Threshold Computing
TLDR
An analytical framework called Lumos is developed to quantify the performance limits of many-core, heterogeneous systems operating at near-threshold voltage and shows that dim cores do indeed boost throughput, even in the presence of process variations.
An Investigation of Power-Performance Aware Accelerator/Core Allocation Challenges in Dark Silicon Heterogeneous Systems
TLDR
This paper identifies and highlights some of the critical challenges posed by dark silicon, and lists initial research efforts to tackle these issues.
Achieving Superscalar Performance without Superscalar Overheads - A Dataflow Compiler IR for Custom Computing
TLDR
This paper addresses the problem of improving sequential performance in custom hardware by switching from a statically scheduled to a dynamically scheduled (dataflow) execution model, and developing a new compiler IR for high-level synthesis that enables aggressive exposition of ILP even in the presence of complex control flow.
Collective memory transfers for multi-core chips
TLDR
This paper proposes collective memory scheduling (CMS) that uses simple software and inexpensive hardware to identify collective transfers and guarantee that loads and stores arrive in memory address order at the memory controller.
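The core idea above is that the memory controller sees a collective transfer as one address-ordered stream instead of interleaved per-core request streams. The sketch below is a purely illustrative software reordering of a request batch; the paper's CMS achieves the guarantee with hardware support rather than by sorting in software.

```python
from dataclasses import dataclass


@dataclass
class MemRequest:
    core_id: int
    address: int
    is_store: bool


def schedule_collective(requests):
    """Emit the outstanding requests of one collective transfer in memory
    address order, so the controller sees a single streaming pattern."""
    return sorted(requests, key=lambda r: r.address)


# Four cores each loading a strided slice of a shared buffer (64 B lines).
reqs = [MemRequest(core, 0x1000 + 64 * (4 * i + core), False)
        for core in range(4) for i in range(2)]
for r in schedule_collective(reqs):
    print(f"core {r.core_id} -> 0x{r.address:x}")
```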

References

SHOWING 1-10 OF 39 REFERENCES
Power Scaling: the Ultimate Obstacle to 1K-Core Chips
TLDR
This work explores the design space of physically-constrained multicore chips across technologies and shows that, even with conservative estimates, chips will not scale beyond a few tens of cores due to physical power and off-chip bandwidth constraints, potentially leaving the die real-estate underutilized in future technology generations.
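A back-of-the-envelope version of the argument above: the usable core count is capped by whichever of the chip power budget and the off-chip bandwidth runs out first. The sketch below is illustrative only, not the paper's design-space exploration, and the numbers are invented.

```python
def max_cores(power_budget_w, core_power_w, offchip_bw_gbs, per_core_bw_gbs):
    """Upper bound on usable cores under fixed chip power and fixed
    off-chip memory bandwidth (whichever constraint binds first)."""
    by_power = power_budget_w // core_power_w
    by_bandwidth = offchip_bw_gbs // per_core_bw_gbs
    return int(min(by_power, by_bandwidth))


# 130 W budget, 3 W per core, 200 GB/s off-chip, 5 GB/s demanded per core:
print(max_cores(130, 3, 200, 5))  # 43 by power vs. 40 by bandwidth -> 40
```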
The Cost of Uncore in Throughput-Oriented Many-Core Processors
Traditional techniques for achieving performance, such as extracting more instruction-level parallelism or increasing clock frequencies, are losing their effectiveness due to the power wall.
Understanding PARSEC performance on contemporary CMPs
TLDR
This work finds new Chip Multiprocessor (CMP) designs to be largely compute-bound, and thus limited by number of cores, micro-architectural resources, and cache-to-cache transfers, rather than by off-chip memory or system bus bandwidth.
Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?
TLDR
The model considers the less-obvious relationships between conventional processors and a diverse set of U-cores to understand the relative merits of different approaches in the face of technology constraints, and supports speculation about future designs based on scaling trends predicted by the ITRS road map.
Looking back on the language and hardware revolutions: measured power, performance, and scaling
TLDR
This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology, revealing the extent of some known and previously unobserved hardware and software trends.
Conservation cores: reducing the energy of mature computations
TLDR
A toolchain for automatically synthesizing c-cores from application source code is presented and it is demonstrated that they can significantly reduce energy and energy-delay for a wide range of applications, and patching can extend the useful lifetime of individual c-cores to match that of conventional processors.
Analyzing CUDA workloads using a detailed GPU simulator
TLDR
Two observations are made: for the applications the authors study, performance is more sensitive to interconnect bisection bandwidth than to latency, and, for some applications, running fewer threads concurrently than on-chip resources would otherwise allow can improve performance by reducing contention in the memory system.
Over-provisioned multicore systems
Technology scaling has provided system designers with an exploding transistor budget, far more than what was available when the core principles behind many existing commodity microprocessors were [...]
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
TLDR
This paper discusses optimization techniques for both CPU and GPU, analyzes what architecture features contributed to performance differences between the two architectures, and recommends a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.