Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems
@article{Hackenberg2009ComparingCA, title={Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems}, author={Daniel Hackenberg and Daniel Molka and Wolfgang E. Nagel}, journal={2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)}, year={2009}, pages={413-422} }
Across a broad range of applications, multicore technology is the most important factor that drives today's microprocessor performance improvements. Closely coupled is a growing complexity of the memory subsystems with several cache levels that need to be exploited efficiently to gain optimal application performance. Many important implementation details of these memory subsystems are undocumented. We therefore present a set of sophisticated benchmarks for latency and bandwidth measurements to…
137 Citations
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
- Computer Science, 2015 44th International Conference on Parallel Processing
- 2015
This work has developed sophisticated benchmarks that allow for in-depth investigations with full memory location and coherence state control of the Intel Haswell-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers.
Main Memory and Cache Performance of Intel Sandy Bridge and AMD Bulldozer
- Computer Science, MSPC@PLDI
- 2014
This work tackles the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details and builds upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem.
wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems
- Computer Science
- 2021
This paper presents a comprehensive study to evaluate cache architecture design on three representative ARMv8 multi-cores, Phytium 2000+, ThunderX2, and Kunpeng 920 and develops the wrBench, a micro-benchmark suite to measure the realized latency and bandwidth of caches at different memory hierarchies when performing core-to-core communications.
Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems
- Computer Science, ICA3PP
- 2011
This paper uses low-level microbenchmarks to compare two state-of-the-art quad-socket systems with x86_64 processors from AMD and Intel, investigates the performance of the application-based OpenMP benchmark suite SPEC OMPM2001, and shows how scalability correlates with the previously determined characteristics of the memory hierarchy.
Performance Analysis of Complex Shared Memory Systems Abridged Version of Dissertation
- Computer Science
- 2016
The properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed.
Cache Line Aware Algorithm Design for Cache-Coherent Architectures
- Computer Science, IEEE Transactions on Parallel and Distributed Systems
- 2016
This work designs a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers and uses mathematical optimization techniques to tune synchronization algorithms to the microarchitectures.
Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead
- Computer Science, ISMM '11
- 2011
This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.
Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach
- Computer Science, IEEE Transactions on Parallel and Distributed Systems
- 2018
This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores through the use of a joint hardware/software approach and demonstrates the effectiveness of this joint approach.
Studies on the Impact of Cache Configuration on Multicore Processor
- Computer Science
- 2014
The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. Results show that the proposed INoC significantly reduces execution time compared with MPIN.
Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
- Computer Science, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2017
This work provides an extensive model of all memory configuration options for Xeon Phi KNL and demonstrates how it can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively.
References
Memory hierarchy performance measurement of commercial dual-core desktop processors
- Computer Science, J. Syst. Archit.
- 2008
The AMD Opteron Northbridge Architecture
- Computer Science, IEEE Micro
- 2007
To increase performance while operating within a fixed power budget, the AMD Opteron processor integrates multiple x86-64 cores with a router and memory controller. AMD's experience with building…
BenchIT - Performance Measurements and Comparison for Scientific Applications
- Computer Science, PARCO
- 2003
Computer Architecture, Fifth Edition: A Quantitative Approach
- Computer Science
- 2011
The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.
An Integrated Quad-Core Opteron Processor
- Computer Science, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers
- 2007
An integrated quad-core x86 processor is implemented in a 65nm 11M SOI CMOS process based on an enhanced Opteron™ core, and the SRAM cache designs target process variation considerations and future process scalability.
Computer Architecture: A Quantitative Approach
- Computer Science
- 1969
This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important…
Investigating Cache Parameters of x86 Family Processors
- Computer Science, SPEC Benchmark Workshop
- 2009
Experiments are presented that investigate detailed parameters of the memory architecture, focusing on such information that is typically not available elsewhere, to remedy the lack of information.
SPEC CPU2006 published results page: http://www.spec.org/cpu2006/results
Intel 64 and IA-32 Architectures Optimization Reference Manual
- 2009
Memory bandwidth and machine balance in current high performance computers
- IEEE Computer Society Technical Committee on Computer Architecture Newsletter
- 1995