Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems

@article{Hackenberg2009ComparingCA,
  title={Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems},
  author={Daniel Hackenberg and Daniel Molka and Wolfgang E. Nagel},
  journal={2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)},
  year={2009},
  pages={413--422}
}
Across a broad range of applications, multicore technology is the most important factor that drives today's microprocessor performance improvements. Closely coupled is a growing complexity of the memory subsystems with several cache levels that need to be exploited efficiently to gain optimal application performance. Many important implementation details of these memory subsystems are undocumented. We therefore present a set of sophisticated benchmarks for latency and bandwidth measurements to… 

Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture

This work has developed sophisticated benchmarks that allow for in-depth investigations with full memory location and coherence state control of the Intel Haswell-EP micro-architecture, including important memory latency and bandwidth characteristics as well as the cost of core-to-core transfers.

Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

This work tackles the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details and builds upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem.

wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

This paper presents a comprehensive study evaluating cache architecture design on three representative ARMv8 multi-cores (Phytium 2000+, ThunderX2, and Kunpeng 920) and develops wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy during core-to-core communication.

Memory Performance and SPEC OpenMP Scalability on Quad-Socket x86_64 Systems

This paper uses low-level microbenchmarks to compare two state-of-the-art quad-socket systems with x86_64 processors from AMD and Intel, investigates the performance of the application-based OpenMP benchmark suite SPEC OMPM2001, and shows how scalability correlates with the previously determined characteristics of the memory hierarchy.

Performance Analysis of Complex Shared Memory Systems Abridged Version of Dissertation

The properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed.

Cache Line Aware Algorithm Design for Cache-Coherent Architectures

This work designs a simple interface for cache line aware optimization, a translation methodology, and a full performance model that exposes the block-based design of caches to middleware designers and uses mathematical optimization techniques to tune synchronization algorithms to the microarchitectures.

Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead

This work presents a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem, and describes two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention.

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach

This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores through the use of a joint hardware/software approach and demonstrates the effectiveness of this joint approach.

Studies on the Impact of Cache Configuration on Multicore Processor

The effect of the interconnect on the performance of multicore processors is analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors is proposed; results show that the proposed INoC significantly reduces execution time compared with MPIN.

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

  • Sabela Ramos, T. Hoefler
  • Computer Science
    2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • 2017
This work provides an extensive model of all memory configuration options for Xeon Phi KNL and demonstrates how it can be used to automatically derive new close-to-optimal algorithms for various communication functions yielding improvements 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively.

References

The AMD Opteron Northbridge Architecture

To increase performance while operating within a fixed power budget, the AMD Opteron processor integrates multiple x86-64 cores with a router and memory controller. AMD's experience with building

Computer Architecture, Fifth Edition: A Quantitative Approach

The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.

An Integrated Quad-Core Opteron Processor

  • J. Dorsey, S. Searles, R. Kumar
  • Computer Science
    2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers
  • 2007
An integrated quad-core x86 processor is implemented in a 65nm 11M SOI CMOS process based on an enhanced Opteron™ core, and the SRAM cache designs target process variation considerations and future process scalability.

Computer Architecture: A Quantitative Approach

This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important

Investigating Cache Parameters of x86 Family Processors

Experiments are presented that investigate detailed parameters of the memory architecture, focusing on such information that is typically not available elsewhere, to remedy the lack of information.

SPEC CPU2006 published results page: http://www.spec.org/cpu2006/results

Intel 64 and IA-32 Architectures Optimization Reference Manual

  • 2009

Memory bandwidth and machine balance in current high performance computers

  • IEEE Computer Society Technical Committee on Computer Architecture Newsletter
  • 1995