Main memory and cache performance of Intel Sandy Bridge and AMD Bulldozer

@article{Molka2014MainMA,
  title={Main memory and cache performance of {Intel} {Sandy Bridge} and {AMD} {Bulldozer}},
  author={Daniel Molka and Daniel Hackenberg and Robert Sch{\"o}ne},
  journal={Proceedings of the Workshop on Memory Systems Performance and Correctness},
  year={2014}
}
Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. [...] For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA…

Memory Performance of AMD EPYC Rome and Intel Cascade Lake SP Server Processors

The findings illustrate the complex NUMA properties and how data placement and cache coherence states impact access latencies to local and remote locations. The work also compares theoretical and effective bandwidths for accessing data at the different memory levels and examines main memory bandwidth saturation at reduced core counts.

Understanding the Impact of Memory Access Patterns in Intel Processors

This study investigates the interplay between Intel processors' memory hierarchy and different memory access patterns in applications with the objective of predicting LLC-dynamic random access memory (DRAM) traffic for a given application in given Intel architectures.

Performance Analysis of Complex Shared Memory Systems Abridged Version of Dissertation

The properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed.

Understanding HPC Benchmark Performance on Intel Broadwell and Cascade Lake Processors

This paper investigates two modern Intel x86 server CPU architectures in depth, Broadwell EP and Cascade Lake SP. It highlights relevant hardware configuration settings that can have a decisive impact on code performance and shows how to properly measure on-chip and off-chip data transfer bandwidths.

Detecting Memory-Boundedness with Hardware Performance Counters

This paper evaluates whether hardware performance counters can be used to measure the capacity utilization within the memory hierarchy and to estimate the impact of memory accesses on the achieved performance. It also investigates which stall counters provide good estimates for the number of cycles that are actually spent waiting for the memory hierarchy.

Analysis of Memory System of Tiled Many-Core Processors

Novel models that differentiate between uniform memory access and NUMA on tiled many-core processors from the perspective of the cache system are defined to help OS designers and application programmers fully understand the underlying hardware.

Machine-aware memory allocation and synchronization

This thesis proposes the use of highly tuned multicast trees as a building block for higher-level applications. If the trees are made machine-aware, high-level protocols built on top of them directly benefit from optimizations applied at the lower level.

Contention-aware application performance prediction for disaggregated memory systems

This paper proposes a generic approach to predict the performance degradation due to sharing of disaggregated memory, and shows that the methodology predicts the slowdown in application performance subject to memory contention with an average error and max error.

Performance Analysis of Complex Shared Memory Systems

It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory related performance limitations.

Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems

A set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem is presented, and the coherency states of cache lines are considered in order to analyze the cache coherency protocols and their performance impact.

Measuring Memory Hierarchy Performance of Cache-Coherent Multiprocessors Using Micro Benchmarks

A micro benchmark suite is introduced that measures memory hierarchy performance in light of both uniprocessor optimizations and the contention and coherence effects of multiprocessors.

Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System

This paper presents fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture, based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem.

Memory Performance at Reduced CPU Clock Speeds: An Analysis of Current x86_64 Processors

A detailed analysis of memory bandwidth scaling at different concurrency levels on the latest generation of x86-64 compute nodes shows that memory and last level cache bandwidths at reduced clock speeds strongly depend on the processor microarchitecture.

LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments

This work shows the influence of thread pinning on performance using the well-known OpenMP STREAM triad benchmark, and uses the affinity and hardware counter tools to study the performance of a stencil code specifically optimized to utilize shared caches on multicore chips.

LIKWID: Lightweight Performance Tools

This work shows the influence of thread affinity on performance using the well-known OpenMP STREAM triad benchmark, uses hardware counter tools to study the performance of a stencil code, and shows how to detect bandwidth problems on ccNUMA-based compute nodes.

Test-driving Intel Xeon Phi

The experience indicates that a simple data structure and massive parallelism are critical for Xeon Phi to perform well, and when compiler-driven parallelization and/or vectorization fails, programming Xeon Phi for performance can become very challenging.

Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor

The 12-core AMD Opteron processor, code-named "Magny Cours," combines advances in silicon, packaging, interconnect, cache coherence protocol, and server architecture to increase the compute density

NUMA-aware shared-memory collective communication for MPI

The design and optimizations of MPI collectives for clusters of NUMA nodes are investigated, performance models for collective communication using shared memory are developed, and several algorithms for various collectives are developed.

Prediction models for multi-dimensional power-performance optimization on many cores

A multi-dimensional, online performance predictor is presented, which is deployed to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems and outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS.