• Corpus ID: 232424665

NUMAscope: Capturing and Visualizing Hardware Metrics on Large ccNUMA Systems

Daniel J. Blueman, Foivos S. Zakkak, Christos Kotselidis
Cache-coherent non-uniform memory access (ccNUMA) systems enable parallel applications to scale up to thousands of cores and many terabytes of main memory. However, since remote accesses come at an increased cost, extra measures are needed to scale applications to high core counts and to process far greater amounts of data than a typical server can hold. In a similar manner to how applications are optimized to improve cache utilization, applications also need to be optimized to improve… 

Memphis: Finding and fixing NUMA-related performance problems on multi-core platforms

  • Collin McCurdy, J. Vetter
  • Computer Science
    2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS)
  • 2010
It is demonstrated that NUMA can indeed be a significant problem for scientific applications, showing that it can mean the difference between an application scaling perfectly and failing to scale at all.

A tool to analyze the performance of multithreaded programs on NUMA architectures

The design and implementation of extensions to HPCToolkit that support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains are described, and their utility is demonstrated through case studies in which these capabilities are used to diagnose NUMA bottlenecks in four multithreaded applications.

MemProf: A Memory Profiler for NUMA Multicore Systems

MemProf is presented, a profiler that allows programmers to choose and implement efficient application-level optimizations for NUMA systems and builds temporal flows of interactions between threads and objects, which help programmers understand why and which memory objects are accessed remotely.

ScaAnalyzer: a tool to identify memory scalability bottlenecks in parallel programs

  • Xu Liu, Bo Wu
  • Computer Science
    SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2015
ScaAnalyzer identifies scalability bottlenecks caused by poor memory access behaviors and provides high-level, detailed optimization guidance to programmers that yields significant improvement in scalability.

Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems

A set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem is presented, and the coherency state of cache lines is considered in order to analyze the cache coherency protocols and their performance impact.

numap: A portable library for low-level memory profiling

  • M. Selva, L. Morel, K. Marquet
  • Computer Science
    2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS)
  • 2016
This work presents numap, a library dedicated to profiling the memory subsystem of modern multi-core architectures. It is portable across many micro-architectures and comes with a clean application programming interface that makes it easy to build profiling tools on top of it.

Collecting Performance Data with PAPI-C

The evolution of PAPI into Component PAPI, or PAPI-C, in which multiple sources of performance data can be measured simultaneously via a common software interface, is discussed, and the challenges to hardware performance measurement in existing multi-core architectures are explored.

An application-centric ccNUMA memory profiler

  • U. Prestor, A. Davis
  • Computer Science
    Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538)
  • 2001
A new memory profiling tool called snperf is presented, which provides high-fidelity information about application memory performance issues for SGI Origin systems and helps clarify the role of memory and task allocation in cache-coherent shared-memory multiprocessors.

A Top-Down method for performance analysis and counters architecture

  • Ahmad Yasin
  • Computer Science
    2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • 2014
This analysis method is low-cost and already featured in in-production systems - it requires just eight simple new performance events to be added to a traditional PMU, and accounts for granular bottlenecks in super-scalar cores, missed by earlier approaches.

In-Network Coherence Filtering: Snoopy coherence without broadcasts

  • Niket Agarwal, L. Peh, N. Jha
  • Computer Science
    2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2009
This work proposes embedding small in-network coherence filters inside on-chip routers. The filters dynamically track sharing patterns among cores and use them to filter away redundant snoop requests traveling towards unshared cores.