Tiptop: Hardware Performance Counters for the Masses

  • Erven Rohou
  • Published 10 September 2012
  • Computer Science
  • 2012 41st International Conference on Parallel Processing Workshops
Hardware performance monitoring counters have recently received a lot of attention. They have been used by diverse communities to understand and improve the quality of computing systems: for example, architects use them to extract application characteristics and propose new hardware mechanisms, compiler writers study how generated code behaves on particular hardware, software developers identify critical regions of their applications and evaluate design choices to select the best performing… 

Sequential Performance: Raising Awareness of the Gory Details
Branch prediction and the performance of interpreters — Don't trust folklore
It is shown that the accuracy of indirect branch prediction is no longer critical for interpreters; the characteristics of these interpreters are compared, and the reasons why the indirect branch is less important than before are analyzed.
SoK: The Challenges, Pitfalls, and Perils of Using Hardware Performance Counters for Security
A year-long effort to study the best practices for obtaining accurate measurements of events using hardware performance counters (HPCs), understand the challenges and pitfalls of using HPCs in various settings, explore ways to obtain consistent and accurate measurements across different settings and architectures, and empirically evaluate how failure to account for various subtleties in the use of HPCs can undermine the effectiveness of security applications.
PADRONE: a Platform for Online Profiling, Analysis, and Optimization
This work describes the infrastructure of PADRONE, shows that its profiling overhead is minimal, and argues that PADRONE fits an empty design point in the ecosystem of dynamic binary tools.
A Topology-Aware Performance Monitoring Tool for Shared Resource Management in Multicore Systems
A new way to analyze performance by crossing the roads of performance monitoring and topology-aware placement is introduced, and an extension of the Hardware Locality software (hwloc) is proposed that enhances its graphical capabilities.
Infrastructures and Compilation Strategies for the Performance of Computing Systems
This document presents our main contributions to the field of compilation, and more generally to the quest of performance of computing systems. It is structured by type of execution environment,
Contention Aware Scheduler with Accurate Memory Bandwidth Measurement for Predictable Multicore Software
This paper uses hardware performance counters to continuously track the memory bandwidth consumed by different applications executing in parallel, and proposes a contention-aware scheduler for predictable multicore software that reduces contention and improves system performance.
An empirical high level performance model for future many-cores
A more refined but still tractable high-level empirical performance model for multi-threaded applications, the Serial/Parallel Scaling (SPS) model, is proposed to study the scalability and performance of applications in the many-core era.
Monitoring computer systems for crypto mining threat detection
The tool and approach developed, aimed at capturing both general and detailed system parameters, with a specific focus on detecting malware that mines virtual currencies, are described.
Parallelism and distribution for very large scale content-based image retrieval
This thesis describes a high-dimensional indexing technique called eCP, which builds on an existing indexing scheme that is main memory oriented, and proposes multi-threaded algorithms for both building and searching, harnessing the power of multi-core processors.


Can hardware performance counters be trusted?
The behavior of the SPEC benchmarks with both dynamic binary instrumentation (DBI) tools and hardware counters is explored, and it is found that minor changes to the experimental setup reduce observed errors to less than 0.002% for all benchmarks.
A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters
The PAPI project is developing end-user tools for dynamically selecting and displaying hardware counter performance data, and has proposed a standard set of hardware events and a standard cross-platform library interface to the underlying counter hardware.
Investigating the impact of code generation on performance characteristics of integer programs
This study uses quad-core AMD Opteron processors and the SPEC CPU 2006 Integer benchmark to evaluate how sensitive various micro-architecture performance metrics are to the choice among three top-performing compilers.
Demand-driven software race detection using hardware performance counters
This paper observes cache events that are indicative of data sharing between threads by taking advantage of hardware available on modern commercial microprocessors, and uses these events to build a race detector that is enabled only when inter-thread data sharing is likely to be occurring.
Rapidly Selecting Good Compiler Optimizations using Performance Counters
This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings by learning a model off-line which can then be used to determine good settings for any new program.
Performance analysis of idle programs
The design and methodology for WAIT, a tool to diagnose the root cause of idle time in server applications, are presented, along with a simple expert system based on an extensible set of declarative rules.
MAO — An extensible micro-architectural optimizer
MAO, an extensible micro-architectural assembly to assembly optimizer, is presented, which seeks to address this problem for x86/64 processors and can be integrated into any compiler that emits assembly code, or used standalone.
Pin: building customized program analysis tools with dynamic instrumentation
The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation, and to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Phase tracking and prediction
This paper presents a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales, and can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
Making Sense of Performance Counter Measurements on Supercomputing Applications
Primary performance bottlenecks unique to multicore chips are described, the roles that several commonly used measurement tools can most effectively play in performance optimization are sketched, and a novel high-level multicore optimization technique that increased performance by up to 35% is described.