• Corpus ID: 9853321

Adaptive and application dependent runtime guided hardware prefetcher reconfiguration on the IBM POWER7

David Prat, Cristobal Ortega, Marc Casas, Miquel Moretó, Mateo Valero
Hardware data prefetcher engines have been extensively used to reduce the impact of memory latency. However, microprocessors' hardware prefetcher engines do not include any automatic hardware control able to dynamically tune their operation. This missing architectural feature causes systems to operate with prefetchers in a fixed configuration, which in many cases harms performance and energy consumption. In this paper, a piece of software that solves the discussed problem in the context of the…


Data Prefetching on In-order Processors

It is shown that next-line prefetching can achieve nearly top performance with reasonable bandwidth consumption when throttled, whilst neighbor prefetchers have been found to be best overall.

libPRISM: an intelligent adaptation of prefetch and SMT levels

Current microprocessors include several knobs to modify the hardware behavior in order to improve performance under different workload demands. An impractical and time-consuming offline profiling is…

A Brief Overview on Runtime-Aware Architectures

When uniprocessors were the norm, Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP) were widely exploited to increase the number of instructions executed per cycle and to exploit the locality that many programs have.

Intelligent Adaptation of Hardware Knobs for Improving Performance and Power Consumption

Current microprocessors include several knobs to modify the hardware behavior in order to improve performance, power, and energy under different workload demands. An impractical and time-consuming…

Power-constrained aware and latency-aware microarchitectural optimizations in many-core processors

A Tag Cache mechanism is proposed: an on-chip distributed tag caching mechanism with limited space and latency overhead that bypasses the tag read operation in multi-way DRAM caches, thereby reducing hit latency.



Making data prefetch smarter: Adaptive prefetching on POWER7

An adaptive prefetch scheme that dynamically modifies the prefetch settings in order to adapt to the workload requirements is presented and implemented in the context of an existing, commercial processor, namely the IBM POWER7.

Prefetching Using Markov Predictors

The Markov prefetcher acts as an interface between the on-chip and off-chip caches and can be added to existing computer designs. It reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two thirds the memory of a demand-fetch cache organization.

Using a user-level memory thread for correlation prefetching

This paper introduces the idea of using a user-level memory thread (ULMT) for correlation prefetching, and shows that the scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46.

Characterization and dynamic mitigation of intra-application cache interference

  • Carole-Jean Wu, M. Martonosi
  • Computer Science
  • 2011
This paper characterizes the degree to which intra-application interference factors such as page table walks and hardware prefetching influence performance and proposes dynamic management methods to reduce intra-application interference.

Software-Controlled Priority Characterization of POWER5 Processor

It is shown that by prioritizing the right task, it is possible to obtain more than a twofold throughput improvement for synthetic workloads compared to the baseline. The circumstances under which a background thread can be run transparently, without affecting the performance of the foreground thread, are also shown.

Machine learning-based prefetch optimization for data center applications

A tuning framework is developed which attempts to predict the optimal configuration based on hardware performance counters and achieves performance within 1% of the best performance of any single configuration for the same set of applications.

Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems

The proposed solution has two new components: a compiler-guided prefetch filtering mechanism that informs the hardware about which pointer addresses to prefetch, and a coordinated prefetcher throttling mechanism that uses run-time feedback to manage the interference between multiple prefetchers (LDS and stream-based) in a hybrid prefetching system.

Predictable performance in SMT processors: synergy between the OS and SMTs

This paper proposes a novel strategy that enables a two-way interaction between the OS and the SMT processor and allows the OS to run jobs at a certain percentage of their maximum speed, regardless of the workload in which these jobs are executed.

Evaluating stream buffers as a secondary cache replacement

The authors evaluate a memory system design that can be both cost-effective and provide better performance, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers and a main memory.

A prefetch taxonomy

A new, accurate, and complete taxonomy is introduced, called the Prefetch Traffic and Miss Taxonomy (PTMT), for classifying each prefetch by precisely accounting for the difference in traffic and misses it generates, either directly or indirectly.