B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors

@article{Kadjo2014BFetchBP,
  title={B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors},
  author={David Kadjo and Jinchun Kim and Prabal Sharma and Reena Panda and Paul V. Gratz and Daniel A. Jim{\'e}nez},
  journal={2014 47th Annual IEEE/ACM International Symposium on Microarchitecture},
  year={2014},
  pages={623-634}
}
For decades, the primary tools for alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers. Both approaches become more challenging in modern chip-multiprocessor (CMP) designs. Increasing the last-level cache (LLC) size yields diminishing returns in performance per Watt; given VLSI power scaling trends, this approach becomes hard to justify. These trends also constrain the hardware budgets available for prefetchers. Moreover, in the context of CMPs running multiple…
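
The mechanism named in the title, branch-prediction-directed prefetching, can be illustrated with a small sketch: the branch predictor is consulted to speculate several basic blocks down the expected control-flow path, and prefetches are issued for loads recorded along that path, with effective addresses estimated from current register values. The Python sketch below is an illustrative approximation under those assumptions, not the paper's exact microarchitecture; the function names, the per-block load table, and the register-plus-offset address estimate are hypothetical.

# Illustrative sketch of branch-prediction-directed prefetching; the helper
# callbacks, lookahead depth, and address-estimation step are assumptions
# made for clarity, not the exact B-Fetch microarchitecture.

CACHE_LINE = 64  # bytes per cache line

def predicted_path_prefetches(branch_pc, predict, loads_in_block, reg_value,
                              lookahead_blocks=4):
    """Walk the predicted control-flow path ahead of fetch and return the
    cache-line-aligned addresses to prefetch for loads expected on that path.

    predict(pc)        -> (taken, target_pc, fallthrough_pc)   # branch predictor
    loads_in_block(pc) -> [(base_register, offset), ...]       # recorded loads per block
    reg_value(reg)     -> current (possibly stale) value of a base register
    """
    pc = branch_pc
    prefetches = []
    for _ in range(lookahead_blocks):
        taken, target, fallthrough = predict(pc)
        pc = target if taken else fallthrough            # next predicted basic block
        for base_reg, offset in loads_in_block(pc):
            addr = reg_value(base_reg) + offset          # speculative effective address
            prefetches.append(addr - addr % CACHE_LINE)  # align to the cache line
    return prefetches

# Toy usage: one always-taken branch whose target block holds a load [r1 + 0x10].
lines = predicted_path_prefetches(
    branch_pc=0x400,
    predict=lambda pc: (True, pc + 0x40, pc + 4),
    loads_in_block=lambda pc: [("r1", 0x10)],
    reg_value=lambda reg: 0x7000,
    lookahead_blocks=1,
)
print([hex(a) for a in lines])  # ['0x7000']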

Key Quantitative Results

  • Detailed simulation using a cycle-accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads over a baseline system without prefetching. We find that B-Fetch outperforms an existing "best-of-class" light-weight prefetcher under single-threaded and multiprogrammed workloads by 9% on average, with 65% less storage overhead.
  • We show that B-Fetch outperforms the best-in-class light-weight prefetcher, Spatial Memory Streaming (SMS) [23], by 3.5% for single-threaded workloads (8.5% among prefetch-sensitive workloads) and by up to 8.9% for multi-application workloads, with 65% less storage overhead than SMS. We evaluate our technique on a set of SPEC CPU2006 benchmarks for both single-threaded and multiprogrammed workloads and show performance improvements of 23.4% to 31.2% on average versus the baseline (the geometric-mean speedup metric is sketched below).
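
For context, the geometric-mean speedup quoted in these results is the standard way of summarizing per-benchmark speedups over the no-prefetch baseline. A minimal Python sketch follows, using hypothetical per-benchmark numbers rather than data from the paper.

# How a geometric mean speedup is computed (standard metric, not code from
# the paper); the per-benchmark ratios below are hypothetical.
from math import prod

def geomean_speedup(per_benchmark_speedups):
    n = len(per_benchmark_speedups)
    return prod(per_benchmark_speedups) ** (1.0 / n)

speedups = [1.05, 1.60, 1.00, 1.20]  # hypothetical IPC_prefetch / IPC_baseline ratios
gain = (geomean_speedup(speedups) - 1.0) * 100
print(f"{gain:.1f}% geometric mean speedup over baseline")  # 19.2%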

Citations

Publications citing this paper.
Showing 6 of 22 citations.

Bootstrapping: Using SMT Hardware to Improve Single-Thread Performance

  • IEEE Computer Architecture Letters
  • 2018

MTB-Fetch: Multithreading Aware Hardware Prefetching for Chip Multiprocessors

  • IEEE Computer Architecture Letters
  • 2018

Path confidence based lookahead prefetching

  • 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2016

Self-contained, accurate precomputation prefetching

  • 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2015

Bingo Spatial Data Prefetcher

  • 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2019

Multi-Lookahead Offset Prefetching

Mehran Shakerinava, Mohammad Bakhshalipour, Pejman Lotfi-Kamran, Hamid Sarbazi-Azad
  • 2019
