# Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems

@article{Lin2008GainingII,
title={Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems},
author={Jiang Lin and Qingda Lu and Xiaoning Ding and Zhao Zhang and Xiaodong Zhang and P. Sadayappan},
journal={2008 IEEE 14th International Symposium on High Performance Computer Architecture},
year={2008},
pages={367-378}
}
• Published 24 October 2008
• Computer Science
• 2008 IEEE 14th International Symposium on High Performance Computer Architecture
Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively…
388 Citations

## Citations

When Partitioning Works and When It Doesn't: An Empirical Study on Cache Way Partitioning
The impact of cache configuration, program memory characteristics, and partitioning variation on the performance gain under partitioning is investigated to help future cache system design and optimization for cloud data centers.
An experimental evaluation of the cache partitioning impact on multicore real-time schedulers
• Computer Science
2013 IEEE 19th International Conference on Embedded and Real-Time Computing Systems and Applications
• 2013
A shared cache partitioning mechanism capable of assigning partitions to internal OS data structures is designed and implemented in a multicore component-based RTOS; the results indicate that a lightweight RTOS does not impact real-time tasks, and that shared cache partitioning behaves differently depending on the scheduler and the task's working set size.
Intra-application cache partitioning
• Computer Science
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
• 2010
This work presents a dynamic, runtime-system-based cache partitioning scheme that partitions the shared cache space among the individual threads of a given application, and shows that speeding up the critical-path thread this way enhances the overall performance of the application in the long term.
Enabling software management for multicore caches with a lightweight hardware support
• Computer Science
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
• 2009
This work proposes affordable, lightweight hardware support that coordinates with OS-based cache management policies. The resulting policies scale to many cores and perform comparably with other proposed hardware solutions at much lower overhead, so they can be easily adopted in commodity processors.
On the Energy Efficiency of Last-Level Cache Partitioning
• Computer Science
• 2012
It is found that for modern, multi-threaded benchmarks there are only a limited number of application pairings where cache partitioning is more effective than naive cache sharing at reducing energy in a race-to-halt scenario. In contexts where a constant stream of background work is available, however, a dynamically adaptive cache partitioning policy is effective at increasing background application throughput while preserving foreground application performance.
Coordinated Cache Management for Predictable Multi-Core Real-Time Systems
• Computer Science
• 2013
This paper proposes a practical OS-level cache management scheme for multi-core real-time systems that provides predictable cache performance, addresses the problems of existing software cache partitioning, and efficiently allocates cache partitions to schedule a given taskset.
Evaluating the Isolation Effect of Cache Partitioning on COTS Multicore Platforms
It is found that cache partitioning does not necessarily eliminate interference in accessing the LLC, even when the concerned task only accesses its dedicated cache partition, and up to 14X slowdown is observed in such a configuration.
Makespan-Optimal Cache Partitioning
• Pan Lai, Rui Fan
• Computer Science
2013 IEEE 21st International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems
• 2013
This work introduces the problem of determining the optimal cache partitioning to minimize the makespan for completing a set of tasks, and presents an algorithm that finds a $(1+\epsilon)$-approximation to the optimal partitioning in $O\left(n \log\frac{n}{\epsilon} \log\frac{n}{\epsilon p}\right)$ time.
Improving Cache Partitioning Algorithms for Pseudo-LRU Policies
• Computer Science
IEICE Trans. Inf. Syst.
• 2013
This work proposes a cache partitioning mechanism for two popular pseudo-LRU policies: Not Recently Used (NRU) and Binary Tree (BT) without the help of true LRU’s stack property, and proposes a profiling logic that applies curve approximation methods to derive the hit curve.
Modeling performance variation due to cache sharing
• Computer Science
2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
• 2013
This paper introduces a method for efficiently investigating the performance variability due to cache contention that can estimate an application pair's performance variation 213× faster, on average, than native execution and can predict application slowdown with an average relative error.

## References

Cooperative cache partitioning for chip multiprocessors
• Computer Science
ICS '07
• 2007
For workloads that can benefit from cache partitioning, CCP achieves up to 60%, and on average 12%, better performance than the exhaustive search of optimal static partitions, and provides the best results on almost all evaluation metrics for different cache sizes.
Fair cache sharing and partitioning in a chip multiprocessor architecture
• Computer Science
Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.
• 2004
It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
CQoS: a framework for enabling QoS in shared caches of CMP platforms
A new cache management framework (CQoS) is presented that recognizes the heterogeneity in memory access streams, introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity, and assigns and enforces priorities to streams based on latency sensitivity, locality degree, and application performance needs.
Dynamic Partitioning of Shared Cache Memory
• Computer Science
The Journal of Supercomputing
• 2004
The results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory and can improve the total IPC significantly over the standard least recently used (LRU) replacement policy.
Managing Shared L2 Caches on Multicore Systems in Software
• Computer Science
• 2007
A mechanism in the operating system that allows for partitioning of the shared L2 cache by guiding the allocation of physical pages provides isolation capabilities that lead to reduced contention and is shown to be effective in reducing cache contention in multiprogrammed SPECcpu2000 and SPECjbb2000 workloads.
QoS policies and architecture for cache/memory in CMP platforms
• Computer Science
SIGMETRICS '07
• 2007
A QoS-enabled memory architecture for CMP platforms that enables more cache resources and memory resources for high priority applications based on guidance from the operating environment and allows dynamic resource reassignment during run-time to further optimize the performance of the high priority application with minimal degradation to low priority.
Virtual private caches
ISCA '07
• 2007
The VPC Arbiter’s fairness policy, which distributes leftover bandwidth, mitigates the effects of cache preemption latencies, thus ensuring threads a high degree of performance isolation, and eliminates negative bandwidth interference, which can improve aggregate throughput and resource utilization.
Communist, Utilitarian, and Capitalist cache policies on CMPs: Caches as a shared resource
• Computer Science
2006 International Conference on Parallel Architectures and Compilation Techniques (PACT)
• 2006
It is found that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
• Computer Science
2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)
• 2006
This paper presents and studies a distributed L2 cache management approach through OS-level page allocation for future many-core processors that can provide a differentiated execution environment to running programs by dynamically controlling data placement and cache sharing degrees.
A new memory monitoring scheme for memory-aware scheduling and partitioning
• Computer Science
Proceedings Eighth International Symposium on High Performance Computer Architecture
• 2002
A scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy is described, which can be used to schedule jobs or to partition the cache to minimize the overall miss-rate.