Chip Multiprocessors (CMPs) allow different applications to execute concurrently on a single chip. When applications with differing memory demands compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance if the aggregate working set is larger than the shared cache. In such cases, shared...
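The LRU degradation described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a single access stream, a fully associative cache, and a hypothetical helper named `lru_hits`; it is not the policy or methodology of the paper. It shows that a cyclic working set just one line larger than the cache gets no hits at all under LRU.

```python
from collections import OrderedDict


def lru_hits(accesses, capacity):
    """Count hits for an access sequence on a fully associative LRU cache.

    Hypothetical helper for illustration; not taken from the paper.
    """
    cache = OrderedDict()  # keys are cache lines, ordered from LRU to MRU
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits


capacity = 4
fits = [i % 4 for i in range(40)]    # working set equals the cache size
thrash = [i % 5 for i in range(40)]  # working set is one line too large

print(lru_hits(fits, capacity))    # only the 4 cold misses, the rest hit
print(lru_hits(thrash, capacity))  # every access misses
```

In the second stream, each line is evicted just before it is reused, which is the thrashing behavior that motivates alternative insertion and replacement policies.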
The shared last-level caches (LLCs) in CMPs play an important role in improving application performance and reducing off-chip memory bandwidth requirements. To use LLCs more efficiently, recent research has shown that changing the re-reference prediction on cache insertions and cache hits can significantly improve cache performance. A fundamental...
It is widely acknowledged that efficient modular multiplication is key to high-performance implementations of public-key cryptography, be it classical RSA, Diffie-Hellman, or (hyper-)elliptic curve algorithms. In the past decade, practitioners have relied mainly on two popular methods: Montgomery Multiplication and regular long-integer multiplication in...
This paper introduces the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) ordering heuristics for parallel greedy graph-coloring algorithms, which are inspired by the largest-degree-first (LF) and smallest-degree-last (SL) serial heuristics, respectively. We show that although LF and SL, in practice, generate colorings with relatively...
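For orientation, the serial largest-degree-first (LF) heuristic mentioned above can be sketched as follows. This is a minimal illustration of the classical serial heuristic only, assuming an adjacency-list dictionary and a hypothetical function name `greedy_color_lf`; the paper's parallel LLF and SLL orderings are not shown here.

```python
def greedy_color_lf(adj):
    """Greedy coloring that visits vertices in decreasing order of degree.

    `adj` maps each vertex to a list of its neighbors (undirected graph).
    Hypothetical helper for illustration; not the paper's parallel algorithm.
    """
    # Largest-degree-first ordering; ties keep their original order.
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color = {}
    for v in order:
        # Colors already taken by previously colored neighbors.
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:  # pick the smallest color not used by a neighbor
            c += 1
        color[v] = c
    return color


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 0.
graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
coloring = greedy_color_lf(graph)
print(coloring)
```

Each vertex receives the smallest color not used by its already-colored neighbors, so the result is always a valid coloring; the ordering only affects how many colors are used.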
A data-graph computation, popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi, is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex’s...
This paper addresses the problem of partitioning a cache among multiple concurrent threads in the presence of hardware prefetching. Cache replacement designed to preserve temporal locality (e.g., LRU) will allocate cache resources in proportion to the miss rate of each competing thread, irrespective of whether the cache space will be utilized [Qureshi...
Hardware transactions offer a performance advantage over software implementations by harnessing the power of existing cache coherence mechanisms, which are already fast, automatic, and parallel. The source of superior performance, however, is also the root of their weakness: existing implementations of hardware transactions abort when the working set exceeds...
Cache replacement techniques such as LRU and MRU, which are currently deployed across multi-core architecture platforms, try to classify elements purely based on the number of hits they receive during their stay in the cache. In multi-threaded applications, data can be shared by multiple threads (which might run on the same core or across different...
High memory contention is generally agreed to be a worst-case scenario for concurrent data structures. Significant research effort has gone into investigating designs that minimize contention, and several programming techniques have been proposed to mitigate its effects. However, there are currently few architectural mechanisms to allow...