Reconstructing Hardware Transactional Memory for Workload Optimized Systems

Kunal Korgaonkar, Prabhat Jain, Deepak Tomar, Kashyap Garimella, Kamakoti Veezhinathan
Workload optimized systems consisting of a large number of general and special purpose cores, with support for shared memory programming, are slowly becoming prevalent. [...] Overall, we show how knowledge about the workload is extremely useful for making appropriate design choices in the workload optimized HTM.


Hardware Acceleration of Software Transactional Memory
RTM is presented, in which hardware is used to accelerate a TM implementation controlled fundamentally by software; it allows for a wide variety of policies for contention management, deadlock and livelock avoidance, data granularity, nesting, and virtualization.
OS Support for Virtualizing Hardware Transactional Memory
It is found that aborting a transaction is generally faster than virtualizing it, and hence preferable in some cases, and it is shown that virtualizing transactions can be necessary for system stability and to support code that voluntarily context switches.
Software transactional memory
STM is used to provide a general highly concurrent method for translating sequential object implementations to non-blocking ones based on implementing a k-word compare&swap STM-transaction, and outperforms Herlihy’s translation method for sufficiently large numbers of processors.
An OpenMP Compiler for Efficient Use of Distributed Scratchpad Memory in MPSoCs
This paper proposes a programming framework that combines the ease of use of OpenMP with simple, yet powerful, language extensions to trigger array data partitioning and exploits profiled information on array access count to automatically generate data allocation schemes optimized for locality of references.
Hybrid transactional memory
Using a simulated multiprocessor with HTM support, the viability of the HyTM approach is demonstrated: it can provide performance and scalability approaching that of an unbounded HTM implementation, without the need to support all transactions with complicated HTM support.
Design and implementation of software-managed caches for multicores with local memory
This paper proposes a new software-managed cache design, called extended set-index cache (ESC), which has the benefits of both set-associative and fully associative caches and is applicable to all cores with access to both local and global memory in a multicore architecture.
Adaptive insertion policies for managing shared caches
This paper proposes Thread-Aware Dynamic Insertion Policy (TADIP), an adaptive insertion policy that can take into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.
CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator
This paper presents the use of binary instrumentation as an alternative to execution-driven and trace-driven simulation methodologies to explore the design space of a CMP memory hierarchy and presents CMP$im to characterize cache performance of single-threaded, multi-threaded, and multi-programmed workloads at the speeds of 4-10 MIPS.
Partitioning and allocation of scratch-pad memory for priority-based preemptive multi-task systems
The three proposed methods, i.e., the spatial, temporal, and hybrid methods, bring about effective usage of the scratch-pad memory space, achieve energy reduction in the instruction memory subsystems, and are applicable to a real-time environment.
COMIC: A coherent shared memory interface for cell BE
A memory consistency model and a programming model for COMIC are proposed, in which the management of synchronization and coherence is centralized in the PPE, which provides the program with an illusion of a globally shared memory.