While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the…
In a hardware transactional memory system with lazy versioning and lazy conflict detection, the process of transaction commit can emerge as a bottleneck. This is especially true for a large-scale distributed memory system where multiple transactions may attempt to commit simultaneously and coordination is required before allowing commits to proceed in…
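For readers unfamiliar with lazy/lazy designs, the sketch below is a simplified, purely illustrative model (not the system studied in this work): each transaction buffers its writes privately and checks for conflicts only at commit time, so concurrently committing transactions must serialize on a commit arbiter, which is exactly where the bottleneck appears.

```cpp
// Simplified sketch of lazy versioning + lazy conflict detection.
// All names are illustrative; real HTM systems do this in hardware.
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <set>

std::map<uint64_t, int> memory;             // shared memory image
std::mutex commit_token;                    // serializes commits (the potential bottleneck)
std::map<uint64_t, uint64_t> commit_stamp;  // last commit time per address
uint64_t global_time = 0;

struct Transaction {
    uint64_t start_time;
    std::set<uint64_t> read_set;
    std::map<uint64_t, int> write_buf;      // lazy versioning: writes stay private

    int read(uint64_t addr) {
        read_set.insert(addr);
        auto it = write_buf.find(addr);
        return it != write_buf.end() ? it->second : memory[addr];
    }
    void write(uint64_t addr, int val) { write_buf[addr] = val; }

    bool commit() {
        std::lock_guard<std::mutex> lk(commit_token);   // commits serialize here
        // Lazy conflict detection: abort if anything we read was committed
        // by another transaction after we started.
        for (uint64_t addr : read_set)
            if (commit_stamp.count(addr) && commit_stamp[addr] > start_time)
                return false;                           // conflict -> abort
        uint64_t now = ++global_time;
        for (auto& [addr, val] : write_buf) {           // publish the write set
            memory[addr] = val;
            commit_stamp[addr] = now;
        }
        return true;
    }
};

int main() {
    Transaction t{global_time};
    t.write(0x10, 42);
    std::cout << (t.commit() ? "committed" : "aborted") << "\n";
}
```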
USIMM, the Utah SImulated Memory Module, is a DRAM main memory system simulator that is being released for use in the Memory Scheduling Championship (MSC), organized in conjunction with ISCA-39. MSC is part of the JILP Workshops on Computer Architecture Competitions (JWAC). This report describes the simulation infrastructure and how it will be used within…
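As a point of reference for the kind of policy a championship entry supplies, the fragment below sketches the simplest possible scheduler, first-come first-served over a per-channel request queue. The types and function names are illustrative stand-ins and are not USIMM's actual interface.

```cpp
// Illustrative FCFS memory scheduling policy; all types and names are
// placeholders, not USIMM's real data structures or callbacks.
#include <cstdint>
#include <deque>
#include <iostream>

struct Request {
    uint64_t addr;
    bool is_write;
    uint64_t arrival_cycle;
};

struct Channel {
    std::deque<Request> queue;   // pending requests, oldest first
    uint64_t busy_until = 0;     // cycle when the channel frees up
};

// Called once per memory cycle per channel: issue the oldest request
// whose timing constraints are satisfied (reduced here to channel busy time).
void schedule_fcfs(Channel& ch, uint64_t cycle, uint64_t access_latency) {
    if (ch.queue.empty() || cycle < ch.busy_until) return;
    Request req = ch.queue.front();
    ch.queue.pop_front();
    ch.busy_until = cycle + access_latency;
    std::cout << (req.is_write ? "WRITE " : "READ  ")
              << std::hex << req.addr << std::dec
              << " issued at cycle " << cycle << "\n";
}

int main() {
    Channel ch;
    ch.queue.push_back({0x1000, false, 0});
    ch.queue.push_back({0x2000, true, 1});
    for (uint64_t cycle = 0; cycle < 10; ++cycle)
        schedule_fcfs(ch, cycle, 4);
}
```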
Snooping and directory-based coherence protocols have become the de facto standard in chip multi-processors, but neither design is without drawbacks. Snooping protocols are not scalable, while directory protocols incur directory storage overhead and frequent indirections, and are more prone to design bugs. In this paper, we propose a novel coherence protocol…
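To make the directory storage overhead concrete, a full-map directory keeps one sharer presence bit per core for every tracked block, so the per-block cost grows linearly with core count. The entry below is an illustrative example of that baseline cost, not the protocol proposed in the paper.

```cpp
// Illustrative full-map directory entry; not the protocol proposed here.
#include <bitset>
#include <cstdint>
#include <iostream>

constexpr int kNumCores = 64;

enum class DirState : uint8_t { Invalid, Shared, Modified };

struct DirEntry {
    DirState state = DirState::Invalid;
    std::bitset<kNumCores> sharers;  // one presence bit per core: 64 bits per block here
};

int main() {
    DirEntry e;
    e.state = DirState::Shared;
    e.sharers.set(3);
    e.sharers.set(17);
    // For a 64-byte block, 64 sharer bits alone are ~12.5% storage overhead,
    // before counting state bits and tags; this is the cost snooping avoids.
    std::cout << "sharers: " << e.sharers.count()
              << ", sharer+state bits per 64B block: " << kNumCores + 2 << "\n";
}
```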
Memory latency is a major factor in limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources…
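One common way to make a prefetcher resource-aware is to track its accuracy (useful prefetches over issued prefetches) and scale back the prefetch degree when accuracy drops. The sketch below illustrates that general idea with assumed counter names and thresholds; it is not the specific mechanism of this paper.

```cpp
// Illustrative accuracy-based prefetch throttling; counters and thresholds are assumptions.
#include <cstdint>
#include <iostream>

struct PrefetchThrottle {
    uint64_t issued = 0;   // prefetches sent to memory
    uint64_t useful = 0;   // prefetched lines later hit by demand accesses
    int degree = 4;        // how many lines ahead to prefetch

    void on_prefetch_issued() { ++issued; }
    void on_prefetch_hit()    { ++useful; }

    // Re-evaluate aggressiveness once enough samples have accumulated.
    void adjust() {
        if (issued < 64) return;                          // not enough samples yet
        double accuracy = double(useful) / double(issued);
        if (accuracy > 0.75 && degree < 8) ++degree;      // accurate: ramp up
        else if (accuracy < 0.40 && degree > 1) --degree; // wasteful: back off
        issued = useful = 0;                              // start a new epoch
    }
};

int main() {
    PrefetchThrottle t;
    for (int i = 0; i < 64; ++i) {
        t.on_prefetch_issued();
        if (i % 3 == 0) t.on_prefetch_hit();              // ~34% accuracy
    }
    t.adjust();
    std::cout << "new prefetch degree: " << t.degree << "\n";  // degree drops from 4 to 3
}
```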
Prior work in hardware prefetching has focused mostly on either predicting regular streams with uniform strides, or predicting irregular access patterns at the cost of large hardware structures. This paper introduces the Variable Length Delta Prefetcher (VLDP), which builds up delta histories between successive cache line misses within physical pages, and…
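The sketch below shows per-page delta tracking in the spirit of VLDP, heavily reduced to a single delta table; the actual design uses multiple delta-history prediction tables keyed on variable-length histories. It only illustrates how deltas between successive misses within a page are formed and reused for prediction.

```cpp
// Highly simplified per-page delta prefetching in the spirit of VLDP.
// The real design uses multiple delta-history prediction tables; this
// sketch keeps only one level to show how deltas are formed and reused.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

constexpr uint64_t kPageShift = 12;  // 4 KB pages
constexpr uint64_t kLineShift = 6;   // 64 B cache lines

struct PageEntry {
    int64_t last_line = -1;   // last missing line offset within the page
    int64_t last_delta = 0;   // delta between the previous two misses
};

struct SimpleDeltaPrefetcher {
    std::unordered_map<uint64_t, PageEntry> pages;    // per-page state
    std::unordered_map<int64_t, int64_t> delta_pred;  // last delta -> next delta

    // Called on each cache miss; returns predicted addresses to prefetch.
    std::vector<uint64_t> on_miss(uint64_t addr) {
        uint64_t page = addr >> kPageShift;
        int64_t line = (addr >> kLineShift) & ((1 << (kPageShift - kLineShift)) - 1);
        PageEntry& e = pages[page];
        std::vector<uint64_t> prefetches;

        if (e.last_line >= 0) {
            int64_t delta = line - e.last_line;
            if (e.last_delta != 0)
                delta_pred[e.last_delta] = delta;         // learn the delta sequence
            auto it = delta_pred.find(delta);
            if (it != delta_pred.end()) {                 // predict the next miss
                int64_t next_line = line + it->second;
                if (next_line >= 0 && next_line < (1 << (kPageShift - kLineShift)))
                    prefetches.push_back((page << kPageShift) |
                                         (uint64_t(next_line) << kLineShift));
            }
            e.last_delta = delta;
        }
        e.last_line = line;
        return prefetches;
    }
};

int main() {
    SimpleDeltaPrefetcher p;
    // Misses with alternating deltas +1, +2: after a short training period,
    // the prefetcher predicts the next line from the current delta.
    for (uint64_t line : {0, 1, 3, 4, 6, 7}) {
        auto pf = p.on_miss(line << kLineShift);
        for (uint64_t a : pf) std::cout << "prefetch line " << (a >> kLineShift) << "\n";
    }
}
```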
Future large-scale multi-cores will likely be best suited for use within high-performance computing (HPC) domains. A large fraction of HPC workloads employ the message-passing interface (MPI), yet multi-cores continue to be optimized for shared-memory workloads. In this position paper, we put forth the design of a unique chip that is optimized for MPI…
Multiple virtual machines (VMs) are typically co-scheduled on cloud servers. Each VM experiences different latencies when accessing shared resources, based on contention from other VMs. This introduces timing channels between VMs that an untrusted VM can exploit to launch attacks. This paper focuses on trying to eliminate the timing channel in the…
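The timing channel itself is easy to picture. The toy model below (purely illustrative, not this paper's attack or defense) shows how one VM's memory intensity modulates the latency another VM observes, turning shared-resource contention into a covert bit; eliminating the channel means making the observed latency independent of the co-scheduled VM's behavior.

```cpp
// Toy model of a contention-based timing channel; purely illustrative.
#include <cstdint>
#include <iostream>

// Observed latency = uncontended access latency + queueing delay caused by the
// co-scheduled VM's traffic. The "sender" VM modulates its traffic to encode bits.
uint64_t observed_latency(bool sender_active) {
    const uint64_t base = 50;        // ns, uncontended memory access (assumed value)
    const uint64_t contention = 40;  // ns, added when the other VM floods memory
    return base + (sender_active ? contention : 0);
}

int main() {
    const uint64_t threshold = 70;   // receiver classifies its latency samples
    bool secret_bits[] = {true, false, true, true};
    for (bool bit : secret_bits) {
        uint64_t lat = observed_latency(bit);   // receiver times its own access
        bool decoded = lat > threshold;         // slow access -> sender active -> '1'
        std::cout << "latency " << lat << " ns -> decoded bit " << decoded << "\n";
    }
    // A defense must make 'lat' independent of the other VM's behavior,
    // e.g., by partitioning resources or fixing per-VM memory timing.
}
```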
A large fraction of MapReduce execution time is spent processing the Map phase, and a large fraction of Map phase execution time is spent sorting the intermediate key-value pairs generated by the Map function. Sorting accelerators can achieve high performance and low power because they lack the overheads of sorting implementations on general purpose…
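To see what such an accelerator replaces, the snippet below is a software baseline (illustrative only, not the accelerated design): the Map phase emits intermediate key-value pairs that must be sorted by key before they are grouped and handed to the reducers.

```cpp
// Software baseline of the Map-phase sort that a hardware sorter would replace.
#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

int main() {
    // Intermediate key-value pairs emitted by a word-count style Map function.
    std::vector<KV> intermediate = {
        {"memory", 1}, {"cache", 1}, {"memory", 1}, {"dram", 1}, {"cache", 1}};

    // This sort by key is the step that dominates Map-phase time at scale; a
    // sorting accelerator performs it without the instruction-fetch, branch,
    // and cache overheads of a general-purpose core.
    std::sort(intermediate.begin(), intermediate.end(),
              [](const KV& a, const KV& b) { return a.first < b.first; });

    for (const auto& [k, v] : intermediate)
        std::cout << k << " " << v << "\n";
}
```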