Niladrish Chatterjee

DRAM vendors have traditionally optimized the cost-per-bit metric, often making design decisions that incur energy penalties. A prime example is the overfetch feature in DRAM, where a single request activates thousands of bit-lines in many DRAM chips, only to return a single cache line to the CPU. The focus on cost-per-bit is questionable in modern-day…
USIMM, the Utah SImulated Memory Module, is a DRAM main memory system simulator that is being released for use in the Memory Scheduling Championship (MSC), organized in conjunction with ISCA-39. MSC is part of the JILP Workshops on Computer Architecture Competitions (JWAC). This report describes the simulation infrastructure and how it will be used within…
Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small…
Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes.
The DRAM main memory system in modern servers is largely homogeneous. In recent years, DRAM manufacturers have produced chips with vastly differing latency and energy characteristics. This provides the opportunity to build a heterogeneous main memory system where different parts of the address space can yield different latencies and energy per access. The…
Many of the pins on a modern chip are used for power delivery. If fewer pins were used to supply the same current, the wires and pins used for power delivery would have to carry larger currents over longer distances. This results in an "IR-drop" problem, where some of the voltage is dropped across the long resistive wires making up the power delivery…
Memory controllers in modern GPUs aggressively reorder requests for high bandwidth usage, often interleaving requests from different warps. This leads to high variance in the latency of different requests issued by the threads of a warp. Since a warp in a SIMT architecture can proceed only when all of its memory requests are returned by memory, such latency…
Nearly every synchronous digital circuit today is designed with timing margins. These timing margins allow the circuit to behave correctly in spite of parameter variations, voltage noise, temperature fluctuations, etc. Given that the memory system is a critical bottleneck in several workloads, this paper attempts to safely push memory performance to its…
Future large-scale multi-cores will likely be best suited for use within high-performance computing (HPC) domains. A large fraction of HPC workloads employ the message-passing interface (MPI), yet multi-cores continue to be optimized for shared-memory workloads. In this position paper, we put forth the design of a unique chip that is optimized for MPI…