Snehasish Kumar

State-of-the-art multiprocessor cache hierarchies propagate the use of a fixed granularity in the cache organization to the design of the coherence protocol. Unfortunately, the fixed granularity, generally chosen to match average spatial locality across a range of applications, not only results in wasted bandwidth to serve an individual thread's access…
The fixed geometries of current cache designs do not adapt to the working set requirements of modern applications, causing significant inefficiency. The short block lifetimes and moderate spatial locality exhibited by many applications result in only a few words in the block being touched prior to eviction. Unused words occupy between 17% and 80% of a 64K L1…
Chip designers have shown increasing interest in integrating specialized fixed-function coprocessors into multicore designs to improve energy efficiency. Recent work in academia [11, 37] and industry [16] has sought to enable more fine-grain offloading at the granularity of functions and loops. The sequential program now needs to migrate across the chip…
Software data structures are a critical aspect of emerging data-centric applications, which makes it imperative to improve the energy efficiency of data delivery. We propose SQRL, a hardware accelerator that integrates with the last-level cache (LLC) and enables energy-efficient iterative computation on data structures. SQRL integrates a data…
Recent research [3,37,38] has proposed compute accelerators to address the energy efficiency challenge. While these compute accelerators specialize and improve the compute efficiency, they have tended to rely on address-based load/store memory interfaces that closely resemble a traditional processor core. The address-based load/store interface is…
The end of Dennard Scaling has necessitated research into the adoption of specialized architectures for offloading specific code regions in applications. Recent works in accelerator architectures have chosen diverse workloads and even diverse code regions (within the same workload) to highlight the efficacy of specific accelerator architectures. However…
Technology constraints have increasingly led to the adoption of specialized coprocessors, i.e., hardware accelerators. The first challenge that computer architects encounter is identifying "what to specialize in the program". We demonstrate that this requires precise enumeration of program paths based on dynamic program behavior. We hypothesize…
The exponential increase in multimedia services used by mobile users requires seamless connectivity with cost-effective Quality of Service (QoS) provisioning. To provide such on-demand QoS, the network needs to utilize the radio channels among the Mobile Hosts (MHs) effectively. We use a vector genetic algorithm (VGA) for temporal imploration of sharable…
A central tenet behind accelerators is to partition a program execution into regions with different behavior (e.g., SIMD, Irregular, Compute-Intensive) and then use behavior-specialized architectures [1] for each region. It is unclear whether the gains in efficiency arise from recognizing that a simpler microarchitecture is sufficient for the acceleratable…
There exists a multitude of execution models available today for a developer to target. The choices vary from general-purpose processors to fixed-function hardware accelerators, with a large number of variations in between. There is a growing demand to assess the potential benefits of porting or rewriting an application to a target architecture in order to…