Spatially-tiled architectures, such as Coarse-Grained Reconfigurable Arrays (CGRAs), are powerful architectures for accelerating applications in the digital-signal processing, embedded, and scientific computing domains. In contrast to Field-Programmable Gate Arrays (FPGAs), another common accelerator, they typically time-multiplex their processing elements…
Random forest classification is a well-known machine learning technique that generates classifiers in the form of an ensemble ("forest") of decision trees. The classification of an input sample is determined by the majority vote of the ensemble. Traditional random forest classifiers can be highly effective, but classification using a random forest…
We present DI-MMAP, a high-performance runtime that memory-maps large external data sets into an application's address space and shows significantly better performance than the Linux mmap system call. Our implementation is particularly effective when used with high-performance, locally attached Flash arrays on highly concurrent, latency-tolerant…
In this paper we present SPR, a new architecture-adaptive mapping tool for use with Coarse-Grained Reconfigurable Architectures (CGRAs). It combines a VLIW-style scheduler and FPGA-style placement and pipelined routing algorithms with novel mechanisms for integrating and adapting the algorithms to CGRAs. We introduce a latency padding technique that…
Coarse-grained reconfigurable architectures (CGRAs) have the potential to offer performance approaching an ASIC with the flexibility, within an application domain, similar to a digital signal processor. In the past, coarse-grained reconfigurable architectures have been encumbered by challenging programming models that are either too far removed from the…
Programmable spatial fabrics, such as FPGAs, can provide some of the performance and efficiency benefits of custom hardware while retaining the low cost and flexibility of reprogrammable architectures. However, these fine-grained parallel architectures still have not been as widely adopted as many believe they could be for computationally intensive…
Recent advances in semiconductor technology allow hardware engineers to integrate complex modules such as processors, peripheral devices, and memory in a single System-on-a-Chip (SoC), where testability, power minimization and management, and area minimization are the important system-level considerations. Performance, both in terms of processing speed…
We present a work-in-progress snapshot of learning with a 15-billion-parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield…