Spatially tiled architectures, such as Coarse-Grained Reconfigurable Arrays (CGRAs), are powerful platforms for accelerating applications in the digital-signal-processing, embedded, and scientific-computing domains. In contrast to Field-Programmable Gate Arrays (FPGAs), another common accelerator, they typically time-multiplex their processing elements.
In this paper we present SPR, a new architecture-adaptive mapping tool for use with Coarse-Grained Reconfigurable Architectures (CGRAs). It combines a VLIW-style scheduler and FPGA-style placement and pipelined routing algorithms with novel mechanisms for integrating and adapting those algorithms to CGRAs. We also introduce a latency padding technique.
Random forest classification is a well-known machine learning technique that generates classifiers in the form of an ensemble (a "forest") of decision trees. The classification of an input sample is determined by a majority vote of the ensemble. Traditional random forest classifiers can be highly effective.
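To make the voting step concrete, here is a minimal, self-contained C sketch of majority-vote classification over an ensemble. The stub trees, class count, and forest_classify helper are illustrative assumptions, not the classifier from the paper; real trees would walk learned decision nodes.

```c
#include <stdio.h>

#define NUM_CLASSES 3

/* Each tree maps a feature vector to a class label. These stubs just
 * threshold one feature; a trained tree would traverse split nodes. */
typedef int (*tree_fn)(const double *sample);

static int tree_a(const double *s) { return s[0] > 0.5 ? 1 : 0; }
static int tree_b(const double *s) { return s[1] > 0.3 ? 2 : 1; }
static int tree_c(const double *s) { return s[0] + s[1] > 1.0 ? 2 : 0; }

/* Majority vote: tally each tree's prediction, return the modal label. */
static int forest_classify(tree_fn *trees, int n, const double *sample)
{
    int votes[NUM_CLASSES] = {0};
    for (int t = 0; t < n; t++)
        votes[trees[t](sample)]++;

    int best = 0;
    for (int c = 1; c < NUM_CLASSES; c++)
        if (votes[c] > votes[best])
            best = c;
    return best;
}

int main(void)
{
    tree_fn forest[] = { tree_a, tree_b, tree_c };
    double sample[] = { 0.7, 0.2 };
    printf("predicted class: %d\n", forest_classify(forest, 3, sample));
    return 0;
}
```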
Coarse-grained reconfigurable architectures (CGRAs) have the potential to offer performance approaching that of an ASIC with flexibility, within an application domain, comparable to that of a digital signal processor. In the past, CGRAs have been encumbered by challenging programming models.
Programmable spatial fabrics, such as FPGAs, can provide some of the performance and efficiency benefits of custom hardware while retaining the low cost and flexibility of reprogrammable architectures. However, these fine-grained parallel architectures still have not been as widely adopted as many believe they could be for computationally intensive applications.
We present DI-MMAP, a high-performance runtime that memory-maps large external data sets into an application's address space and shows significantly better performance than the Linux mmap system call. Our implementation is particularly effective when used with high-performance, locally attached Flash arrays on highly concurrent, latency-tolerant applications.
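For reference, here is a minimal sketch of the standard Linux mmap pattern that DI-MMAP is compared against; the byte-sum traversal is an illustrative stand-in for a real data-intensive access pattern, and DI-MMAP's own runtime is not shown.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only into the address space; pages are
     * faulted in from storage on first access. */
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch the mapping, e.g. checksum every byte. */
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("byte sum: %lu\n", sum);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```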
We present a work-in-progress snapshot of learning with a 15-billion-parameter deep learning network on HPC architectures, applied to the largest publicly available natural image and video dataset released to date. Recent advances in unsupervised deep neural networks suggest that scaling up such networks in both model size and training-dataset size can yield significant gains.
Coprocessor accelerator architectures such as FPGAs and GPUs are increasingly used in embedded systems because of their high performance on the computation-heavy inner loops of a variety of applications. However, current languages and compilers for these architectures make it challenging to efficiently implement kernels that have complex, input-dependent control flow.
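As a toy illustration of what complex, input-dependent control flow means here (my example, not one from the paper): a kernel whose loop trip count depends on the data itself, which resists the fixed, statically scheduled pipelining these accelerators favor.

```c
#include <stdio.h>

/* Input-dependent control flow: the number of iterations depends on the
 * value of n (a Collatz-style trajectory), so the trip count is unknown
 * at compile time and cannot be mapped to a fixed-depth pipeline. */
static int trajectory_length(unsigned n)
{
    int steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}

int main(void)
{
    for (unsigned n = 1; n <= 8; n++)
        printf("n=%u steps=%d\n", n, trajectory_length(n));
    return 0;
}
```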
Efficient storage in spatial processors is increasingly important as such devices get larger and support more concurrent operations. Unlike sequential processors, which rely heavily on centralized storage such as register files and embedded memories, spatial processors require many small storage structures to efficiently manage distributed values.