The introduction of multi-core processors has renewed interest in programming models that can efficiently exploit general-purpose parallelism. Data-Flow is one such model, and it has demonstrated significant potential in the past. However, it is generally associated with functional styles of programming, which do not deal well with shared mutable state.
Low-Density Parity-Check (LDPC) codes are powerful error-correcting codes used today in communication standards such as DVB-S2 and WiMAX to transmit data over noisy channels with a high error probability. LDPC decoding is computationally demanding and requires irregular memory accesses, which makes it a suitable target for parallelization. The recent introduction…
The number of computational units integrated in a single processor is rapidly increasing. This suggests that applications will need efficient and effective ways to exploit parallelism in order to achieve the performance offered by large-scale multicore processors. The efficient parallelization of applications relies on the programming and execution…
Our goal in this work is to study ways of exploiting the parallelism offered by vectorization on accelerators with very wide vector units. To this end, we implement two kernels derived from the Wilson Dslash operator and investigate several data layout techniques for increasing the scalability of lattice QCD scientific kernels suitable for the Intel…
The current trend in processor design is to increase the number of cores so as to achieve a desired performance. While placing a large number of cores on a chip appears feasible in terms of hardware, developing software that can exploit that parallelism is one of the biggest challenges. In this paper we propose a Data-Flow-based…
Exploiting the recently introduced very wide vector units of the Xeon Phi coprocessor can potentially increase the scalability of scientific applications. Using lattice QCD compute kernels, the authors find that the performance achieved with the Xeon Phi coprocessor's wide vector units is similar to GPGPU performance after appropriate code refactoring…
Decision Support System (DSS) workloads, which process large data sets, are known to be among the most time-consuming database workloads. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we explore the benefits of using future many-core architectures, specifically on-chip clustered many-core architectures.