Stanislav G. Sedukhin

Learn More
This paper presents results of an implementation of code generator for fast general matrix multiply (GEMM) kernels. When a set of parameters is given, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance implementation on GPUs from AMD. Access latencies to GPU global memory is(More)
This paper presents a blocked algorithm for the all-pairs shortest paths (APSP) problem for a hybrid CPU-GPU system. In the blocked APSP algorithm, the amount of data communication between CPU (host) memory and GPU memory is minimized. When a problem size (the number of vertices in a graph) is large enough compared with a blocking factor, the blocked(More)
OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes performance of OpenCL programs also portable on different processors. We have developed an auto-tuning system with(More)
In this paper we propose a novel “Mesh-of-Tori” cellular interconnection network for scalable and massively parallel array processors with frontal plane I/O. The unit (called “m-Cell”) in this topology is the smallest double (for 2D case) or triple (for 3D) folded torus, which forms the basic ‘tile’. The Cells(More)
In this paper, the index space of the (n times n)-matrix multiply-add problem C = C + AmiddotB is represented as a 3D n times n times n torus. All possible time-scheduling functions to activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we mapped the computations(More)
The two-dimensional (2D) forward/inverse discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), discrete Walsh-Hadamard transform (DWHT), play a fundamental role in many practical applications. Due to the separability property, all these transforms can be uniquely defined as a(More)
The All-Pairs Shortest Paths (APSP) problem is a graph problem which can be solved by a three-nested loop program. The Cell Broadband Engine (Cell/B.E.) is a heterogeneous multi-core processor that offers the high single precision floating-point performance. In this paper, a solution of the APSP problem on the Cell/B.E. is presented. To maximize the(More)