Stanislav G. Sedukhin

Learn More
The two-dimensional (2D) forward/inverse discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), discrete Hartley transform (DHT), discrete Walsh-Hadamard transform (DWHT), play a fundamental role in many practical applications. Due to the separability property, all these transforms can be uniquely defined as a(More)
OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes performance of OpenCL programs also portable on different processors. We have developed an auto-tuning system with(More)
In this paper we propose a novel “Mesh-of-Tori” cellular interconnection network for scalable and massively parallel array processors with frontal plane I/O. The unit (called “m-Cell”) in this topology is the smallest double (for 2D case) or triple (for 3D) folded torus, which forms the basic ‘tile’. The Cells(More)
This paper presents a blocked algorithm for the all-pairs shortest paths (APSP) problem for a hybrid CPU-GPU system. In the blocked APSP algorithm, the amount of data communication between CPU (host) memory and GPU memory is minimized. When a problem size (the number of vertices in a graph) is large enough compared with a blocking factor, the blocked(More)
SUMMARY The All-Pairs Shortest Paths (APSP) problem is a graph problem which can be solved by a three-nested loop program. The Cell Broadband Engine (Cell/B.E.) is a heterogeneous multi-core processor that offers the high single precision floating-point performance. In this paper, a solution of the APSP problem on the Cell/B.E. is presented. To maximize the(More)
This paper presents results of an implementation of code generator for fast general matrix multiply (GEMM) kernels. When a set of parameters is given, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance implementation on GPUs from AMD. Access latencies to GPU global memory is(More)
In this paper, the index space of the (n times n)-matrix multiply-add problem C = C + AmiddotB is represented as a 3D n times n times n torus. All possible time-scheduling functions to activate the computation and data rolling inside the 3D torus index space are determined. To maximize efficiency when solving a single problem, we mapped the computations(More)