Qthreads: An API for programming with millions of lightweight threads
Modern supercomputers like the CRAY X-MP and the IBM 3090 VF achieve their high computing speed by using both vector and parallel hardware. The available multitasking concepts, which support concurrent execution of tasks within a single application, have been designed for different purposes. Owing to its small dispatching overhead, fine-grain parallelism allows parallelization of small units of computation, usually chunks of a DO loop. Larger units of computation, such as arithmetic-intensive subroutines, may be processed independently using coarse-grain parallelism. This paper gives an introduction to the concepts of CRAY macro- and microtasking, the IBM Multitasking Facility (MTF), the ECSEC microtasking prototype, and Parallel FORTRAN. Basic parallelization using fine-grain as well as coarse-grain techniques has been applied to linear algebra kernels, consisting of matrix multiplication and LU decomposition, and to an application program simulating a Czochralski bulk flow in a crystal-growing system. For the matrix multiplication, a parallel speedup of nearly four (on the CRAY X-MP/416) and nearly six (on the IBM 3090-600E) can be achieved, depending on the problem. All other kernels and the application program were limited by serialization overheads arising from memory conflicts (bank and section conflicts on the CRAY, cache coherence on the IBM) and by the overheads of the multitasking primitives. However, with a careful implementation, a parallel efficiency of more than 0.9 can be obtained on both multiprocessors.
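The fine-grain idea described above, splitting the iterations of a loop into chunks that are dispatched to separate processors, can be illustrated with a minimal sketch. It is written in Python rather than the FORTRAN of the original systems and does not reproduce CRAY microtasking, MTF, or Parallel FORTRAN. The names `matmul_rows`, `parallel_matmul`, and `parallel_efficiency` are hypothetical, and efficiency is assumed to mean speedup divided by the number of processors. In CPython, threads do not run this pure-Python arithmetic concurrently, so the sketch shows only the partitioning scheme, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a, b, row_start, row_end):
    """Compute rows [row_start, row_end) of the matrix product a * b."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(row_start, row_end)
    ]


def parallel_matmul(a, b, workers=4):
    """Split the outer (row) loop into contiguous chunks, one task per chunk."""
    n = len(a)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division, at least 1
    bounds = [(start, min(start + chunk, n)) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so rows stay in order.
        parts = list(pool.map(lambda r: matmul_rows(a, b, r[0], r[1]), bounds))
    return [row for part in parts for row in part]


def parallel_efficiency(speedup, processors):
    """Assumed definition: efficiency = speedup / number of processors."""
    return speedup / processors
```

Under the assumed definition, an efficiency above 0.9 corresponds to a speedup above 3.6 on four processors and above 5.4 on six, which is consistent with the "nearly four" and "nearly six" figures quoted for matrix multiplication.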