Efficient and Reusable Implementation of Fine-Grain Multithreading and Garbage Collection on Distributed-Memory Parallel Computers

Abstract

This thesis studies efficient runtime systems for parallelism management (multithreading) and memory management (garbage collection) on large-scale distributed-memory parallel computers. Both are fundamental primitives for implementing high-level parallel programming languages that support dynamic parallelism and dynamic data structures. A distinguishing feature of the developed multithreading system, StackThreads, is that it tolerates a large number of threads on a single CPU while allowing direct reuse of existing sequential C compilers. In fact, it can turn any standard C procedure call into an asynchronous one. With such a runtime system, the compiler of a high-level parallel programming language can fork a new thread simply by emitting a C procedure call to the corresponding C function. A thread can block its execution by calling a library procedure that saves the stack frame of the thread and unwinds the stack. To resume a thread, StackThreads provides another runtime routine that rebuilds the saved stack frame on top of the current stack and restarts the computation from the blocking point. All these operations are implemented using information already present in standard C stack frames, without requiring a frame format customized for a particular programming language. Experiments demonstrate that the potential performance problems of this approach are not significant in practice, even on distributed-memory computers in which each remote access causes a thread switch. The developed garbage collection system is a simple mark & sweep collector that stops the user program while collecting. We show the viability of such collectors on a large-scale (up to 256 processors) distributed-memory computer (Fujitsu AP1000+). Under a reasonable heap expansion policy, garbage collection occupies at most 15% of the application time (excluding idle time). 
More importantly, the overhead of garbage collection on parallel machines was, except for one application, in the ballpark of that on a single processor, indicating that garbage collection is at least as scalable as the applications. Another observation from the experiments is that independent local collection is a dangerous strategy that severely degrades the performance of synchronous applications (by up to 60%), contradicting the previous belief that garbage collections should be done as independently as possible. This is because an independent local collection makes the collecting processor "unresponsive," so that processors waiting for a reply from it sit idle. For asynchronous applications with plenty of intra-node parallelism, independent collections perform better than synchronous collections, but the difference is small, at least in our experiments. A more advanced strategy that adaptively selects the right strategy is also implemented and shown to be effective, though it is not significantly better than a simpler "always-synchronous" approach under the current experimental conditions. On top of these runtime systems, a new programming language, ABCL/f, is designed and implemented. Several non-trivial applications written by the author and others are used for the experiments. Both the sequential performance and the speedup of the applications are reported.
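
As a rough illustration of the stop-the-world mark & sweep scheme mentioned above, the following is a minimal single-process sketch in C. It is not the thesis's distributed collector: the fixed-size heap, the two-pointer object layout, and the names `gc_alloc`, `gc_add_root`, and `gc_collect` are all assumptions made for this example, and nothing here models remote references or the synchronization between processors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define HEAP_OBJECTS 8
#define MAX_ROOTS 4

/* Assumed object layout: every object holds at most two references. */
typedef struct Object {
    int in_use;             /* slot currently holds an allocated object */
    int marked;             /* set during the mark phase if reachable   */
    struct Object *left;
    struct Object *right;
} Object;

static Object heap[HEAP_OBJECTS];
static Object *roots[MAX_ROOTS];
static size_t n_roots = 0;

/* Allocate the first free slot; returns NULL when the heap is full. */
Object *gc_alloc(void) {
    for (size_t i = 0; i < HEAP_OBJECTS; i++) {
        if (!heap[i].in_use) {
            heap[i].in_use = 1;
            heap[i].marked = 0;
            heap[i].left = NULL;
            heap[i].right = NULL;
            return &heap[i];
        }
    }
    return NULL;
}

/* Register an object as a root (stands in for stacks and globals). */
void gc_add_root(Object *obj) {
    assert(n_roots < MAX_ROOTS);
    roots[n_roots++] = obj;
}

/* Mark phase: recursively mark everything reachable from obj.
   The marked check also terminates the recursion on cycles. */
static void mark(Object *obj) {
    if (obj == NULL || obj->marked) {
        return;
    }
    obj->marked = 1;
    mark(obj->left);
    mark(obj->right);
}

/* Stop-the-world collection: the caller does nothing else while this
   runs. Returns the number of objects reclaimed. */
size_t gc_collect(void) {
    size_t freed = 0;
    for (size_t i = 0; i < n_roots; i++) {
        mark(roots[i]);
    }
    /* Sweep phase: free unmarked objects, clear marks for the next cycle. */
    for (size_t i = 0; i < HEAP_OBJECTS; i++) {
        if (heap[i].in_use && !heap[i].marked) {
            heap[i].in_use = 0;
            freed++;
        }
        heap[i].marked = 0;
    }
    return freed;
}
```

In this sketch, allocating three objects, linking the second from the first, and registering only the first as a root leaves the third unreachable, so a call to `gc_collect` reclaims exactly that one slot and a later `gc_alloc` can reuse it.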

Cite this paper

@inproceedings{Taura1997EfficientAR, title={Efficient and Reusable Implementation of Fine-Grain Multithreading and Garbage Collection on Distributed-Memory Parallel Computers}, author={Kenjiro Taura}, year={1997} }