Automatic Methods for Hiding Latency in Parallel and Distributed Computation
@article{Andrews1999AutomaticMF, title={Automatic Methods for Hiding Latency in Parallel and Distributed Computation}, author={Matthew Andrews and Frank Thomson Leighton and Panagiotis Takis Metaxas and Lisa Zhang}, journal={SIAM J. Comput.}, year={1999}, volume={29}, pages={615-647} }
In this paper we describe methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks. For example, given any "dataflow" type of algorithm that runs in T steps on an n-node ring with unit link delays, we show how to run the algorithm in O(T) steps on any n-node bounded-degree connected network with average link delay O(1). This is a significant improvement over prior approaches to latency hiding, which require slowdowns proportional to the…
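To make the setting concrete, here is a minimal Python sketch (illustrative only, not the paper's construction) of the kind of "dataflow" computation on an n-node ring with unit link delays that the result speaks about: at each step, every node updates its state from its own value and the value its predecessor sent one step earlier. The update function f and the initial states are hypothetical placeholders.

```python
# Illustrative sketch of a synchronous "dataflow" computation on an
# n-node unidirectional ring with unit link delays (not the paper's
# emulation construction). Each step, node i consumes the value that
# node i-1 produced on the previous step.

def run_ring_dataflow(n, T, f, init):
    """Run T synchronous steps on an n-node ring.

    f    -- hypothetical per-node update: f(own_state, neighbor_msg)
    init -- list of n initial states
    """
    state = list(init)
    for _ in range(T):
        # With unit link delays, a message sent at step t arrives at
        # step t + 1, so one update per step keeps the pipeline full.
        incoming = [state[(i - 1) % n] for i in range(n)]
        state = [f(state[i], incoming[i]) for i in range(n)]
    return state

# Example: each node accumulates its neighbor's values.
print(run_ring_dataflow(n=4, T=3, f=lambda s, m: s + m, init=[1, 0, 0, 0]))
```

The paper's claim is that such a T-step computation can be emulated in O(T) steps on any bounded-degree connected network with average link delay O(1), even when individual links are slow.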
References
On the fault tolerance of some popular bounded-degree networks
- Computer Science · Proceedings of the 33rd Annual Symposium on Foundations of Computer Science
- 1992
The authors analyze the fault-tolerance properties of several bounded-degree networks and show that an N-node butterfly containing N^{1-ε} worst-case faults can emulate a fault-free butterfly of the same size with only constant slowdown, making butterflies the first connected bounded-degree networks known to sustain more than a constant number of worst-case faults.
Asymptotically tight bounds for computing with faulty arrays of processors
- Computer Science · Proceedings of the 31st Annual Symposium on Foundations of Computer Science
- 1990
It is proved that in either scenario low-dimensional arrays are surprisingly fault tolerant, and how to route, sort, and perform systolic algorithms for problems such as matrix multiplication in optimal time on faulty arrays is shown.
A Communication-Time Tradeoff
- Computer Science · SIAM J. Comput.
- 1987
A nontrivial tradeoff between the communication c and time t required to compute a collection of values whose dependencies form a grid: there must be a single path through the grid along which there are d communication steps, where $(d + 1)t = \Omega(n^2)$.
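For intuition, the dependency structure in question is the standard n-by-n grid DAG, where each value depends on its left and upper neighbors. A hedged sketch (the combining function g and the boundary values are hypothetical):

```python
# Sketch of the n-by-n grid DAG behind the tradeoff: value (i, j)
# depends on (i - 1, j) and (i, j - 1). Sequentially this is n^2 work,
# and the critical path has length 2n - 1, so time is Omega(n) no
# matter how much communication is spent.

def compute_grid(n, g, boundary=0):
    """g is a hypothetical combining function g(up, left)."""
    v = [[boundary] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            v[i][j] = g(v[i - 1][j], v[i][j - 1])
    return v[n][n]

# Example: counting monotone lattice paths (g = addition, 1-boundary).
print(compute_grid(3, lambda up, left: up + left, boundary=1))  # 20
```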
Multi-scale self-simulation: a technique for reconfiguring arrays with faults
- Computer Science · STOC '93
- 1993
If faulty nodes are allowed to communicate, but not compute, then an N-node one-dimensional array can tolerate $\log^{O(1)} N$ worst-case faults and still emulate a fault-free array with constant slowdown, and this bound is tight.
Efficient Out-of-Core Algorithms for Linear Relaxation Using Blocking Covers
- Computer Science, Mathematics · J. Comput. Syst. Sci.
- 1997
A general method that can substantially reduce I/O traffic for out-of-core sparse linear relaxation problems, in which each iteration of the algorithm updates the state of every vertex in a graph with a linear combination of the states of its neighbors.
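The computation meant here is a Jacobi-style relaxation sweep. A minimal sketch, assuming a graph given as adjacency lists with per-edge coefficients (all names hypothetical; the paper's blocking-cover machinery for reducing I/O is not shown):

```python
# One iteration of sparse linear relaxation: every vertex's new state
# is a linear combination of its neighbors' current states. This is
# the in-core version; the paper reorganizes many such iterations to
# cut I/O traffic when the graph does not fit in memory.

def relax_step(adj, weight, state):
    """adj[v] lists v's neighbors; weight[(v, u)] is the coefficient."""
    return {
        v: sum(weight[(v, u)] * state[u] for u in adj[v])
        for v in adj
    }

# Example: averaging on a 3-cycle converges toward a uniform state.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
weight = {(v, u): 0.5 for v in adj for u in adj[v]}
state = {0: 1.0, 1: 0.0, 2: 0.0}
for _ in range(10):
    state = relax_step(adj, weight, state)
print(state)
```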
Task Clustering and Scheduling for Distributed Memory Parallel Architectures
- Computer Science · IEEE Trans. Parallel Distributed Syst.
- 1996
A simple greedy algorithm is presented for scheduling parallel programs, represented as directed acyclic task graphs, for execution on distributed-memory parallel architectures; it runs in O(n(n lg n + e)) time, a factor of n faster than the previously best known algorithm for this problem.
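To illustrate the problem being solved (this is a hedged sketch of the scheduling model, not the paper's clustering algorithm): a task may start only after each predecessor finishes, plus a communication delay c whenever the predecessor ran on a different processor. The DAG, costs, and placements below are made-up examples.

```python
# Evaluate the makespan of a given placement of DAG tasks on
# processors, under a uniform communication delay c between tasks
# scheduled on different processors.

def makespan(dag, cost, c, placement):
    """dag: task -> list of predecessors (keys in topological order).
    cost: task -> compute time; placement: task -> processor id."""
    finish = {}
    free = {}  # processor -> time it next becomes available
    for t in dag:
        ready = max(
            (finish[p] + (c if placement[p] != placement[t] else 0)
             for p in dag[t]),
            default=0,
        )
        start = max(ready, free.get(placement[t], 0))
        finish[t] = start + cost[t]
        free[placement[t]] = finish[t]
    return max(finish.values())

# Fork-join example: with a large delay c, clustering all tasks on
# one processor beats splitting the fork across two processors.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
cost = {t: 1 for t in dag}
print(makespan(dag, cost, c=3, placement={t: 0 for t in dag}))               # 4
print(makespan(dag, cost, c=3, placement={"a": 0, "b": 0, "c": 1, "d": 0}))  # 9
```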
Towards an architecture-independent analysis of parallel algorithms
- Computer Science · STOC '88
- 1988
It would be very interesting if the authors could combine stages (3) and (4) into a single step whereby the performance of the algorithm is measured as the makespan of the schedule (elapsed time for computing the last result).
Lower Bounds and Efficient Algorithms for Multiprocessor Scheduling of Directed Acyclic Graphs with Communication Delays
- Computer Science · Inf. Comput.
- 1993
An $n^{\tau+1}$ algorithm is presented for optimally scheduling a dag of n nodes on a multiprocessor when the message-to-instruction ratio parameter is τ; it constructs an optimum schedule that uses at most n processors.
Implementation of a Portable Nested Data-Parallel Language
- Computer Science · J. Parallel Distributed Comput.
- 1994
Initial benchmark results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
Bulk synchronous parallel computing
- Computer Science
- 1995
The router operates independently of the computational and memory elements and masks any substantial latency it may have by pipelining, and a synchronizer provides for bulk synchronization in supersteps of multiple computational steps.
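As a hedged illustration of the superstep structure described above (the function names and the ring example are placeholders, not Valiant's notation): each superstep is local computation, then message exchange through the router, then a barrier, so messages produced in step s are only needed at step s + 1 and the router can pipeline them.

```python
# Minimal sketch of the BSP execution model: local compute phase,
# then delivery of all outgoing messages, then a barrier before the
# next superstep begins.

def bsp_run(states, local_step, supersteps):
    """local_step(pid, state, inbox) -> (new_state, outbox), where
    outbox maps a destination pid to a message."""
    p = len(states)
    inboxes = [{} for _ in range(p)]
    for _ in range(supersteps):
        outboxes = []
        for pid in range(p):  # local computation phase
            states[pid], out = local_step(pid, states[pid], inboxes[pid])
            outboxes.append(out)
        inboxes = [{} for _ in range(p)]  # communication + barrier
        for pid, out in enumerate(outboxes):
            for dest, msg in out.items():
                inboxes[dest][pid] = msg
    return states

# Example: three processors pass their values around a ring, one hop
# per superstep.
def step(pid, state, inbox):
    new_state = sum(inbox.values()) if inbox else state
    return new_state, {(pid + 1) % 3: new_state}

print(bsp_run([10, 20, 30], step, supersteps=3))
```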