Automatic Methods for Hiding Latency in Parallel and Distributed Computation

Matthew Andrews, Frank Thomson Leighton, Panagiotis Takis Metaxas, Lisa Zhang
SIAM J. Comput.
In this paper we describe methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks. For example, given any "dataflow" type of algorithm that runs in T steps on an n-node ring with unit link delays, we show how to run the algorithm in O(T) steps on any n-node bounded-degree connected network with average link delay O(1). This is a significant improvement over prior approaches to latency hiding, which require slowdowns proportional to the… 
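The "dataflow" model referenced above can be illustrated with a minimal sketch (not the paper's construction): on an n-node ring, each node's state at step t+1 is computed from its own state and its two ring neighbors' states at step t. The averaging update rule here is an illustrative placeholder.

```python
# Minimal sketch of a "dataflow" computation on an n-node ring: at each
# step, every node updates its state from its own value and its two ring
# neighbors' values. The averaging rule is an illustrative placeholder,
# not the paper's algorithm.

def ring_dataflow_step(state):
    n = len(state)
    return [
        (state[(i - 1) % n] + state[i] + state[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

def run(state, steps):
    for _ in range(steps):
        state = ring_dataflow_step(state)
    return state

# One step on a 4-node ring: each node averages itself with its neighbors.
print(run([0.0, 0.0, 3.0, 0.0], 1))  # → [0.0, 1.0, 1.0, 1.0]
```

With T such steps on a unit-delay ring, the paper's claim is that the whole computation can be emulated in O(T) steps on any bounded-degree connected network whose average link delay is O(1).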


On the fault tolerance of some popular bounded-degree networks
The authors analyze the fault-tolerance properties of several bounded-degree networks and show that an N-node butterfly containing N^{1-ε} worst-case faults can emulate a fault-free butterfly of the same size with only constant slowdown, making butterflies the first connected bounded-degree networks known to sustain more than a constant number of worst-case faults.
Asymptotically tight bounds for computing with faulty arrays of processors
It is proved that in either scenario low-dimensional arrays are surprisingly fault tolerant, and it is shown how to route, sort, and perform systolic algorithms for problems such as matrix multiplication in optimal time on faulty arrays.
A Communication-Time Tradeoff
A nontrivial tradeoff is shown between the communication c and time t required to compute a collection of values whose dependencies form a grid: there must be a single path through the grid along which there are d communication steps, where $(d + 1)t = \Omega(n^2)$.
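The grid dependency structure in question is one where the value at position (i, j) depends on the values at (i-1, j) and (i, j-1); when two dependent values live on different processors, the edge between them costs a communication step. A minimal sketch of such a grid computation, with an illustrative combine rule (summation, not taken from the paper):

```python
# Values on an n x n grid where entry (i, j) depends on (i-1, j) and
# (i, j-1). In the parallel setting, an edge whose endpoints sit on
# different processors costs a communication step; this sequential sketch
# only shows the dependency structure. The combine rule (sum) is
# an illustrative assumption.

def grid_compute(n):
    v = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0 and j == 0:
                v[i][j] = 1  # source value
            else:
                top = v[i - 1][j] if i > 0 else 0
                left = v[i][j - 1] if j > 0 else 0
                v[i][j] = top + left
    return v

# With this rule, v[i][j] is the binomial coefficient C(i + j, i).
print(grid_compute(4)[3][3])  # → 20, i.e. C(6, 3)
```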
Multi-scale self-simulation: a technique for reconfiguring arrays with faults
If faulty nodes are allowed to communicate, but not compute, then an N-node one-dimensional array can tolerate log^{O(1)} N worst-case faults and still emulate a fault-free array with constant slowdown, and this bound is tight.
Efficient Out-of-Core Algorithms for Linear Relaxation Using Blocking Covers
A general method is presented that can substantially reduce I/O traffic for many problems, applied to out-of-core algorithms for sparse linear relaxation, in which each iteration updates the state of every vertex in a graph with a linear combination of the states of its neighbors.
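The linear relaxation pattern described here is, in its simplest in-core form, a Jacobi-style iteration. A minimal sketch with a uniform neighbor-averaging update (an illustrative assumption; the blocking-cover technique itself concerns scheduling these updates to reduce I/O, which this sketch does not attempt):

```python
# Jacobi-style linear relaxation on a graph: each iteration replaces
# every vertex's state with a linear combination of its neighbors'
# states. Uniform averaging weights are an illustrative assumption.

def relax(adjacency, state, iterations):
    for _ in range(iterations):
        state = [
            sum(state[j] for j in adjacency[i]) / len(adjacency[i])
            for i in range(len(state))
        ]
    return state

# 4-cycle graph: vertices 0-1-2-3-0.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(relax(adj, [1.0, 0.0, 1.0, 0.0], 1))  # → [0.0, 1.0, 0.0, 1.0]
```

On this bipartite example the states simply swap each iteration; on general graphs with suitable weights the iteration converges, and the out-of-core question is how to order the vertex updates so that blocks of the graph can be processed while resident in fast memory.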
Task Clustering and Scheduling for Distributed Memory Parallel Architectures
A simple greedy algorithm is presented for the problem of scheduling parallel programs, represented as directed acyclic task graphs, for execution on distributed-memory parallel architectures; it runs in O(n(n lg n + e)) time, which is n times faster than the best previously known algorithm for this problem.
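To make the scheduling problem concrete, here is a basic greedy list scheduler for a task DAG on p identical processors, ignoring communication delays; a simplified stand-in for illustration only, not the clustering algorithm the paper presents.

```python
# Greedy list scheduling of a task DAG on p identical processors,
# ignoring communication delays -- a simplified illustration of the
# problem setting, not the paper's clustering algorithm.

def list_schedule(deps, cost, p):
    # deps[t] = set of predecessors of task t; cost[t] = execution time.
    indeg = {t: len(deps[t]) for t in deps}
    succ = {t: [] for t in deps}
    for t, preds in deps.items():
        for q in preds:
            succ[q].append(t)
    ready = [t for t in deps if indeg[t] == 0]
    proc_free = [0] * p          # time at which each processor frees up
    finish = {}
    while ready:
        t = ready.pop(0)
        # A task may start only after all its predecessors have finished.
        earliest = max((finish[q] for q in deps[t]), default=0)
        i = min(range(p), key=lambda k: proc_free[k])
        start = max(proc_free[i], earliest)
        finish[t] = start + cost[t]
        proc_free[i] = finish[t]
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return max(finish.values())  # makespan

# Diamond DAG a -> {b, c} -> d with unit costs on 2 processors.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
cost = {t: 1 for t in deps}
print(list_schedule(deps, cost, 2))  # → 3
```

Adding a communication delay whenever a task and its predecessor land on different processors is exactly what makes clustering worthwhile, and is the part this sketch omits.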
Towards an architecture-independent analysis of parallel algorithms
It would be very interesting if the authors could combine stages (3) and (4) into a single step whereby the performance of the algorithm is measured as the makespan of the schedule (elapsed time for computing the last result).
Lower Bounds and Efficient Algorithms for Multiprocessor Scheduling of Directed Acyclic Graphs with Communication Delays
An n^{τ+1}-time algorithm is presented for optimally scheduling a dag of n nodes on a multiprocessor when the message-to-instruction ratio parameter is τ; it constructs an optimum schedule that uses at most n processors.
Implementation of a Portable Nested Data-Parallel Language
Initial benchmark results show that NESL's performance is competitive with that of machine-specific codes for regular dense data, and is often superior for irregular data.
Bulk synchronous parallel computing
The router operates independently of the computational and memory elements and masks any substantial latency it may have by pipelining, and a synchronizer provides for bulk synchronization in supersteps of multiple computational steps.
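The superstep structure described above (local computation, pipelined message delivery by the router, then bulk synchronization) can be sketched as a single-process simulation; a schematic illustration of the BSP model, not Valiant's machine, with a toy computation (forwarding each value to the next processor) chosen only to make the phases visible.

```python
# Schematic simulation of BSP supersteps. Each superstep consists of
# (1) local computation and message generation on every processor,
# (2) message delivery by the "router", whose latency is masked by
#     pipelining, so delivery order within a superstep is irrelevant,
# (3) a barrier: all messages are in place before any processor proceeds.
# The toy computation forwards each processor's value to its successor.

def bsp_run(num_procs, initial, supersteps):
    values = list(initial)
    for _ in range(supersteps):
        # Phase 1: local computation produces (src, dst, payload) messages.
        outbox = [(p, (p + 1) % num_procs, values[p]) for p in range(num_procs)]
        # Phase 2: the router delivers every message of this superstep.
        inbox = [[] for _ in range(num_procs)]
        for _, dst, payload in outbox:
            inbox[dst].append(payload)
        # Phase 3: barrier synchronization, then the next local state.
        values = [sum(inbox[p]) for p in range(num_procs)]
    return values

print(bsp_run(3, [1, 2, 3], 1))  # → [3, 1, 2]: values rotate one step
```

Because every processor sends and receives exactly one message per superstep, three supersteps on three processors rotate the values back to their starting positions.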