## Figures and Tables from this paper

Figures 3.1–3.18, Table 4.1, Figures 5.1–5.8, Tables 5.1–5.2, Figures 6.1–6.3, Figures 7.1–7.3, Figures 9.1–9.2

## 179 Citations

### The Economic Impact of Moore's Law: Evidence from When it Faltered

- Economics
- 2017

“Computing performance doubles every couple of years” is the popular rephrasing of Moore’s Law, which describes the 500,000-fold increase in the number of transistors on modern computer chips. But…

### Communication optimization in iterative numerical algorithms : an algorithm-architecture interaction

- Computer Science
- 2013

This work shows how to select the unroll factor k in an architecture-agnostic manner to provide communication-computation tradeoff on FPGA and GPU, and presents a new algorithm for the FPGAs which matches with their strength to reduce redundant computation to allow large k and hence higher speedups.

### Communication lower bounds and optimal algorithms for programs that reference arrays - Part 1

- Computer Science
- ArXiv
- 2013

This work generalizes the lower bound approach used initially for Θ(N³) matrix multiplication to a much larger class of algorithms that may have arbitrary numbers of loops and arrays with arbitrary dimensions, as long as the index expressions are affine combinations of loop variables.

### Multilevel communication optimal LU and QR factorizations for hierarchical platforms

- Computer Science
- ArXiv
- 2013

This study focuses on the performance of two classical dense linear algebra algorithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We first introduce a new model called…

### Overlapping clusters for distributed computation

- Computer Science
- WSDM '12
- 2012

This work describes a graph decomposition algorithm for the paradigm in which partitions intersect, presents a framework for distributed computation across a collection of overlapping clusters, and shows how this framework can be used in various algorithms based on the graph diffusion process.

### Some issues in dense linear algebra for multicore and special purpose architectures

- Computer Science
- 2008

We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special purpose architectures (in particular GPUs). We present them in the…

### A Threaded Parallel Code for Pricing Discrete Asian Options on SMP Systems

- Computer Science
- 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment (HPCS'06)
- 2006

Three implementations of a parallel algorithm for pricing discrete Asian options are described: one using the message passing interface (MPI), one using OpenMP, and one using POSIX threads through a high-level Fortran API.

### Prospectus for the Next LAPACK and ScaLAPACK Libraries

- Computer Science
- PARA
- 2006

Based on an ongoing user survey and research by many people, the following improvements are proposed: faster algorithms (including better numerical methods), memory hierarchy optimizations, parallelism, and automatic performance tuning to accommodate new architectures.

### Red Storm Capability Computing Queuing Policy.

- Computer Science
- 2005

The basic queuing policy design is described, along with extensions to handle switching between classified and unclassified operation, use by ASC university partners, priority access, etc.

### Brief Announcement: On the I/O Complexity of Sequential and Parallel Hybrid Integer Multiplication Algorithms

- Computer Science
- Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures
- 2022

This work presents an Ω((n/max(M, n₀))^(log₂ 3) · (max(1, n₀/M))² · M) lower bound on the I/O complexity of a class of "uniform, non-stationary" hybrid algorithms, where n₀ denotes the threshold size of sub-problems that are computed using standard algorithms with algebraic complexity Ω(n²).

## References


### Co-array Fortran for parallel programming

- Computer Science
- FORF
- 1998

This paper introduces the Co-array Fortran extension, gives examples illustrating how clear, powerful, and flexible it can be, and provides a technical definition.

### Global arrays: A nonuniform memory access programming model for high-performance computers

- Computer Science
- The Journal of Supercomputing
- 2004

The key concept of GAs is that they provide a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes.

### OpenMP: an industry standard API for shared-memory programming

- Computer Science
- 1998

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and, separately, C and C++) to express shared-memory parallelism. It…