Robert S. Schreiber

Learn More
In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened(More)
Gaussian elimination with partial pivoting is unstable in the worst case: the "growth factor" can be as large as 2"-l, where n is the matrix dimension, resulting in a loss of n bits of precision. It is proposed that an average-case analysis can help explain why it is nevertheless stable in practice. The results presented begin with the observation that for(More)
—Demand for memory capacity and bandwidth keeps increasing rapidly in modern computer systems, and memory power consumption is becoming a considerable portion of the system power budget. However, the current DDR DIMM standard is not well suited to effectively serve CMP memory requests from both a power and performance perspective. We propose a new memory(More)
Continuous evolution in process technology brings energy-efficiency and reliability challenges, which are harder for memory system designs since chip multiprocessors demand high bandwidth and capacity, global wires improve slowly, and more cells are susceptible to hard and soft errors. Recently, there are proposals aiming at better main-memory energy(More)
The polar decomposition of an m x n matrix A of full rank, where rn n, can be computed using a quadratically convergent algorithm of Higham SIAMJ. The algorithm is based on a Newton iteration involving a matrix inverse. It is shown how, with the use of a preliminary complete orthogonal decomposition, the algorithm can be extended to arbitrary A. The use of(More)
Many of the currently popular 'block algorithms' are scalar algorithms in which the operations have been grouped and reordered into matrix operations. One genuine block algorithm in practical use is block LU factorization, and this has recently been shown by Demmel and Higham to be unstable in general. It is shown here that block LU factorization is stable(More)
It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks. In this paper we show that array-based languages such(More)
To handle the demand for very large main memory, we are likely to use nonvolatile memory (NVM) as main memory. NVM main memory will have higher latency than DRAM. To cope with this, we advocate a less-deep cache hierarchy based on a large last-level, NVM cache. We develop a model that estimates average memory access time and power of a cache hierarchy. The(More)
Programming with atomic sections is a promising alternative to locks since it raises the abstraction and removes deadlocks at the programmer level. However, implementations of atomic sections using software transactional memory (STM) support have significant bookkeeping overheads. Additionally, because of the speculative nature of transactions, aborts can(More)
Matter structured on a length scale comparable to or smaller than the wavelength of light can exhibit unusual optical properties. Particularly promising components for such materials are metal nanostructures, where structural alterations provide a straightforward means of tailoring their surface plasmon resonances and hence their interaction with light. But(More)