Learn More
In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened(More)
Blocked algorithms have m uch better properties of data locality and therefore can be much more eecient than ordinary algorithms when a memory hierarchy i s 1 involved. On the other hand, they are very diicult to write and to tune for particular machines. Here we consider the reorganization of nested loops through the use of known program transformations in(More)
—Demand for memory capacity and bandwidth keeps increasing rapidly in modern computer systems, and memory power consumption is becoming a considerable portion of the system power budget. However, the current DDR DIMM standard is not well suited to effectively serve CMP memory requests from both a power and performance perspective. We propose a new memory(More)
It is cumbersome to write machine learning and graph algorithms in data-parallel models such as MapReduce and Dryad. We observe that these algorithms are based on matrix computations and, hence, are inefficient to implement with the restrictive programming and communication interface of such frameworks. In this paper we show that array-based languages such(More)
Continuous evolution in process technology brings energy-efficiency and reliability challenges, which are harder for memory system designs since chip multiprocessors demand high bandwidth and capacity, global wires improve slowly, and more cells are susceptible to hard and soft errors. Recently, there are proposals aiming at better main-memory energy(More)
The polar decomposition of an m x n matrix A of full rank, where rn n, can be computed using a quadratically convergent algorithm of Higham SIAMJ. The algorithm is based on a Newton iteration involving a matrix inverse. It is shown how, with the use of a preliminary complete orthogonal decomposition, the algorithm can be extended to arbitrary A. The use of(More)
Many of the currently popular 'block algorithms' are scalar algorithms in which the operations have been grouped and reordered into matrix operations. One genuine block algorithm in practical use is block LU factorization, and this has recently been shown by Demmel and Higham to be unstable in general. It is shown here that block LU factorization is stable(More)
To handle the demand for very large main memory, we are likely to use nonvolatile memory (NVM) as main memory. NVM main memory will have higher latency than DRAM. To cope with this, we advocate a less-deep cache hierarchy based on a large last-level, NVM cache. We develop a model that estimates average memory access time and power of a cache hierarchy. The(More)
Programming with atomic sections is a promising alternative to locks since it raises the abstraction and removes deadlocks at the programmer level. However, implementations of atomic sections using software transactional memory (STM) support have significant bookkeeping overheads. Additionally, because of the speculative nature of transactions, aborts can(More)