We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over…
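The excerpt does not describe how the radix sort is organized on the Y-MP. As a generic point of reference only, here is a minimal sequential Python sketch of a least-significant-digit (LSD) radix sort, assuming non-negative integer keys. The function name and the `bits_per_digit` and `key_bits` parameters are hypothetical. Each pass is a stable counting sort on one digit: a histogram, an exclusive prefix sum over the histogram, and a scatter.

```python
def radix_sort(keys, bits_per_digit=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits.

    Hypothetical illustration; not the CRAY Y-MP implementation.
    """
    radix = 1 << bits_per_digit
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_digit):
        # 1. Histogram: count how many keys have each digit value.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum turns counts into bucket start offsets.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # 3. Stable scatter: keys keep their relative order within a bucket,
        #    which is what makes later passes preserve earlier ones.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

For example, `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns `[2, 24, 45, 66, 75, 90, 170, 802]`. Smaller digits mean more passes but smaller histograms; the best setting on real hardware is machine-dependent and is not taken from the paper.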

We have implemented three parallel sorting algorithms on the Connection Machine Supercomputer model CM-2: Batcher's bitonic sort, a parallel radix sort, and a sample sort similar to Reif and Valiant's flashsort. We have also evaluated the implementation of many other sorting algorithms proposed in the literature. Our computational experiments show that…
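For readers unfamiliar with the first of these algorithms, the following is a minimal sequential Python sketch of Batcher's bitonic sorting network, assuming the input length is a power of two. It is not the CM-2 implementation; it only shows the compare-exchange pattern, in which every exchange within one `(k, j)` stage is independent and could be performed in parallel. The function name is hypothetical.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length list with Batcher's bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:            # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:         # compare-exchange distance within this stage
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks with bit k clear sort ascending, others descending.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The network performs the same comparisons regardless of the data, which is why it maps naturally onto SIMD hardware even though it does more total work than an O(n log n) sequential sort.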

This paper describes an optimized implementation of a set of *scan* (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrates that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans…
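A plus-scan returns, at each position, the sum of all earlier elements. The excerpt cuts off before describing the algorithm used on the Y-MP, so the sketch below is not that algorithm. It is the well-known tree-based (up-sweep/down-sweep) exclusive plus-scan, written sequentially in Python and assuming a power-of-two input length; the function name is hypothetical. Each level's inner loop has independent iterations, which is the property that lets a scan be vectorized or parallelized.

```python
def exclusive_plus_scan(xs):
    """Work-efficient exclusive plus-scan (up-sweep / down-sweep).

    Assumes len(xs) is a power of two.
    """
    a = list(xs)
    n = len(a)
    if n == 0:
        return a
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    # Up-sweep: build partial sums in place up a balanced binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a
```

For example, `exclusive_plus_scan([3, 1, 7, 0, 4, 1, 6, 3])` returns `[0, 3, 4, 11, 11, 15, 16, 22]`.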

Solution of partial differential equations by either the finite element or the finite difference methods often requires the solution of large, sparse linear systems. When the coefficient matrices associated with these linear systems are symmetric and positive definite, the systems are often solved iteratively using the preconditioned conjugate gradient…
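The preconditioner is not named in the excerpt. Purely as an illustration of the preconditioned conjugate gradient iteration, the sketch below assumes a Jacobi (diagonal) preconditioner and stores the matrix densely as a list of rows for brevity; a real finite-element or finite-difference solver would use a sparse format. The function name `pcg_jacobi` and its `tol` and `max_iter` parameters are hypothetical.

```python
def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG with an assumed Jacobi (diagonal) preconditioner.

    A: symmetric positive definite matrix as a list of rows.
    """
    n = len(b)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                  # r = b - A x, with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]   # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:               # stop on small residual
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)                  # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz                       # keeps directions A-conjugate
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x
```

For `A = [[4.0, 1.0], [1.0, 3.0]]` and `b = [1.0, 2.0]`, the result is approximately `[1/11, 7/11]`.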

This manual is a supplement to the language definition of Nesl version 3.1. It describes how to use the Nesl system interactively and covers features for accessing on-line help, debugging, profiling, executing programs on remote machines, using Nesl with GNU Emacs, and installing and customizing the Nesl system.

Cvl is a library of low-level vector routines callable from C. This library includes a wide variety of vector operations such as elementwise function applications, scans, reduces and permutations. Most Cvl routines are defined for segmented and unsegmented vectors. This paper is intended for Cvl users and implementors, and assumes familiarity with vector…
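This is not Cvl's API. The Python sketch below only illustrates what a segmented scan and a segmented reduce compute, assuming a head-flag representation in which `flags[i]` is true where a new segment begins (a representation that cannot express empty segments). Both function names are hypothetical. The scan is exclusive and restarts at zero at each segment head.

```python
def segmented_plus_scan(values, flags):
    """Exclusive plus-scan that restarts at every segment head."""
    out = []
    running = 0
    for i, (v, starts_segment) in enumerate(zip(values, flags)):
        if i == 0 or starts_segment:
            running = 0
        out.append(running)  # sum of earlier elements in this segment
        running += v
    return out


def segmented_plus_reduce(values, flags):
    """One sum per segment, in segment order."""
    sums = []
    for i, (v, starts_segment) in enumerate(zip(values, flags)):
        if i == 0 or starts_segment:
            sums.append(0)
        sums[-1] += v
    return sums
```

With `values = [1, 2, 3, 4, 5, 6]` and `flags = [True, False, False, True, False, True]`, the scan returns `[0, 1, 3, 0, 4, 0]` and the reduce returns `[6, 9, 6]`.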

This report introduces VCODE, an intermediate language for data-parallel computations. VCODE is designed to allow easy porting of data-parallel languages, such as C*, PARALATION LISP, and Fortran 8x, to a wide class of parallel machines. It is designed with the joint goals of being simple, expressive, and efficiently implementable. It contains about 50…

In this paper we present a new technique for sparse matrix multiplication on vector multiprocessors based on the efficient implementation of a segmented sum operation. We describe how the segmented sum can be implemented on vector multiprocessors such that it both fully vectorizes within each processor and parallelizes across processors. Because of our…
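The decomposition behind this technique can be shown sequentially: sparse matrix-vector multiplication becomes one elementwise multiply over all nonzeros followed by a segmented sum with one segment per row. The Python sketch below assumes a compressed-row layout given as per-row nonzero counts, column indices, and values (an assumption, not necessarily the paper's format), and uses a plain loop for the segmented sum rather than the vectorized, multiprocessor implementation the paper describes. The function name is hypothetical. Describing segments by row lengths rather than head flags lets empty rows produce zero.

```python
def csr_matvec_segmented(row_lengths, col_idx, vals, x):
    """y = A x with A in an assumed row-length/column-index layout."""
    # Step 1: one elementwise multiply over all nonzeros (no dependences).
    products = [v * x[c] for v, c in zip(vals, col_idx)]
    # Step 2: segmented sum, one segment per matrix row.
    y = []
    start = 0
    for length in row_lengths:
        y.append(sum(products[start:start + length]))
        start += length
    return y
```

For the matrix `[[10, 0, 2], [0, 0, 0], [3, 4, 0]]`, stored as `row_lengths = [2, 0, 2]`, `col_idx = [0, 2, 0, 1]`, `vals = [10, 2, 3, 4]`, and `x = [1, 2, 3]`, the result is `[16, 0, 11]`.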

For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors have more memory banks than processors. The object of this paper is to provide a simple model (with only a few…

Current connectionist simulations require huge computational resources. We describe a neural network simulator for the IBM GF11, an experimental SIMD machine with 566 processors and a peak arithmetic performance of 11 Gigaflops. We present our parallel implementation of the backpropagation learning algorithm, techniques for increasing efficiency,…