The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher order approximations of the continuum Boltzmann equation enable not only recovery of the Navier-Stokes hydrodynamics, ...
A large number of parallel applications contain a computationally intensive phase in which a large list of elements must be ordered based on some common attribute of the elements. How do we sort a sequence of elements on multiple processing units so as to minimize redistribution of keys while allowing processing units to do independent sorting work?
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with ...
Hybrid parallel programming with MPI for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two ...
We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the use of this scheduling strategy in communication-avoiding dense factorization leads to significant ...
Application performance can be degraded significantly due to node-local load imbalances during application execution. Prior work suggested using a mixed static/dynamic scheduling approach for handling this problem, specifically in the context of fine-grained, transient load imbalances. Here, we consider an alternate strategy for more general load imbalances ...
Domain decomposition for regular meshes on parallel computers has traditionally been performed by attempting to exactly partition the work among the available processors (now cores). However, these strategies often do not consider the inherent system noise, which can hinder MPI application scalability to emerging peta-scale machines with 10,000+ nodes. In ...
The NAS parallel benchmarks, originally developed by NASA for evaluating performance of their high-performance computers, have been regarded as one of the most widely used benchmark suites for side-by-side comparisons of high-performance machines. However, even though the NAS parallel benchmarks have grown tremendously in the last two decades, documentation ...
Performance irregularities on massively parallel processors lead to load imbalances and a significant loss of performance. Multi-core nodes suggest a promising way to redistribute work within a node, thus mitigating performance irregularities. However, there exists a non-trivial cost to redistributing work, and associated data, across cores. We investigate ...
Recent studies have shown that operating system (OS) interference, popularly called OS noise, can be a significant problem as we scale to a large number of processors. One solution for mitigating noise is to turn off certain OS services on the machine. However, this is typically infeasible because full-scale OS services may be required for some applications. ...