The Implementation and Testing of Time-Minimal and Resource-Optimal Parallel Reversal Schedules

@inproceedings{LehmannWalther,
  title={The Implementation and Testing of Time-Minimal and Resource-Optimal Parallel Reversal Schedules},
  author={U. Lehmann and A. Walther},
  booktitle={International Conference on Computational Science},
}
For computational purposes such as the computation of adjoints, the application of the reverse mode of automatic differentiation, or debugging, one may require the values computed during the evaluation of a function in reverse order. The naive approach is to store all information needed for the reversal and to read this information backwards during the reversal. This technique leads to an enormous memory requirement, which is proportional to the computing time. The paper presents an approach to reducing…
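The naive store-all strategy described in the abstract can be sketched in a few lines. This is an illustrative toy only, not the paper's implementation; the function names (`forward`, `reverse`) and the doubling example are invented for the sketch. The point is that the tape grows linearly with the number of steps, which is exactly the memory cost the paper aims to reduce.

```python
# Naive "store everything" reversal: the forward sweep tapes every
# intermediate state, and the reversal reads the tape backwards.
# Memory grows linearly with the number of time steps.

def forward(step, state, n_steps):
    """Run n_steps of `step`, taping every intermediate state."""
    tape = [state]
    for _ in range(n_steps):
        state = step(state)
        tape.append(state)
    return tape

def reverse(tape, backward_step, adjoint):
    """Consume the tape backwards, propagating an adjoint value."""
    for state in reversed(tape[:-1]):
        adjoint = backward_step(state, adjoint)
    return adjoint

# Example: each step doubles x, so the adjoint of the chain is 2**n_steps.
tape = forward(lambda x: 2.0 * x, 1.0, 10)   # tape holds all 11 states
grad = reverse(tape, lambda x, a: 2.0 * a, 1.0)
```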
Bounding the Number of Processors and Checkpoints Needed in Time-minimal Parallel Reversal Schedules
The structure of parallel reversal schedules that use the checkpointing technique on a multi-processor machine is described, and such schedules are shown to require the least number of processors and memory locations to store checkpoints for a given number of time steps.
Parallel reversal schedules using more checkpoints than processors
This diploma thesis is an attempt to continue the research by relaxing the central assumption, such that memory for a large number of plain checkpoints can be used with a comparatively small number of processors.
A-revolve: an adaptive memory-reduced procedure for calculating adjoints; with an application to computing adjoints of the instationary Navier–Stokes system
A low-storage, low-run-time approach for calculating numerical approximations of adjoint equations for the instationary Navier–Stokes equations uses adaptive checkpointing together with adaptive evaluation of the discretization step.
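The checkpointing idea behind this line of work can be illustrated with a deliberately simple uniform scheme: store only every k-th forward state and recompute the missing intermediates from the nearest checkpoint during the reversal, trading recomputation for O(n/k) memory. This is a sketch of the general technique only; a-revolve itself places checkpoints adaptively rather than uniformly, and all names below are invented for the example.

```python
# Uniform checkpointing sketch: keep every k-th state, recompute the
# rest from the nearest checkpoint during the reverse sweep.

def forward_ckpt(step, state, n_steps, k):
    """Forward sweep that stores checkpoints at steps 0, k, 2k, ..."""
    ckpts = {0: state}
    for i in range(1, n_steps + 1):
        state = step(state)
        if i % k == 0:
            ckpts[i] = state
    return ckpts

def reverse_ckpt(step, backward_step, ckpts, n_steps, k, adjoint):
    """Reverse sweep: rebuild each needed state from its checkpoint."""
    for i in range(n_steps - 1, -1, -1):
        base = (i // k) * k           # nearest checkpoint at or before i
        state = ckpts[base]
        for _ in range(i - base):     # recompute forward up to state_i
            state = step(state)
        adjoint = backward_step(state, adjoint)
    return adjoint

# Same doubling example as before: only 3 states (0, 4, 8) are stored
# instead of all 11, at the price of recomputing inside each segment.
ckpts = forward_ckpt(lambda x: 2.0 * x, 1.0, 10, 4)
grad = reverse_ckpt(lambda x: 2.0 * x, lambda s, a: 2.0 * a, ckpts, 10, 4, 1.0)
```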
Schedules for dynamic bidirectional simulations on parallel computers
The author is indebted to Rachel Lichten, John Shaw, and Ellen Smith, who helped him bring this thesis into shape, and thanks his parents very much for all of the lifelong support and help they have given him.
Adjoint Algorithmic Differentiation Tool Support for Typical Numerical Patterns in Computational Finance
The flexibility and ease of applying C++ algorithmic differentiation (AD) tools based on overloading to numerical patterns (kernels) arising in computational finance are demonstrated.
Algorithmic Differentiation of Numerical Methods : Tangent-Linear and Adjoint Solvers for Systems of Nonlinear Equations
We discuss software tool support for the Algorithmic Differentiation (also known as Automatic Differentiation; AD) of numerical simulation programs that contain calls to solvers for parameterized…
Algorithmic Differentiation of Numerical Methods: Tangent and Adjoint Solvers for Parameterized Systems of Nonlinear Equations
The algorithmic formalism is developed building on prior work by other colleagues and an implementation based on the AD software dco/c++ is presented, which supports the theoretically obtained computational complexity results with practical runtime measurements.
Separating language dependent and independent tasks for the semantic transformation of numerical programs
  • J. Utke, U. Naumann
  • Computer Science
  • IASTED Conf. on Software Engineering and Applications
  • 2004
Adjoint Calculation Using Time-Minimal Program Reversals for Multi-Processor Machines
A new approach to reversing program executions is presented that runs the forward simulation and the reversal process at the same speed; the paper illustrates the principal structure of time-minimal parallel reversal schedules and quotes the required resources.


The Tera computer system
The Tera architecture was designed with several goals in mind; it needed to be suitable for very high speed implementations…
Transactional Memory: Architectural Support For Lock-free Data Structures
  • M. Herlihy, J. Moss
  • Computer Science
  • Proceedings of the 20th Annual International Symposium on Computer Architecture
  • 1993
Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
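The optimistic read–validate–commit cycle at the heart of transactional memory can be mimicked in software. The sketch below is purely illustrative (hardware TM operates at cache-line granularity and the class and method names here are invented): a transaction reads a versioned cell, computes speculatively without holding a lock, and commits only if no other writer intervened, retrying otherwise.

```python
# Toy software sketch of optimistic transactions with retry-on-conflict.
import threading

class VersionedCell:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()   # models only the commit point

    def transact(self, fn):
        """Apply fn(value) -> value atomically: optimistic, with retry."""
        while True:
            seen_version, seen_value = self.version, self.value
            new_value = fn(seen_value)          # speculative work, no lock
            with self._lock:                    # validate and commit
                if self.version == seen_version:
                    self.value = new_value
                    self.version += 1
                    return new_value
            # conflict: another commit happened in between; retry

# Four threads each run 1000 increment transactions; every increment
# commits exactly once despite conflicts.
cell = VersionedCell(0)
threads = [threading.Thread(
               target=lambda: [cell.transact(lambda v: v + 1)
                               for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```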
Synchronization and communication in the T3E multiprocessor
The T3E augments the memory interface of the DEC 21164 microprocessor with a large set of explicitly-managed, external registers (E-registers), which provide a rich set of atomic memory operations and a flexible, user-level messaging facility. Expand
The SPLASH-2 programs: characterization and methodological considerations
This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understanding them well, including computational load balance, communication-to-computation ratio and traffic needs, important working-set sizes, and issues related to spatial locality.
A "flight data recorder" for enabling full-system multiprocessor deterministic replay
A practical low-overhead hardware recorder for cache-coherent multiprocessors, called the Flight Data Recorder (FDR), is presented; like an aircraft flight data recorder, it continuously records the execution, even on deployed systems, logging it for post-mortem analysis.
Design, implementation and testing of extended and mixed precision BLAS
The design rationale, a C implementation, and conformance testing of a subset of the new Standard for the BLAS (Basic Linear Algebra Subroutines), Extended and Mixed Precision BLAS, are described; the implementation achieves excellent performance.
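Why extended-precision accumulation (the core idea of the mixed-precision BLAS above) matters can be shown with a small sum whose terms cancel. This sketch is not the XBLAS code; it uses Python's `math.fsum` (an exactly rounded summation) as a stand-in for an extended-precision accumulator, against a naive working-precision loop.

```python
# Naive accumulation in working precision loses the small terms;
# an exactly rounded (effectively extended-precision) accumulator keeps them.
import math

terms = [1e16, 1.0, -1e16] * 1000   # exact mathematical sum is 1000.0

naive = 0.0
for t in terms:
    naive += t           # each 1.0 is absorbed: 1e16 + 1.0 rounds to 1e16

exact = math.fsum(terms)  # Shewchuk's algorithm: exactly rounded sum
# naive ends at 0.0, while exact recovers 1000.0
```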
Transactional Memory Coherence and Consistency ( TCC )
The Transactional memory Coherence and Consistency (TCC) provides a shared memory model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and…
MPI: The Complete Reference
MPI: The Complete Reference is an annotated manual for the latest 1.1 version of the standard that illuminates the more advanced and subtle features of MPI and covers such advanced issues in parallel computing and programming as true portability, deadlock, high-performance message passing, and libraries for distributed and parallel computing.
Superoptimizer: a look at the smallest program
Given an instruction set, the superoptimizer finds the shortest program to compute a function, using a probabilistic test that makes exhaustive searches practical for programs of useful size.
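The exhaustive, shortest-first search that a superoptimizer performs can be demonstrated on a toy instruction set. This is only a sketch in the spirit of the idea, not Massalin's system: the three-instruction accumulator machine is invented, and checking candidates against a handful of test inputs stands in for the probabilistic test mentioned above.

```python
# Brute-force "superoptimizer": enumerate programs shortest-first and
# return the first one that matches the target on all test inputs.
from itertools import product

# Toy machine: each instruction maps an accumulator x to a new value.
INSTRUCTIONS = {
    "INC": lambda x: x + 1,
    "DBL": lambda x: 2 * x,
    "NEG": lambda x: -x,
}

def run(program, x):
    for op in program:
        x = INSTRUCTIONS[op](x)
    return x

def superoptimize(target, tests, max_len=6):
    """Shortest instruction sequence agreeing with `target` on `tests`."""
    for length in range(max_len + 1):
        for program in product(INSTRUCTIONS, repeat=length):
            if all(run(program, x) == target(x) for x in tests):
                return list(program)
    return None

# Shortest program for f(x) = 2x + 2: increment, then double.
prog = superoptimize(lambda x: 2 * x + 2, tests=range(-4, 5))
```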
Automated Task Allocation for Network Processors
Network processors have great potential to combine high performance with increased flexibility. These multiprocessor systems consist of programmable elements, dedicated logic, and specialized memory…