A Practical Approach to DOACROSS Parallelization

  • Priya Unnikrishnan, Jun Shirako, Kit Barton, Sanjay Chatterjee, Raúl Silvera, Vivek Sarkar
Loops with cross-iteration dependences (doacross loops) often contain significant amounts of parallelism that can potentially be exploited on modern manycore processors. However, most production-strength compilers focus their automatic parallelization efforts on doall loops, and consider doacross parallelism to be impractical due to the space inefficiencies and the synchronization overheads of past approaches. This paper presents a novel and practical approach to automatically parallelizing… 

The Batched DOACROSS loop parallelization algorithm

  • D. C. S. Lucas, G. Araújo
  • Computer Science
    2015 International Conference on High Performance Computing & Simulation (HPCS)
  • 2015
A novel algorithm called Batched DOACROSS (BDX) is presented that capitalizes on the advantages of DSWP and DOACROSS while minimizing their deficiencies, and can effectively speed up several programs.

Revisiting the Parallel Strategy for DOACROSS Loops

A brand-new parallel strategy for DOACROSS loops is proposed that provides dynamic task assignment with reduced dependences to achieve wave-front parallelism through loop tiling, and performs comparably to a state-of-the-art TSS approach.

A Dynamic Parallel Strategy for DOACROSS Loops

A brand-new parallel strategy is proposed that achieves wave-front parallelism with reduced dependences and provides dynamic tile assignment for DOACROSS loops, which better avoids threads waiting at synchronization points and makes better use of computing resources.

Doacross parallelization using component annotation and loop-carried probability

A new OpenMP clause is created that, used together with the ordered directive, can separate these components and implement these techniques automatically; several factors must be taken into account when choosing a parallelization technique, i.e., algorithm structure, loop-carried ratio, number of loop iterations, and loop size.

Analysis of hotspot methods in JVM for best-effort run-time parallelization

This work analyzes the minimal portions of code identified by the Java Virtual Machine's profiling mechanism for potential loop parallelism, and finds that calls to other methods were the major hurdle in run-time parallelization.

Expressing DOACROSS Loop Dependences in OpenMP

Experimental results on a 32-core IBM Power7 system using four benchmark programs show performance improvements of the proposed doacross approach over OpenMP barriers by factors of 1.4× to 5.2× when using all 32 cores.

Polyhedral Optimizations for a Data-Flow Graph Language

This is the first system to encode explicit macro-dataflow parallelism in polyhedral representations so as to provide programmers with an easy-to-use DSL notation with legality checks, while taking full advantage of the optimization functionality in state-of-the-art polyhedral frameworks.

Performance Reduction For Automatic Development of Parallel Applications For Reconfigurable Computer Systems

The suggested methodology determines the maximum number of transformations needed for a balanced reduction of the performance and hardware costs of applications for reconfigurable computer systems.

Polyhedral Optimizations of Explicitly Parallel Programs

This paper addresses the problem of extending polyhedral frameworks to enable analysis and transformation of programs that contain both explicit parallelism and unanalyzable data accesses and demonstrates how polyhedral transformations with the resulting dependences can further improve the performance of the manually-parallelized OpenMP benchmarks.

An efficient algorithm for the run-time parallelization of DOACROSS loops

This work presents a new scheme that handles any type of data dependence in the loop without requiring any special architectural support in the multiprocessor, and significantly reduces the amount of processor communication required and increases the overlap among dependent iterations.

Compiler optimizations for parallel loops with fine-grained synchronization

The results indicate that these schemes outperform earlier schemes in terms of higher parallelism and lower communication requirements, and can form an integral part of future high-performance parallelizing compilers.

Accurately Selecting Block Size at Runtime in Pipelined Parallel Programs

  • D. Lowenthal
  • Computer Science
    International Journal of Parallel Programming
  • 2004
Performance on a network of workstations shows that programs that use the runtime analysis to choose the block size outperform those that use static block sizes by as much as 18% when the workload is unbalanced.

Compiler techniques for data synchronization in nested parallel loops

Using this scheme, a parallelizing compiler can parallelize a general nested loop structure with complicated cross-iteration data dependences even if the ordering numbers cannot be computed at compile time, and the run-time overhead is smaller than that of other existing run-time schemes.

On the Interaction of Tiling and Automatic Parallelization

This paper presents an algorithm that applies tiling in concert with parallelization, and presents the first comprehensive evaluation of tiling techniques on compiler-parallelized programs.

Compilation techniques for parallel systems

Experiments with Auto-Parallelizing SPEC2000FP Benchmarks

Although the parallelization results show relatively low speedup, they are still promising considering the difficulty of explicit parallel programming and the fact that more and more multi-threaded and multi-core chips will soon be available even for home computing.

On Data Synchronization For Multiprocessors

  • H. Su, P. Yew
  • Computer Science
    The 16th Annual International Symposium on Computer Architecture
  • 1989
This paper classifies the synchronization schemes based on how synchronization variables are used and proposes a new scheme, the process-oriented scheme, which requires a very small number of synchronization variables and can be supported very efficiently by simple hardware in the system.

Removal of redundant dependences in DOACROSS loops with constant dependences

It is shown that unlike with single loops, in the case of nested loops, a particular dependence may be redundant at some iterations but not redundant at others, so that the redundancy of a dependence may not be uniform over the entire iteration space.

Compiler Algorithms for Synchronization

Several loop synchronization techniques for generating synchronization instructions for singly-nested loops are presented, along with a technique for eliminating redundant synchronization instructions.