Vectorization for SIMD architectures with alignment constraints

Alexandre E. Eichenberger, Peng Wu, Kevin O'Brien
ACM-SIGPLAN Symposium on Programming Language Design and Implementation
When vectorizing for SIMD architectures that are commonly employed by today's multimedia extensions, one of the new challenges that arise is the handling of memory alignment. Prior research has focused primarily on vectorizing loops where all memory references are properly aligned. An important aspect of this problem, namely, how to vectorize misaligned memory references, still remains unaddressed. This paper presents a compilation scheme that systematically vectorizes loops in the presence of…
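To make the alignment constraint concrete, here is a minimal sketch. It is not the paper's compilation scheme. It simulates a machine whose vector loads can only start at addresses that are multiples of the vector length, and emulates a misaligned load by combining two adjacent aligned loads. The names `V`, `aligned_load`, and `misaligned_load` are hypothetical and chosen for illustration only.

```python
V = 4  # assumed vector length in elements (hypothetical)

def aligned_load(mem, addr):
    """Load V elements starting at the vector boundary at or below addr."""
    base = (addr // V) * V  # truncate the address to a vector boundary
    return mem[base:base + V]

def misaligned_load(mem, addr):
    """Emulate a load of mem[addr:addr+V] using only aligned loads."""
    offset = addr % V
    lo = aligned_load(mem, addr)
    if offset == 0:
        return lo  # already aligned, one load is enough
    hi = aligned_load(mem, addr + V)
    # Concatenate the two aligned vectors and select V elements at the offset.
    # Assumes mem is padded so that the second aligned load stays in bounds.
    return (lo + hi)[offset:offset + V]

mem = list(range(16))
result = misaligned_load(mem, 3)
```

In this sketch a misaligned reference costs two loads and one data reorganization step, which is the kind of overhead an alignment-aware vectorizer tries to keep low.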


Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

A loop transformation scheme is proposed that maximizes the parallelism of outermost loops while reducing misaligned memory references in innermost loops, together with an effective heuristic algorithm.

Efficient SIMD code generation for runtime alignment and length conversion

This paper proposes a novel technique to simdize loops with runtime alignment nearly as efficiently as those with compile-time misalignment, and incorporates length conversion operations, e.g., conversions between data of different sizes, into the alignment handling framework.

Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware

An auto-vectorizing compiler is developed that uses two pieces of special memory access hardware to improve the performance of SIMD processors: the split line buffer, which solves the non-aligned memory access problem, and the packing buffer, which simplifies irregular and strided data access.

Auto-vectorization of interleaved data for SIMD

This work demonstrates an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization.

Optimizing data permutations for SIMD devices

A strategy to optimize all forms of data permutations is presented and it is shown that up to 77% of the permutation instructions are eliminated and, as a result, the average performance improvement is 48% on VMX and 68% on SSE2.

Compiling vector programs for SIMD devices

VINCI, or Vector I-code Novel Compilation Infrastructure, is proposed in this thesis and focuses on translating vector programs into efficient code for SIMD devices, achieving near perfect speedups on VMX and SSE2 platforms.

Performance Impact of Misaligned Accesses in SIMD Extensions

This paper evaluates the advantages and disadvantages of different techniques to avoid misaligned memory accesses, such as replication of data in memory, padding of data structures, loop peeling, and shift instructions, and shows that the MMX implementation of the FIR filter using replication of data is up to 2.20 times faster than the MMX implementation with misaligned accesses.
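Of the techniques listed above, loop peeling can be sketched as follows. This is an illustrative sketch under stated assumptions, not code from the cited paper: scalar iterations run until the index reaches a vector boundary, the main loop then processes full aligned groups, and a scalar epilogue handles the remainder. The names `V` and `scale_with_peeling` are hypothetical.

```python
V = 4  # assumed vector length in elements (hypothetical)

def scale_with_peeling(data, start, end, factor):
    """Multiply data[start:end] by factor in place, using peeled loops."""
    i = start
    # Prologue: scalar iterations until i reaches a vector boundary
    while i < end and i % V != 0:
        data[i] *= factor
        i += 1
    # Main loop: each step touches one full, aligned group of V elements
    aligned_steps = 0
    while i + V <= end:
        data[i:i + V] = [x * factor for x in data[i:i + V]]
        i += V
        aligned_steps += 1
    # Epilogue: leftover scalar iterations
    while i < end:
        data[i] *= factor
        i += 1
    return aligned_steps

data = list(range(12))
steps = scale_with_peeling(data, 1, 11, 2)
```

A single prologue can align only references that share the same misalignment, so peeling alone does not cover loops whose references have different alignments.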

Vectorization for accelerated gather/scatter and multibyte data formats

This thesis proposes and evaluates a technique for the generation of SIMD code to gather and scatter data elements between memory and SIMD registers, and proposes and evaluates a vectorized code generation approach which supports reduced-precision floating point number formats along a continuum between native types.

Efficient SIMD code generation for irregular kernels

This work proposes a method to generate efficient SIMD code for loops containing indirected memory references using inter- and intra-iteration parallelism, and optimally place data reorganization code in order to amortize the reorganization overhead through the performance gain of SIMD vectorization.

An integrated simdization framework using virtual vectors

This paper proposes a simdization framework that addresses several orthogonal aspects of simdization, such as alignment handling, simdization of loops with mixed data lengths, and SIMD parallelism extraction from different program scopes (from basic blocks to inner loops).



A Vectorizing Compiler for Multimedia Extensions

An implementation of a vectorizing C compiler for Intel's MMX (Multimedia Extension) using the Stanford University Intermediate Format (SUIF), a public domain compiler tool, to enhance the scope for application of the subword semantics.

Compiler-controlled caching in superword register files for multimedia extension architectures

An algorithm and implementation of locality optimizations are presented for architectures whose instruction sets support operations on superwords, i.e., aggregate objects consisting of several machine words. The approach treats the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers.

Automatic Intra-Register Vectorization for the Intel® Architecture

A detailed overview of the automatic vectorization methods used by the high-performance Intel® C++/Fortran compiler is provided, together with an experimental validation of their effectiveness.

Increasing and detecting memory address congruence

Methods for forcing congruence among the dynamic addresses of a memory reference and a compiler algorithm for detecting this property are presented, which can be incorporated into real compilation systems.

An empirical study on the vectorization of multimedia applications for multimedia extensions

  • Gang Ren, Peng Wu, D. Padua
  • Computer Science
    19th IEEE International Parallel and Distributed Processing Symposium
  • 2005
This paper conducts an empirical study on the vectorization of media processing programs for multimedia extensions and proposes several techniques to address new issues that are not handled by traditional vectorizers.

Compilation Techniques for Multimedia Processors

Preliminary experimental results for a code generator for the UltraSPARC VIS instruction set show that speedups of up to a factor of 4.8 are possible, and that vectorization by unrolling is much simpler but as effective as classical vectorization.

Exploiting superword level parallelism with multimedia instruction sets

This paper develops a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests, and is able to exploit parallelism both across loop iterations and within basic blocks.

Simple vector microprocessors for multimedia applications

This paper demonstrates that a 2-way, in-order vector processor with a vector length of 64 and a vector width of 8 requires no more die area, and possibly significantly less area, than a 4-way, out-of-order superscalar processor with short vector extensions, and shows that the simple long vector processor is, on average, 2.7 times faster executing multimedia applications than the superscalars.

Modeling Data-Parallel Programs with the Alignment-Distribution Graph

An intermediate representation of a program called the Alignment-Distribution Graph that exposes the communication requirements of the program and serves as the basis for algorithms that map the array data and program computation to the nodes of a distributed-memory parallel computer so as to minimize completion time.

Automatic translation of FORTRAN programs to vector form

The theoretical background is developed here for employing data dependence to convert FORTRAN programs to parallel form and transformations that use dependence to uncover additional parallelism are discussed.