Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms

  • Gaurav Mitra, Beau Johnston, Alistair P. Rendell, Eric C. McCreath, J. Zhou
  • Published 1 May 2013
  • Computer Science
  • 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum
Augmenting a processor with special hardware that is able to apply a Single Instruction to Multiple Data (SIMD) at the same time is a cost-effective way of improving processor performance. It also offers a means of improving the ratio of processor performance to power usage, due to reduced and more effective data movement and intrinsically lower instruction counts. This paper considers and compares the NEON SIMD instruction set used on the ARM Cortex-A series of RISC processors with the SSE2 SIMD…
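
To make the idea of one instruction acting on several data elements concrete, the following is a minimal sketch in C and is not taken from the paper. The function name `add_u32` is hypothetical, and the choice of an element-wise 32-bit integer add is an assumption made purely for illustration. It uses SSE2 intrinsics when the compiler defines `__SSE2__`, NEON intrinsics when it defines `__ARM_NEON`, and plain scalar code otherwise.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics (x86) */
#elif defined(__ARM_NEON)
#include <arm_neon.h>    /* NEON intrinsics (ARM) */
#endif

/* Hypothetical helper, for illustration only: out[i] = a[i] + b[i].
 * Unsigned arithmetic is used so that wrap-around on overflow is
 * identical in the vector and scalar paths. */
void add_u32(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* One 128-bit instruction adds four 32-bit lanes at once. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#elif defined(__ARM_NEON)
    /* The NEON equivalent: four 32-bit lanes per 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);
        uint32x4_t vb = vld1q_u32(b + i);
        vst1q_u32(out + i, vaddq_u32(va, vb));
    }
#endif

    /* Scalar tail: leftover elements, or every element when neither
     * instruction set was detected at compile time. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

In the vector paths each loop iteration issues one add for four elements, which is where the lower instruction count mentioned above comes from. Whether that yields a measurable speedup depends on the compiler, the hardware, and memory behaviour.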

Mixed-length SIMD code generation for VLIW architectures with multiple native vector-widths

This paper researches and implements mixed-length SIMD code generation support for the SHAVE processor and improves the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing the SIMD hardware utilization.

A compilation technique and performance profits for VLIW with heterogeneous vectors

This paper proposes the use of VLIW processors with multiple native vector-widths to better serve applications with changing DLP and improves the performance of compiler-generated SIMD code by reducing the number of overhead operations.

Development and Application of a Hybrid Programming Environment on an ARM/DSP System for High Performance Computing

A hybrid programming environment that combines OpenMP, OpenCL and MPI to enable application execution across multiple Brown-Dwarf nodes is demonstrated, and results indicate that the Brown-Dwarf system remains competitive with contemporary systems for memory-bound computations.

Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture

Issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP, to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model are explored.

Control Flow Vectorization for ARM NEON

This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets and proposes a set of solutions to generate efficient vector instructions in the presence of control flow.

Free Rider

A description language is employed to specify the signature and semantics of intrinsics and perform graph-based pattern matching and high-level code transformations to derive optimized implementations exploiting the target’s intrinsics, wherever possible.


A multi-threaded algorithm using the standard OpenMP threading library to parallelize the computations using two Intel multi-core processors is presented and shows a maximum attained speedup closer to the number of physical cores in the CPU, which is the maximum theoretical value.

Improving Software Productivity and Performance Through a Transparent SIMD Execution

This work proposes a transparent Dynamic SIMD Assembler (DSA) that is capable of detecting vectorizable code regions at runtime without requiring specific libraries or compilers.

Accelerating Pre-stack Kirchhoff Time Migration by using SIMD Vector Instructions

It is shown that a hand-written Kirchhoff code using SIMD vector instructions is more efficient than the auto-vectorized code provided by GCC and can be used together with other optimizations to accelerate seismic migration methods in general without new investments in hardware and software.

Vectorization of binaural sound virtualization on the ARM Cortex-A15 architecture

This paper targets the efficient implementation of binaural sound virtualization, a heavy-duty audio processing application that can eventually require 16 convolutions to synthesize a virtual sound source, and describes a data reorganization that makes it possible to exploit the 128-bit NEON intrinsics of an ARM Cortex-A15 core.



Performance of SSE and AVX Instruction Sets

SSE (streaming SIMD extensions) and AVX (advanced vector extensions) are SIMD (single instruction multiple data streams) instruction sets supported by recent CPUs manufactured by Intel and AMD. This…

Towards High-Performance Implementations of a Custom HPC Kernel Using Intel® Array Building Blocks

A case study on data mining with adaptive sparse grids unveils how deterministic parallelism, safety, and runtime optimization make Intel ArBB practically applicable.

An Evaluation of Vectorizing Compilers

An evaluation of how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II shows that, despite all the work done in vectorization in the last 40 years, only 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers.

SIMD performance in software based mobile video coding

  • Tero Rintaluoma, O. Silvén
  • Computer Science
  • 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
  • 2010
This paper presents optimization methods and results from using the NEON instruction set and OpenMax DL API for MPEG-4 and H.264 video encoding and decoding; the serial bit stream processing bottleneck remains to be solved.

Early performance evaluation of AVX for HPC

Sparc64 VIIIfx: A New-Generation Octocore Processor for Petascale Computing

The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as

Blue Gene/Q: design for sustained multi-petaflop computing

The Blue Gene/Q system represents the third generation of optimized high-performance computing Blue Gene solution servers and provides a platform for continued growth in HPC performance and capability and gives application developers a platform to develop and deploy sustained petascale computing applications.

NVIDIA CUDA software and GPU parallel computing architecture

This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing, a scalable, highly parallel architecture that delivers high throughput for data-intensive processing.

Anatomy of a globally recursive embedded LINPACK benchmark

A novel formulation of LU factorization that is recursive and parallel at the global scope is used, presenting an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches.

Using mobile GPU for general-purpose computing – a case study of face recognition on smartphones

  • K. Cheng, Yi-Chu Wang
  • Computer Science
  • Proceedings of 2011 International Symposium on VLSI Design, Automation and Test
  • 2011
This work uses face recognition as an application driver, and implementation on a smartphone reveals that utilizing the mobile GPU as a co-processor can achieve significant speedup in performance as well as substantial reduction in total energy consumption, in comparison with a mobile-CPU-only implementation on the same platform.