Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms
@article{Mitra2013UseOS,
  title   = {Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms},
  author  = {Gaurav Mitra and Beau Johnston and Alistair P. Rendell and Eric C. McCreath and J. Zhou},
  journal = {2013 IEEE International Symposium on Parallel \& Distributed Processing, Workshops and Phd Forum},
  year    = {2013},
  pages   = {1107-1116}
}
Augmenting a processor with special hardware that is able to apply a Single Instruction to Multiple Data (SIMD) at the same time is a cost-effective way of improving processor performance. It also offers a means of improving the ratio of processor performance to power usage, due to reduced and more effective data movement and intrinsically lower instruction counts. This paper considers and compares the NEON SIMD instruction set used on the ARM Cortex-A series of RISC processors with the SSE2 SIMD…
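As a concrete illustration of the kind of portable SIMD kernel the paper benchmarks, the following minimal sketch (not taken from the paper itself) adds two 32-bit integer arrays four lanes at a time, using NEON intrinsics on ARM and SSE2 intrinsics on x86. The function name and the assumption that the array length is a multiple of four are choices made here for brevity.

```c
/* Minimal sketch (not from the paper): adding two int32 arrays four lanes
 * at a time with NEON on ARM and SSE2 on x86. n is assumed to be a
 * multiple of 4 for brevity. */
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
static void vec_add(int32_t *dst, const int32_t *a, const int32_t *b, int n) {
    for (int i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);   /* load 4 lanes from each input */
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(dst + i, vaddq_s32(va, vb));  /* lane-wise add and store */
    }
}
#elif defined(__SSE2__)
#include <emmintrin.h>
static void vec_add(int32_t *dst, const int32_t *a, const int32_t *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
}
#endif
```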
86 Citations
Mixed-length SIMD code generation for VLIW architectures with multiple native vector-widths
- Computer Science2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)
- 2015
This paper researches and implements mixed-length SIMD code generation support for the SHAVE processor and improves the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing SIMD hardware utilization.
A compilation technique and performance profits for VLIW with heterogeneous vectors
- Computer Science2015 4th Mediterranean Conference on Embedded Computing (MECO)
- 2015
This paper proposes using VLIW processors with multiple native vector widths to better serve applications with changing data-level parallelism (DLP), and improves the performance of compiler-generated SIMD code by reducing the number of overhead operations.
Development and Application of a Hybrid Programming Environment on an ARM/DSP System for High Performance Computing
- Computer Science2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2018
A hybrid programming environment that combines OpenMP, OpenCL and MPI to enable application execution across multiple Brown-Dwarf nodes is demonstrated; results indicate that the Brown-Dwarf system remains competitive with contemporary systems for memory-bound computations.
Implementation and Optimization of the OpenMP Accelerator Model for the TI Keystone II Architecture
- Computer ScienceIWOMP
- 2014
Issues and challenges encountered while migrating the matrix multiplication (GEMM) kernel, originally written only for the C6678 DSP, to the ARM-DSP SoC using an early prototype of the OpenMP 4.0 accelerator model are explored.
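For readers unfamiliar with the OpenMP 4.0 accelerator model the paper prototypes, the sketch below shows a naive matrix multiply offloaded with the target directive. It is illustrative only, not the paper's GEMM port; the map clauses and loop structure are assumptions chosen for brevity, and a real port would tune the mapping and work distribution.

```c
/* Hedged sketch of the OpenMP 4.0 accelerator model (not the cited paper's
 * GEMM kernel): a naive matrix multiply offloaded to an accelerator.
 * The map clauses copy the matrices to and from the device. */
void gemm_naive(int n, const float *A, const float *B, float *C) {
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(tofrom: C[0:n*n])
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i*n + k] * B[k*n + j];   /* inner dot product */
            C[i*n + j] += acc;
        }
}
```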
Control Flow Vectorization for ARM NEON
- Computer ScienceSCOPES
- 2018
This work analyzes the challenge of generating efficient vector instructions by benchmarking 151 loop patterns with three compilers on two SIMD instruction sets, and proposes a set of solutions for generating efficient vector instructions in the presence of control flow.
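The usual building block for vectorizing control flow on NEON is if-conversion: evaluate both sides of a branch and combine them with a per-lane compare and bitwise select. The sketch below is an illustrative hand-written example, not the code the cited work generates; the loop bound is assumed to be a multiple of four.

```c
/* Hedged sketch of if-conversion for NEON: the scalar branch
 *   out[i] = (a[i] > 0.0f) ? a[i] * 2.0f : b[i];
 * becomes a branch-free compare-and-select over 4 lanes. */
#include <arm_neon.h>

void select_loop(float *out, const float *a, const float *b, int n) {
    float32x4_t zero = vdupq_n_f32(0.0f);
    float32x4_t two  = vdupq_n_f32(2.0f);
    for (int i = 0; i < n; i += 4) {               /* n assumed multiple of 4 */
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        uint32x4_t  mask = vcgtq_f32(va, zero);    /* lane-wise a[i] > 0 */
        float32x4_t then = vmulq_f32(va, two);     /* value on the taken path */
        vst1q_f32(out + i, vbslq_f32(mask, then, vb)); /* per-lane select */
    }
}
```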
Free Rider
- Computer ScienceACM Trans. Embed. Comput. Syst.
- 2017
A description language is employed to specify the signature and semantics of intrinsics and perform graph-based pattern matching and high-level code transformations to derive optimized implementations exploiting the target’s intrinsics, wherever possible.
PROCESSORS USING OPENMP LIBRARY
- Computer Science
- 2016
A multi-threaded algorithm that uses the standard OpenMP threading library to parallelize the computations across two Intel multi-core processors is presented; the maximum attained speedup is close to the number of physical cores in the CPU, which is the theoretical maximum.
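A minimal sketch of the kind of OpenMP parallelization described (illustrative only, not the cited paper's algorithm): a reduction over an array distributed across the available cores, compiled with -fopenmp.

```c
/* Illustrative OpenMP reduction, not the cited paper's computation. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1 << 20 };
    static double x[N];                      /* static: keeps it off the stack */
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)  /* each thread sums a chunk */
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %.1f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}
```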
Improving Software Productivity and Performance Through a Transparent SIMD Execution
- Computer Science2018 31st Symposium on Integrated Circuits and Systems Design (SBCCI)
- 2018
This work proposes a transparent Dynamic SIMD Assembler (DSA) that is capable of detecting vectorizable code regions at runtime without requiring specific libraries or compilers.
Accelerating Pre-stack Kirchhoff Time Migration by using SIMD Vector Instructions
- Computer Science
- 2021
It is shown that a hand-written Kirchhoff code using SIMD vector instructions is more efficient than the auto-vectorized code produced by GCC, and that it can be combined with the other techniques to accelerate seismic migration methods in general without new investments in hardware and software.
Vectorization of binaural sound virtualization on the ARM Cortex-A15 architecture
- Computer Science2015 23rd European Signal Processing Conference (EUSIPCO)
- 2015
This paper targets the efficient implementation of binaural sound virtualization, a heavy-duty audio processing application that can eventually require 16 convolutions to synthesize a virtual sound source, and describes a data reorganization that makes it possible to exploit the 128-bit NEON intrinsics of an ARM Cortex-A15 core.
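The paper's exact data layout is not reproduced here, but NEON's structure loads illustrate the kind of reorganization involved: vld2q_f32 deinterleaves stereo samples into separate per-channel vectors that 128-bit intrinsics can then process. The function below is a hypothetical helper, with the frame count assumed to be a multiple of four.

```c
/* Illustrative only: deinterleave stereo samples L R L R ... into separate
 * left/right buffers so each channel can be processed with 128-bit SIMD. */
#include <arm_neon.h>

void split_stereo(float *left, float *right, const float *interleaved, int frames) {
    for (int i = 0; i < frames; i += 4) {              /* frames assumed multiple of 4 */
        float32x4x2_t lr = vld2q_f32(interleaved + 2 * i); /* deinterleaving load */
        vst1q_f32(left  + i, lr.val[0]);               /* even lanes -> left  */
        vst1q_f32(right + i, lr.val[1]);               /* odd lanes  -> right */
    }
}
```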
References
Performance of SSE and AVX Instruction Sets
- Computer ScienceArXiv
- 2012
SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) are SIMD (single instruction, multiple data) instruction sets supported by recent CPUs manufactured by Intel and AMD. This…
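To make the width difference concrete, the hedged sketch below (not taken from the cited report) expresses the same double-precision add with 128-bit SSE2 and 256-bit AVX intrinsics; AVX processes twice as many lanes per instruction. The array length is assumed to be a multiple of the lane count, and the AVX path needs -mavx or equivalent.

```c
/* Illustrative width comparison, not code from the cited report. */
#include <immintrin.h>

void add_sse2(double *d, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 2)                 /* 2 doubles per __m128d */
        _mm_storeu_pd(d + i,
            _mm_add_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i)));
}

#ifdef __AVX__
void add_avx(double *d, const double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 4)                 /* 4 doubles per __m256d */
        _mm256_storeu_pd(d + i,
            _mm256_add_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i)));
}
#endif
```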
Towards High-Performance Implementations of a Custom HPC Kernel Using Intel® Array Building Blocks
- Computer ScienceFacing the Multicore-Challenge
- 2011
A case study on data mining with adaptive sparse grids unveils how deterministic parallelism, safety, and runtime optimization make Intel ArBB practically applicable.
An Evaluation of Vectorizing Compilers
- Computer Science2011 International Conference on Parallel Architectures and Compilation Techniques
- 2011
An evaluation of how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from the Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II shows that, despite all the work done on vectorization in the last 40 years, only 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers.
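A classic reason such loops fail to auto-vectorize is possible pointer aliasing; the hedged example below (not one of the benchmark's 151 loops) shows how a restrict qualifier typically lets GCC or Clang emit SIMD code for a simple saxpy at -O3.

```c
/* Without restrict, the compiler must assume y and x may overlap and will
 * often refuse to vectorize; restrict asserts the arrays do not alias. */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```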
SIMD performance in software based mobile video coding
- Computer Science2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation
- 2010
This paper presents optimization methods and results from using the NEON instruction set and the OpenMAX DL API for MPEG-4 and H.264 video encoding and decoding; the serial bit-stream processing bottleneck remains to be solved.
Sparc64 VIIIfx: A New-Generation Octocore Processor for Petascale Computing
- Computer ScienceIEEE Micro
- 2010
The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as…
Blue Gene/Q: design for sustained multi-petaflop computing
- Computer ScienceICS '12
- 2012
The Blue Gene/Q system represents the third generation of optimized high-performance computing Blue Gene solution servers and provides a platform for continued growth in HPC performance and capability and gives application developers a platform to develop and deploy sustained petascale computing applications.
NVIDIA cuda software and gpu parallel computing architecture
- Computer ScienceISMM '07
- 2007
This talk will describe NVIDIA's massively multithreaded computing architecture and CUDA software for GPU computing, a scalable, highly parallel architecture that delivers high throughput for data-intensive processing.
Anatomy of a globally recursive embedded LINPACK benchmark
- Computer Science2012 IEEE Conference on High Performance Extreme Computing
- 2012
A novel formulation of LU factorization that is recursive and parallel at the global scope is used, presenting an alternative to existing linear algebra parallelization techniques such as master-worker and DAG-based approaches.
Using mobile GPU for general-purpose computing – a case study of face recognition on smartphones
- Computer ScienceProceedings of 2011 International Symposium on VLSI Design, Automation and Test
- 2011
This work uses face recognition as an application driver; implementations on a smartphone reveal that utilizing the mobile GPU as a co-processor can achieve significant speedup in performance as well as substantial reduction in total energy consumption, in comparison with a mobile-CPU-only implementation on the same platform.