Application performance analysis and efficient execution on systems with multi-core CPUs, GPUs and MICs: a case study with microscopy image analysis

  • George Teodoro, Tahsin M. Kurç, Guilherme Andrade, Jun Kong, Renato Ferreira, J. Saltz
  • Published 1 January 2017
  • Computer Science
  • The International Journal of High Performance Computing Applications, pages 32-51
We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core (MIC)) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the… 

Evaluating Multi-core and Many-Core Architectures through Parallelizing a High-Order WENO Solver

A systematic comparison of the three platforms in three aspects (performance, programmability, and power efficiency) is conducted to help programmers select the right platform and a suitable programming model for their target applications.

Optimization of Data Assignment for Parallel Processing in a Hybrid Heterogeneous Environment Using Integer Linear Programming

This paper investigates a practical approach to applying integer linear programming to optimize data assignment to compute units in a multi-level heterogeneous environment with various compute devices, including CPUs, GPUs and Intel Xeon Phis, and shows that OpenCL 1.2’s device fission allows for better performance in heterogeneous CPU+GPU environments.

Classification Framework for the Parallel Hash Join with a Performance Analysis on the GPU

  • K. Wozniak, E. Schikuta
  • Computer Science
  • 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC)
  • 2017
This work defines a taxonomy of the parallel hash join operator landscape and expects this classification framework to be a starting-point for design decisions for parallel big data hash join operators on other heterogeneous systems.

Efficient Methods and Parallel Execution for Algorithm Sensitivity Analysis with Parameter Tuning on Microscopy Imaging Datasets

The sensitivity analysis framework provides a range of strategies for efficiently exploring the parameter space, as well as multiple indexes for evaluating the effect of parameter modifications on outputs and the correlation between parameters.

Efficient Execution of Irregular Wavefront Propagation Pattern on Many Integrated Core Architecture

The objective of this study is to redesign the Irregular Wavefront Propagation Pattern algorithm to enable efficient execution on processors with the Many Integrated Core architecture using SIMD instructions.

Evaluating Multi-core and Many-Core Architectures through Accelerating an Alternating Direction Implicit CFD Solver

This paper accelerates a double-precision alternating direction implicit (ADI) solver for three-dimensional compressible Navier-Stokes equations from the authors' in-house computational fluid dynamics (CFD) software on the latest multi-core and many-core architectures and systematically evaluates the programmability of the three platforms.

Parallel and Efficient Sensitivity Analysis of Microscopy Image Segmentation Workflows in Hybrid Systems

The proposed strategies efficiently speed up sensitivity analysis (SA) via runtime optimizations targeting distributed hybrid systems and reuse of computations from runs with different parameters, which will allow the use of SA in large-scale studies.

Optimizing parameter sensitivity analysis of large‐scale microscopy image analysis workflows with multilevel computation reuse

This work proposes optimizations to reduce the overall computation cost of parameter sensitivity analysis in the context of analysis applications that segment high‐resolution slide tissue images, i.e., images with resolutions of 100k × 100k pixels.

Parallelizing a high-order WENO scheme for complicated flow structures on GPU and MIC

The experiments show that the Kepler GPU offers a clear advantage over the previous Fermi GPU with exactly the same source code. While the Kepler GPU can be several times faster than MIC when the SIMD computing power of the Vector Processing Unit (VPU) is not used, MIC can provide computing capability equivalent to the Kepler GPU when the VPU is utilized.

Acceleration of PDE-based FTLE calculations on Intel multi-core and many-core architectures

Finite-time Lyapunov exponent (FTLE) is widely used to extract coherent structure of unsteady flow. However, the calculation of FTLE can be highly time-consuming, which greatly limits the

Coordinating the use of GPU and CPU for improving performance of compute intensive applications

This paper investigates the coordinated use of CPU and GPU to improve the efficiency of applications even further than using either device independently, using the Anthill runtime environment, a data-flow oriented framework in which applications are decomposed into a set of event-driven filters.

Comparative Performance Analysis of Intel (R) Xeon Phi (TM), GPU, and CPU: A Case Study from Microscopy Image Analysis

This work systematically implements and evaluates the performance of operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application, and identifies the data access and computation patterns of operations in the object segmentation and feature computation categories.

Accelerating Large Scale Image Analyses on Parallel, CPU-GPU Equipped Systems

In the context of feature computations in large scale image analysis applications, evaluations show that intelligently co-scheduling CPUs and GPUs can significantly improve performance over GPU-only or multi-core CPU-only approaches.

Porting irregular reductions on heterogeneous CPU-GPU configurations

A Multi-level Partitioning Framework is developed that supports GPU execution of irregular reductions even when the dataset size exceeds the size of the device memory, and that can pipeline the partitioning performed on the CPU with the computations on the GPU.

Optimizing dataflow applications on heterogeneous environments

This work shows that making use of all of the heterogeneous computing resources can significantly improve application performance, and nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

Productive Programming of GPU Clusters with OmpSs

This work presents the implementation of OmpSs for clusters of GPUs, which supports asynchrony and heterogeneity for task parallelism based on annotating a serial application with directives that are translated by the compiler.

High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms

  • George Teodoro, T. Pan, J. Saltz
  • Computer Science
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
  • 2013
An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4K×4K-pixel image tiles in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.

Run-time optimizations for replicated dataflows on heterogeneous environments

This work shows that making use of all of the heterogeneous computing resources can significantly improve application performance, and nearly doubles the performance of the GPU-only implementation on a distributed heterogeneous accelerator cluster.

Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Adaptive mapping is proposed, a fully automatic technique to map computations to processing elements on a CPU+GPU machine. It is shown that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves on average a 25% reduction in execution time and a 20% reduction in energy consumption compared with static mappings for a set of important computation benchmarks.

Atomic Vector Operations on Chip Multiprocessors

The GLSC is proposed, which extends scatter-gather hardware to support atomic vector operations and provides an average performance improvement on a set of important RMS kernels of 54% for 4-wide SIMD.