Theoretical peak FLOPS per instruction set: a tutorial

  • Romain Dolbeau
  • Published 1 March 2018
  • Computer Science
  • The Journal of Supercomputing
Traditionally, evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today however, CPUs have features such as vectorization, fused multiply-add, hyperthreading, and “turbo” mode. In this tutorial, we look into this theoretical peak for recent fully featured Intel CPUs and other hardware, taking into account not only the simple absolute peak… 
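The multiplication the abstract describes can be made concrete with a small sketch. All parameter values below are illustrative assumptions (a hypothetical AVX-512 part), not figures from the paper:

```python
# Hedged sketch: theoretical peak FLOPS as the product of the factors the
# tutorial discusses. Every parameter value here is an illustrative
# assumption, not a measured or vendor-published number.

def peak_gflops(cores, ghz, simd_lanes, fma_units, flops_per_fma=2):
    """Peak GFLOPS = cores x frequency x SIMD lanes x FMA units x 2,
    since one fused multiply-add counts as two floating-point operations."""
    return cores * ghz * simd_lanes * fma_units * flops_per_fma

# Hypothetical example: 18 cores at 2.3 GHz, 8 double-precision lanes per
# AVX-512 register, 2 FMA units per core.
print(round(peak_gflops(cores=18, ghz=2.3, simd_lanes=8, fma_units=2), 1))
# -> 1324.8
```

Note that "turbo" frequencies complicate this: the sustained clock under full-width vector load is often lower than the nominal frequency, which is part of what the tutorial examines.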
High‐performance SIMD modular arithmetic for polynomial evaluation
This article shows how to leverage SIMD (single instruction, multiple data) computing for modular arithmetic on AVX2 and AVX‐512 units, using both intrinsics and OpenMP compiler directives, and exploits instruction‐level parallelism to increase the compute efficiency of polynomial evaluations.
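The lane-parallel pattern that article exploits can be sketched scalar-style: each "lane" runs the same modular Horner recurrence on a different evaluation point, which is what an AVX2/AVX-512 unit would do 4 or 8 points at a time. The modulus and data below are toy values, not the paper's parameters:

```python
# Illustrative sketch of lane-wise modular polynomial evaluation.
# A SIMD unit would execute each list comprehension step as one vector
# instruction across 4 (AVX2) or 8 (AVX-512) 64-bit lanes.

P = 2**31 - 1  # toy Mersenne-prime modulus, chosen for illustration

def horner_mod_lanes(coeffs, points, p=P):
    """Evaluate sum(coeffs[i] * x**i) mod p for every x in `points`."""
    acc = [0] * len(points)        # one accumulator per lane
    for c in reversed(coeffs):     # Horner step: acc = acc * x + c (mod p)
        acc = [(a * x + c) % p for a, x in zip(acc, points)]
    return acc

# 8 "lanes" evaluating 3x^2 + 2x + 1 at x = 0..7
print(horner_mod_lanes([1, 2, 3], list(range(8))))
# -> [1, 6, 17, 34, 57, 86, 121, 162]
```

In the vectorized version, the modular reduction itself is the expensive step; the article's contribution is doing it efficiently with AVX2/AVX-512 intrinsics and OpenMP directives.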
Evolving Requirements and Trends of HPC
High-performance computing (HPC) denotes the design, build, or use of computing systems substantially larger than typical desktop or laptop computers, in order to solve problems that are unsolvable on such machines.
E-OSched: a load balancing scheduler for heterogeneous multicores
The results revealed that the proposed E-OSched performed significantly better than state-of-the-art scheduling heuristics, obtaining up to 8.09% improved execution time and up to 7.07% better throughput.
Dense and sparse parallel linear algebra algorithms on graphics processing units
This thesis studies the use of graphics processing units as computer accelerators, applying them to the field of linear algebra, and implements several GPU algorithms to solve linear systems of equations for the specific case of matrices with a block-tridiagonal structure.
Spectral Element Simulations on the NEC SX-Aurora TSUBASA
This paper introduces a new implementation of Nek5000’s gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora's hardware gather-scatter instructions, improving performance by up to 116%.
Optimization of Finite-Differencing Kernels for Numerical Relativity Applications
A simple optimization strategy for the computation of 3D finite-differencing kernels on many-core architectures provides substantial speedup in computations involving tensor contractions and 3D stencil calculations on different processor microarchitectures, including Intel Knights Landing.
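The kind of kernel that summary refers to can be illustrated with a minimal example: a second-order central difference on a uniform grid. Numerical-relativity codes apply this pattern along each axis of a 3D grid; the grid, spacing, and function here are illustrative:

```python
# Minimal sketch of a finite-differencing kernel: second-order central
# difference approximating f''(x) on a uniform 1D grid with spacing h.
# Real codes apply the same stencil along each axis of a 3D grid.

def second_derivative(f, h):
    """Central difference: f''[i] ~ (f[i-1] - 2*f[i] + f[i+1]) / h^2,
    computed for all interior points of the grid."""
    return [(f[i - 1] - 2 * f[i] + f[i + 1]) / (h * h)
            for i in range(1, len(f) - 1)]

h = 1.0
f = [x * x for x in range(6)]   # f(x) = x^2, so f'' = 2 everywhere
print(second_derivative(f, h))  # -> [2.0, 2.0, 2.0, 2.0]
```

The optimization problem the paper addresses is that such stencils, once extended to 3D and combined with tensor contractions, become memory-bandwidth and vectorization bound, so data layout and loop structure dominate performance.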
Sparse approximate matrix multiplication in a fully recursive distributed task-based parallel framework
This paper considers parallel implementations of approximate multiplication of large matrices with exponential decay of elements, arising in computations related to electronic structure calculations and some other fields of science, and derives the asymptotic behavior of the absolute error.
TOP-Storm: A topology-based resource-aware scheduler for Stream Processing Engine
TOP-Storm, a scheduler based on the topology’s DAG (directed acyclic graph), is proposed for Apache Storm (a popular open-source SPE); it optimizes resource usage for heterogeneous clusters and improves efficiency through resource-aware task assignments, resulting in enhanced throughput and better resource utilization.
Power Consumption and Delay in Wired Parts of Fog Computing Networks
This work addresses latency and power consumption in Fog computing networks by modeling multiple architectures and using various network scenarios to tackle the balance between Fog and Cloud.


The PA-8000 RISC CPU is the first of a new generation of Hewlett-Packard microprocessors designed for high-end systems, and features an aggressive, four-way, superscalar implementation, combining speculative execution with on-the-fly instruction reordering.
Bulldozer: An Approach to Multithreaded Compute Performance
The module multithreading architecture, power-efficient microarchitecture, and subblocks, including the various microarchitectural latencies, bandwidths, and structure sizes, are discussed.
Analysis of high-performance floating-point arithmetic on FPGAs
The impact of floating-point units on the design of an energy efficient architecture for the matrix multiply kernel is discussed and it is shown that FPGAs are capable of achieving up to 6x improvement in terms of the GFLOPS/W metric over that of general purpose processors.
Design of the IBM RISC System/6000 Floating-Point Execution Unit
The RS/6000 FPU is a unified floating-point multiply-add-fused unit (MAF) which performs the accumulate operation as an indivisible operation, which reduces the latency for chained floating-point operations, as well as rounding errors and chip busing.
ADAPTEVA: MORE FLOPS, LESS WATTS. Epiphany Offers Floating-Point Accelerator for Mobile Processors
The tiny startup has developed and tested a unique architecture that delivers industry-leading flops per watt and offers its Epiphany multicore architecture as an intellectual-property (IP) core that scales to various performance levels.
CASH: Revisiting Hardware Sharing in Single-Chip Parallel Processors
The CASH architecture shows that there exists intermediate design points between CMP and SMT, and outperforms a similar CMP on a multiprogrammed workload, as well as on a uniprocess workload.
Tradeoff of FPGA Design of a Floating-point Library for Arithmetic Operators
A parameterizable floating-point library for arithmetic operators based on FPGAs was implemented, and a tradeoff analysis of the hardware implementation was performed, which enables the designer to choose the suitable bit-width representation and associated error, as well as the area cost, elapsed time, and power consumption for each arithmetic operator.
The Tera computer system
Multi-processor Performance on the Tera MTA
A preliminary investigation of the first multi-processor Tera MTA, finding that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator.
Factored multi-core architectures
This work proposes an architecture where the large structures and latency-tolerant performance accelerators are factored out of the processor core into helpers, so that the small and fast μ-core can be augmented with these latency-tolerant helpers; it also investigates activity migration (core swapping) as a means of controlling the thermal profile of the chip.