Critical lock analysis: Diagnosing critical section bottlenecks in multithreaded applications

@article{Chen2012CriticalLA,
  title={Critical lock analysis: Diagnosing critical section bottlenecks in multithreaded applications},
  author={Guancheng Chen and Per Stenstr{\"o}m},
  journal={2012 International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2012},
  pages={1--11}
}
  • Guancheng Chen, P. Stenström
  • Published 10 November 2012
  • Business
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
Critical sections are well-known potential performance bottlenecks in multithreaded applications, and identifying the ones that inhibit scalability is important for performance optimization. […] Our method first identifies the critical sections appearing on the critical path, and then quantifies the impact of such critical sections on overall performance using quantitative performance metrics. Case studies show that our method can successfully identify critical sections that are most…
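The two steps named in the abstract (find the critical sections on the critical path, then quantify their impact) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Segment` record, the `released_by` field, and the "share of critical-path time" metric are hypothetical simplifications chosen for illustration, and the trace is a made-up two-thread example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One contiguous piece of a thread's execution (hypothetical trace record)."""
    thread: int
    start: float
    end: float
    lock: Optional[str] = None         # lock held during the segment, None outside critical sections
    released_by: Optional[int] = None  # index of the segment whose unlock this segment waited for


def critical_path(segments):
    """Walk backwards from the last-finishing segment to recover the critical path."""
    prev_on_thread, last_seen = {}, {}
    for i in sorted(range(len(segments)), key=lambda i: segments[i].start):
        prev_on_thread[i] = last_seen.get(segments[i].thread)
        last_seen[segments[i].thread] = i

    cur = max(range(len(segments)), key=lambda i: segments[i].end)
    path = []
    while cur is not None:
        path.append(cur)
        seg = segments[cur]
        # A blocked segment depends on the releasing thread, otherwise on its own thread.
        cur = seg.released_by if seg.released_by is not None else prev_on_thread[cur]
    return path[::-1]


def lock_shares(segments):
    """Fraction of critical-path time spent inside each lock's critical sections."""
    path = critical_path(segments)
    total = sum(segments[i].end - segments[i].start for i in path)
    held = defaultdict(float)
    for i in path:
        if segments[i].lock is not None:
            held[segments[i].lock] += segments[i].end - segments[i].start
    return {lock: t / total for lock, t in held.items()}


# Made-up trace: thread 1 blocks on lock "A" until thread 0 releases it at t=5.
trace = [
    Segment(thread=0, start=0, end=2),
    Segment(thread=0, start=2, end=5, lock="A"),
    Segment(thread=0, start=5, end=6),
    Segment(thread=1, start=0, end=1),
    Segment(thread=1, start=5, end=8, lock="A", released_by=1),
    Segment(thread=1, start=8, end=10),
]
shares = lock_shares(trace)
print(critical_path(trace), shares)
```

In this toy trace the critical path runs through both executions of the critical section guarded by `A`, so that lock accounts for 6 of the 10 time units on the path. A real tool would have to derive the `released_by` dependencies from recorded lock and unlock events rather than take them as input.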
ParaShares: Finding the Important Basic Blocks in Multithreaded Programs
TLDR
This work ignores any underlying pathologies, and focuses instead on pinpointing the exact locations in source code that consume the largest share of execution, resulting in a new metric, ParaShares, that scores and ranks all basic blocks in a program based on their share of parallel execution.
Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs
  • Y. Yao, Zhonghai Lu
  • Computer Science
    2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)
  • 2016
TLDR
Experimental results show that the proposed software-hardware cooperative mechanism can effectively increase the opportunity for threads to enter the critical section during the low-overhead spinning phase, on average reducing the competition overhead and accelerating the execution of the Region-of-Interest across all 25 benchmark programs.
A novel technique for atomic instructions functional verification using lock contention analysis
Nowadays, it is widely acknowledged that symmetric multi-processing (SMP) must use a set of synchronization mechanisms to achieve the results, which are free of race conditions and therefore
Classifying Performance Bottlenecks in Multi-threaded Applications
TLDR
This paper proposes a linear-regression-based model to identify scaling bottlenecks of multi-threaded applications and demonstrates the practical usefulness of the model by applying it to benchmark multi-threaded applications.
SyncProf: detecting, localizing, and optimizing synchronization bottlenecks
TLDR
The results show that SyncProf effectively localizes the root causes of these bottlenecks with higher precision than a state-of-the-art lock contention profiler and that it suggests valuable strategies to avoid the bottlenecks.
iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores
  • Y. Yao, Zhonghai Lu
  • Computer Science
    2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • 2018
TLDR
In-network packet generation (iNPG) is proposed to turn passive "normal" NoC routers, which only transmit packets, into active "big" ones that can also generate packets, which can significantly reduce competition overhead in various locking primitives.
Optimizing Trace Tool-overhead for Lock-Intensive Multi-threaded Parallel Applications
  • Ajit Singh, P. Chakraborty
  • Computer Science
    2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC)
  • 2020
TLDR
Mutexis, an optimized user-level dynamic binary instrumentation (DBI) tracing Pin tool, is developed, and its overhead is compared with the overhead of other researchers' tools.
SyncPerf: Categorizing, Detecting, and Diagnosing Synchronization Performance Bugs
TLDR
Low overhead, better coverage, and informative reports make SyncPerf an effective tool to find synchronization performance bugs in the production environment.
A fast causal profiler for task parallel programs
TLDR
TASKPROF is a profiler that identifies parallelism bottlenecks in task parallel programs and leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program.
Power aware parallel computing on asymmetric multiprocessor
TLDR
This paper presents a low power hardware based technique to calculate scores in order to determine critical threads and uses the score metric to create stacks that break total execution time into each thread's score components which makes it visually easier to determine optimization opportunities.

References

A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors
TLDR
A new methodology for tuning performance of parallel programs focuses on the critical sections used to assure exclusive access to critical resources and data structures, proposing a specific dynamic characterization of every critical section to measure the lock contention, measure the degree of data sharing in consecutive executions, and break down the execution time.
Scalable Critical-Path Based Performance Analysis
TLDR
A set of compact performance indicators is defined that helps answer a variety of important performance-analysis questions, such as identifying load imbalance, quantifying the impact of imbalance on runtime, and characterizing resource consumption.
Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors
TLDR
This paper proposes and evaluates simple but effective thread criticality predictors for parallel applications and shows that accurate predictors can be built using counters that are typically already available on-chip, and demonstrates two applications of the predictor.
Full-System Critical Path Analysis
TLDR
This paper presents a novel technique for applying critical-path analysis to complex systems composed of numerous interacting state machines, and applies it to analyzing network performance, and shows that it is able to find performance bottlenecks in both hardware and software.
Speculative lock reordering: optimistic out-of-order execution of critical sections
TLDR
It is shown that SLR can be implemented in a chip-multiprocessor by only modest extensions to already published thread-level data dependence speculation systems, and since an execution order can be selected that removes as many data dependences as possible, it can expose more concurrency.
Exploring the limits of disjoint access parallelism
TLDR
The design and architecture of a prototype tool that provides insights about critical sections are described; the tool is based on the Pin binary rewriting engine, works on unmodified x86 binaries, and considers both the amount of contention for a particular lock and the potential amount of disjoint access parallelism.
Critical Path Profiling of Message Passing and Shared-Memory Programs
TLDR
A runtime, non-trace-based algorithm to compute the critical path profile of the execution of message passing and shared-memory parallel programs is introduced, along with a variant of critical path zeroing that measures the reduction in application execution time that improving a selected procedure would yield.
The SPLASH-2 programs: characterization and methodological considerations
TLDR
This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Rapid identification of architectural bottlenecks via precise event counting
TLDR
New methods that enable precise, lightweight interfacing to on-chip performance counters are described, which allow precise reading of virtualized counters in low tens of nanoseconds, which is one to two orders of magnitude faster than current access techniques.
Modeling critical sections in Amdahl's law and its implications for multicore design
TLDR
It is shown that parallel performance is not only limited by sequential code but is also fundamentally limited by synchronization through critical sections, with the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part.