Lock cohorting: a general technique for designing NUMA locks

@inproceedings{Dice2012LockCA,
  title={Lock cohorting: a general technique for designing NUMA locks},
  author={David Dice and Virendra J. Marathe and Nir Shavit},
  booktitle={PPoPP '12},
  year={2012}
}
Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines' non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into scalable NUMA-aware spin-locks. Our… 
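The general shape of a cohort lock can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes one global lock plus one local lock per NUMA node, and a caller-supplied `node_id`, since Python has no notion of NUMA placement. The names `CohortLock`, `_Node`, `max_handoffs`, and `run_demo` are hypothetical, and Python's blocking `threading.Lock` stands in for the spin-locks the paper targets.

```python
import threading


class _Node:
    """Per-node state (hypothetical helper for this sketch)."""

    def __init__(self):
        self.local = threading.Lock()   # local lock shared by threads of one node
        self.meta = threading.Lock()    # protects the waiter count only
        self.waiters = 0                # threads currently waiting for `local`
        self.owns_global = False        # guarded by `local`
        self.handoffs = 0               # consecutive local hand-offs, guarded by `local`


class CohortLock:
    """Sketch: a global lock held on behalf of a node, passed among that node's threads."""

    def __init__(self, num_nodes, max_handoffs=64):
        # threading.Lock may be released by a thread other than the one that acquired it,
        # which this sketch relies on for the global lock.
        self._global = threading.Lock()
        self._nodes = [_Node() for _ in range(num_nodes)]
        self._max_handoffs = max_handoffs

    def acquire(self, node_id):
        node = self._nodes[node_id]
        with node.meta:
            node.waiters += 1
        node.local.acquire()
        with node.meta:
            node.waiters -= 1
        if not node.owns_global:
            # The previous local owner did not pass the global lock along.
            self._global.acquire()
            node.owns_global = True
            node.handoffs = 0

    def release(self, node_id):
        node = self._nodes[node_id]
        with node.meta:
            has_waiters = node.waiters > 0
        if has_waiters and node.handoffs < self._max_handoffs:
            # Keep the global lock within this node; the next local owner inherits it.
            node.handoffs += 1
        else:
            # No local waiter, or the hand-off bound was reached: let other nodes in.
            node.owns_global = False
            self._global.release()
        node.local.release()


def run_demo(num_nodes=2, threads_per_node=4, iters=200):
    """Increment a shared counter under the lock and return its final value."""
    lock = CohortLock(num_nodes)
    state = {"counter": 0}

    def worker(node_id):
        for _ in range(iters):
            lock.acquire(node_id)
            try:
                state["counter"] += 1
            finally:
                lock.release(node_id)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(num_nodes) for _ in range(threads_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

The `max_handoffs` bound is an assumption of this sketch: it limits how long one node can keep the global lock, so threads on other nodes are not starved.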

Lock Cohorting

This article presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful, and that allows one to transform any spin-lock algorithm, with minimal nonintrusive changes, into a scalable NUMA-aware spin-lock.

Efficient Abortable-locking Protocol for Multi-level NUMA Systems

The design and implementation of the HMCS-T lock are described. HMCS-T is a Hierarchical MCS (HMCS) lock variant that admits a timeout and maintains the locality benefits of HMCS while ensuring aborts are lightweight.

An Efficient Abortable-locking Protocol for Multi-level NUMA Systems

The HMCS-T lock is designed and evaluated. It is a Hierarchical MCS (HMCS) lock variant that admits a timeout, maintains the locality benefits of HMCS while keeping aborts lightweight, and offers the progress guarantee missing in most abortable queuing locks.

Scalable and practical locking with shuffling

A new technique, shuffling, is proposed that can dynamically accommodate NUMA-awareness and implement an efficient parking/wake-up strategy, without any auxiliary data structure, mostly off the critical path of the lock.

Compact NUMA-aware Locks

This work presents a compact NUMA-aware lock that requires only one word of memory, regardless of the number of sockets in the underlying machine. The new lock is implemented in user space and integrated into the Linux kernel's qspinlock, one of the major synchronization constructs in the kernel.

CLoF: A Compositional Lock Framework for Multi-level NUMA Systems

In the evaluation, CLoF locks outperform state-of-the-art NUMA-aware locks in most scenarios; for example, in a highly contended LevelDB benchmark, the best CLoF locks yield twice the throughput achieved with the CNA lock and ShflLock on large x86 and Armv8 servers.

A NUMA-Aware Recoverable Mutex Lock

The Recoverable Filter (RF) lock is proposed, a black-box transformation approach that exploits memory locality to transform a NUMA-oblivious recoverable mutex lock into a NUMA-aware one.

High performance locks for multi-level NUMA systems

This work describes a hierarchical variant of the MCS lock that adapts the principles of cohort locking to architectures with deep NUMA hierarchies. It also describes analytical models for the throughput and fairness of Cohort-MCS and Hierarchical MCS locks, which make it possible to tailor these locks for high performance on any target platform without empirical tuning.

Scalable adaptive NUMA-aware lock: combining local locking and remote locking for efficient concurrency

This work proposes SANL, a locking scheme that can deliver high performance under various contention levels by adaptively switching between the local and the remote lock scheme, and introduces a new NUMA policy for the remote lock that jointly considers node distances and server utilization when choosing lock servers.

NUMA-aware reader-writer locks

This paper presents what is, to the best of the authors' knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures, and presents several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node.
...

References

Scalable queue-based spin locks with timeout

It is demonstrated that it is possible to obtain both scalability and bounded waiting, using variants of the queue-based locks of Craig, Landin, and Hagersten, and of Mellor-Crummey and Scott.

A Hierarchical CLH Queue Lock

In a set of microbenchmarks run on a large-scale multiprocessor machine and a state-of-the-art multi-threaded multi-core chip, the HCLH algorithm exhibits better performance and significantly better fairness than the hierarchical backoff locks of Radovic and Hagersten.

Flat-combining NUMA locks

This paper presents a novel scalable hierarchical queue-lock algorithm based on the flat combining synchronization paradigm that significantly outperforms all classic locking algorithms, and at high concurrency levels, provides up to a factor of two improvement over HCLH, the most efficient known hierarchical locking algorithm.

Non-blocking timeout in scalable queue-based spin locks

New queue-based locks in which the timeout code is non-blocking are presented, which sacrifice the constant worst-case space per thread of previous algorithms, but allow us to bound the time that a thread may be delayed by preemption of its peers.

Hierarchical backoff locks for nonuniform communication architectures

  • Z. Radovic, Erik Hagersten
  • Computer Science
  • The Ninth International Symposium on High-Performance Computer Architecture (HPCA-9), 2003
This paper proposes a set of simple software-based hierarchical backoff locks (HBO) that create node affinity in NUCA and are shown to be very competitive for uncontested locks while being more than twice as fast for contended locks.

Mostly lock-free malloc

Multi-Processor Restartable Critical Sections permits user-level threads to know precisely which processor they are executing on and then to safely manipulate CPU-specific data, such as malloc metadata, without locks or atomic instructions.

Building FIFO and Priority-Queuing Spin Locks from Atomic Swap

The main technical contributions are techniques and algorithms that provide tight control over lock grant order, use only the atomic swap instruction, use at most one spin for lock acquisition and no spinning for lock release, and need only O(L + P) space on either a coherent-cache or NUMA machine.

Algorithms for scalable synchronization on shared-memory multiprocessors

The principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors, and the existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides protection against so-called “dance hall” architectures.

Adaptive backoff synchronization techniques

This work proposes a class of adaptive backoff methods that do not use any extra hardware and can significantly reduce the memory traffic to synchronization variables, and uses synchronization state to reduce polling of synchronization variables.

The Art of Multiprocessor Programming