Mostly lock-free malloc

@inproceedings{Dice2002MostlyLM,
  title={Mostly lock-free malloc},
  author={David Dice and Alex Garthwaite},
  booktitle={ISMM '02},
  year={2002}
}
Modern multithreaded applications, such as application servers and database engines, can severely stress the performance of user-level memory allocators like the ubiquitous malloc subsystem. […] MP-RCS (multi-processor restartable critical sections) avoids interference by using upcalls to notify user-level threads when preemption or migration has occurred. The upcall will abort and restart any interrupted critical sections. We use MP-RCS to implement a malloc package, LFMalloc (Lock-Free Malloc). LFMalloc is scalable, has extremely low latency…
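
To make the mechanism described in the abstract concrete, here is a minimal C sketch of an MP-RCS-style allocation fast path. It is an approximation, not the paper's code: names such as rcs_interrupted, cpu_heap, and lf_alloc_fast are invented, the flag would be set by a kernel-to-user upcall, and the real mechanism restarts the interrupted section itself, which closes the window between the flag check and the commit store that this sketch still has.

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_getcpu() */
    #include <stddef.h>

    #define NCPU 64

    struct block { struct block *next; };

    static struct { struct block *free_list; } cpu_heap[NCPU];

    /* Hypothetical flag the kernel-to-user upcall would set on preemption or
     * migration; in the real mechanism the upcall restarts the interrupted
     * section directly instead of merely setting a flag. */
    static _Thread_local volatile int rcs_interrupted;

    void *lf_alloc_fast(void)
    {
        for (;;) {
            rcs_interrupted = 0;                  /* (re)enter the critical section */
            int cpu = sched_getcpu();             /* processor we believe we run on */
            struct block *b = cpu_heap[cpu].free_list;   /* speculative read */
            if (b == NULL)
                return NULL;                      /* empty: fall back to a slower path */
            struct block *next = b->next;
            if (rcs_interrupted)                  /* preempted or migrated: restart */
                continue;
            cpu_heap[cpu].free_list = next;       /* commit without locks or atomics */
            return b;
        }
    }
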

Citations

SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores
TLDR
A new dynamic memory allocator for multi-threaded applications that is completely synchronization-free when a thread allocates a memory block and deallocates it by itself, and is highly scalable with a large number of threads.
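
The SFMalloc summary above hinges on the split between same-thread and cross-thread operations. Here is a rough C sketch of that general idea (not SFMalloc's actual data structures; th_alloc, th_free, and the remote_free stack are illustrative): the owning thread works on a private list with no atomics, while frees from other threads go through a lock-free push that the owner drains lazily.

    #include <stdatomic.h>
    #include <stddef.h>

    struct thread_heap;                          /* forward declaration */

    struct block {
        struct block *next;
        struct thread_heap *owner;               /* heap the block was carved from */
    };

    struct thread_heap {
        struct block *local_free;                /* touched only by the owning thread */
        _Atomic(struct block *) remote_free;     /* other threads push freed blocks here */
    };

    void *th_alloc(struct thread_heap *h)
    {
        if (h->local_free == NULL)               /* out of local blocks: drain remote frees */
            h->local_free = atomic_exchange(&h->remote_free, (struct block *)NULL);
        struct block *b = h->local_free;
        if (b != NULL)
            h->local_free = b->next;
        return b;                                /* NULL would mean: refill from a shared pool */
    }

    void th_free(struct thread_heap *me, struct block *b)
    {
        if (b->owner == me) {                    /* same-thread free: no synchronization */
            b->next = me->local_free;
            me->local_free = b;
        } else {                                 /* cross-thread free: lock-free push */
            struct block *head = atomic_load(&b->owner->remote_free);
            do {
                b->next = head;
            } while (!atomic_compare_exchange_weak(&b->owner->remote_free, &head, b));
        }
    }
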
Supporting per-processor local-allocation buffers using multi-processor restartable critical sections
TLDR
To support processor-specific transactions in dynamically generated code, a novel mechanism for implementing critical sections is developed that is efficient, allows preemption-notification at known points in a given critical section, and does not require explicit registration of the critical sections.
A Fast Lock-Free User Memory Space Allocator for Embedded Systems
TLDR
This paper introduces a lock-free, scalable allocator that removes synchronization cost and keeps fragmentation low, and improves overall program performance over the standard Linux allocator by up to a factor of 60 on 32 threads, and by up to a factor of 10 over the next best allocator the authors tested.
Supporting per-processor local-allocation buffers using lightweight user-level preemption notification
TLDR
A novel mechanism for implementing critical sections for processor-specific transactions in dynamically generated code is developed that is efficient, allows preemption-notification at known points in a given critical section, and does not require explicit registration of the critical sections.
Hierarchical PLABs, CLABs, TLABs in Hotspot
TLDR
Hierarchical allocation buffers (HABs) are introduced, with a three-level implementation that places processor- and core-local allocation buffers (PLABs, CLABs) between the global heap and TLABs, showing improved performance for a memory-allocation-intensive benchmark.
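
A rough sketch of the refill chain that the HAB idea implies, with hypothetical names and sizes (the cited work lives inside HotSpot and is considerably more involved): each thread bump-allocates from its TLAB; an empty TLAB refills in one chunk from the core-local CLAB, which in turn refills from the processor-local PLAB, which refills from the global heap, so shared state is touched less often at each level down the chain.

    #include <pthread.h>
    #include <stddef.h>
    #include <stdlib.h>

    struct lab {
        char *cur, *end;                 /* bump-pointer region, cur == end when empty */
        struct lab *parent;              /* NULL: refill straight from the global heap */
        pthread_mutex_t lock;            /* taken only for shared (core/processor) levels */
    };

    /* Carve n bytes out of lab l (n assumed <= chunk), refilling l from its
     * parent in chunk-sized pieces when it runs dry.  locked == 0 for the
     * thread-local level, which its owner uses without any synchronization. */
    static char *lab_alloc(struct lab *l, size_t n, size_t chunk, int locked)
    {
        if (l == NULL)
            return malloc(n);                            /* global heap: fully shared */
        if (locked) pthread_mutex_lock(&l->lock);
        if ((size_t)(l->end - l->cur) < n) {             /* exhausted: refill a whole chunk */
            char *p = lab_alloc(l->parent, chunk, 8 * chunk, 1);
            if (p == NULL) {
                if (locked) pthread_mutex_unlock(&l->lock);
                return NULL;
            }
            l->cur = p;
            l->end = p + chunk;
        }
        char *out = l->cur;
        l->cur += n;
        if (locked) pthread_mutex_unlock(&l->lock);
        return out;
    }

In this sketch a thread would call lab_alloc(&my_tlab, n, 4096, 0), with my_tlab.parent pointing at its core's CLAB and that CLAB's parent at the processor's PLAB.
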
wfspan: wait-free dynamic memory management
TLDR
Decentralized dynamic memory management, wfspan, based on non-linearizable wait-free lists, is presented; it guarantees bounded execution steps in both the allocation and deallocation procedures, at the cost of an increased but bounded worst-case memory footprint.
NBBS: A Non-Blocking Buddy System for Multi-core Machines
TLDR
This article presents a fully non-blocking buddy system in which threads performing concurrent allocations/releases do not undergo any spin-lock-based synchronization; threads proceed in parallel and commit their allocations/releases unless a conflict materializes while handling the allocator metadata.
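
The essence of the non-blocking claim above is that allocator metadata is updated with compare-and-swap rather than under spin locks. Here is a much-simplified flavor of that style in C: a lock-free bitmap of equal-sized blocks claimed with CAS. The cited buddy system generalizes this to a whole tree of block orders with non-blocking split and coalesce, none of which is shown here.

    #include <stdatomic.h>
    #include <stdint.h>

    #define NBLOCKS 1024
    static _Atomic uint64_t used[NBLOCKS / 64];     /* one bit per block */

    int nb_alloc_block(void)                        /* returns a block index, or -1 */
    {
        for (int w = 0; w < NBLOCKS / 64; w++) {
            uint64_t cur = atomic_load(&used[w]);
            while (cur != UINT64_MAX) {             /* some bit is still free */
                int bit = __builtin_ctzll(~cur);    /* lowest clear bit (GCC/Clang builtin) */
                uint64_t want = cur | (1ULL << bit);
                if (atomic_compare_exchange_weak(&used[w], &cur, want))
                    return w * 64 + bit;            /* claimed without locking */
                /* cur was reloaded by the failed CAS; retry within this word */
            }
        }
        return -1;                                  /* no free block */
    }

    void nb_free_block(int idx)
    {
        atomic_fetch_and(&used[idx / 64], ~(1ULL << (idx % 64)));
    }
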
Concurrent programming without locks
TLDR
This article presents three APIs which make it easier to develop nonblocking implementations of arbitrary data structures, and compares the performance of the resulting implementations against one another and against high-performance lock-based systems.
A Non-blocking Buddy System for Scalable Memory Allocation on Multi-core Machines
TLDR
This article presents a fully non-blocking buddy system that allows threads to proceed in parallel and commit their allocations/releases unless a conflict materializes while handling its metadata, and that is resilient to performance degradation under concurrent accesses independently of the current level of fragmentation of the managed memory blocks.
A comprehensive study of Dynamic Memory Management in OpenCL kernels
TLDR
KMA, the first dynamic memory allocator for OpenCL kernels, is presented, showing that by carefully analysing the limitations of the OpenCL environment it is possible to implement an in-kernel memory allocator, albeit one limited by the constraints imposed by the GPU platform and OpenCL.
...

References

Showing 1-10 of 37 references
Memory allocation for long-running server applications
TLDR
This work investigated how to build an allocator that is not only fast and memory efficient but also scales well on SMP machines, and designed and prototyped a new allocator, called LKmalloc, targeted for both traditional applications and server applications.
Hoard: a scalable memory allocator for multithreaded applications
TLDR
Hoard is the first allocator to simultaneously solve the above problems, and combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case.
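
A structural sketch in C of the per-processor-heap arrangement the Hoard summary describes (layout only, with invented names; Hoard's size classes, superblocks, and the emptiness thresholds that provably bound memory consumption are not reproduced): each thread maps to one of a small number of heaps, so the lock taken on the common path is shared by only a few threads, while heap 0 stands in for the global heap backing them.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define NHEAPS 17                              /* e.g. 2 * processors + 1 */

    struct block { struct block *next; };

    struct heap {
        pthread_mutex_t lock;
        struct block *free_list;                   /* a single size class, for brevity */
    };

    static struct heap heaps[NHEAPS];              /* heaps[0] acts as the global heap */

    static void heaps_init(void)                   /* call once at startup */
    {
        for (int i = 0; i < NHEAPS; i++)
            pthread_mutex_init(&heaps[i].lock, NULL);
    }

    static _Atomic int next_thread_id;
    static _Thread_local int my_id = -1;

    static struct heap *my_heap(void)
    {
        if (my_id < 0)                             /* assign each thread a heap once */
            my_id = atomic_fetch_add(&next_thread_id, 1);
        return &heaps[1 + my_id % (NHEAPS - 1)];
    }

    void *hoard_like_alloc(size_t sz)
    {
        struct heap *h = my_heap();
        pthread_mutex_lock(&h->lock);              /* contended only by threads sharing h */
        struct block *b = h->free_list;
        if (b != NULL)
            h->free_list = b->next;
        pthread_mutex_unlock(&h->lock);
        return b ? (void *)b : malloc(sz);         /* slow path: get memory elsewhere */
    }

Keeping the number of heaps a small multiple of the processor count is what keeps the common-case lock nearly uncontended in this arrangement.
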
Atomic heap transactions and fine-grain interrupts
TLDR
This work casts the space of possible implementations into a general taxonomy, and describes a new technique that provides a simple, low-overhead, low-latency interlock in the scheduler of the raw-hardware SML-based kernel, ML/OS.
Scheduler-conscious synchronization
TLDR
It is found that while it is possible to avoid pathological performance problems using previously proposed kernel mechanisms, a modest additional widening of the kernel/user interface can make scheduler-conscious synchronization algorithms significantly simpler and faster, with performance on dedicated machines comparable to that of scheduler-oblivious algorithms.
Fast Multiprocessor Memory Allocation and Garbage Collection
TLDR
It is argued that a reasonable level of garbage collector scalability can be achieved with relatively minor additions to the underlying collector code, and that the scalable collector does not need to be appreciably slower on a uniprocessor.
Practical considerations for non-blocking concurrent objects
  • B. Bershad
  • Proceedings of the 13th International Conference on Distributed Computing Systems, 1993
TLDR
The author examines the compare-and-swap operation in the context of contemporary bus-based shared memory multiprocessors, and it is shown that the common techniques for reducing synchronization overhead in the presence of contention are inappropriate when used as the basis for nonblocking synchronization.
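
For reference, the compare-and-swap retry loop that such nonblocking techniques are built on looks like the following generic update in C (not code from the cited paper); the paper's point concerns how this pattern behaves when many processors contend for the same word on a bus-based machine.

    #include <stdatomic.h>

    /* Generic nonblocking update built on compare-and-swap: read the current
     * value, compute the new one, and try to install it; if another thread got
     * in first, the failed CAS reloads the value and the loop retries. */
    long nonblocking_add(_Atomic long *word, long delta)
    {
        long old = atomic_load(word);
        while (!atomic_compare_exchange_weak(word, &old, old + delta)) {
            /* 'old' now holds the value that beat us; retry with it */
        }
        return old + delta;
    }
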
Practical implementations of non-blocking synchronization primitives
TLDR
The results presented here eliminate the problem by providing practical means for implementing any algorithm based on these instructions on any multiprocessor that provides either CAS or a form of LL/SC weak enough to be provided by all current hardware implementations.
Non-blocking synchronization and system design
TLDR
This thesis demonstrates that non-blocking synchronization is practical as the sole co-ordination mechanism in systems by showing that careful design and implementation of operating system software makes implementing efficient non-blocking synchronization far easier, and by demonstrating that DCAS (Double-Compare-and-Swap) is the necessary and sufficient primitive for implementing NBS.
Read-copy update: using execution history to solve concurrency problems
TLDR
This paper proposes a novel and extremely efficient mechanism, called read-copy update, and compares its performance to that of conventional locking primitives.
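
A toy C illustration of the read-copy-update pattern (not the cited paper's implementation): readers dereference the shared pointer with no locks; the writer publishes a new copy atomically and reclaims the old one only after a grace period in which every pre-existing reader has finished. The grace-period wait below is a crude stand-in using a global reader count, just to keep the example self-contained; real RCU tracks quiescent states so that readers pay essentially nothing.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct config { int timeout_ms; int retries; };

    static _Atomic(struct config *) current_config;   /* assumed set once at startup */
    static _Atomic int active_readers;                /* crude grace-period stand-in */

    int read_timeout(void)                            /* read side: no locks taken */
    {
        atomic_fetch_add(&active_readers, 1);
        struct config *c = atomic_load(&current_config);
        int t = c->timeout_ms;
        atomic_fetch_sub(&active_readers, 1);
        return t;
    }

    void update_timeout(int new_timeout)              /* single updater assumed */
    {
        struct config *old = atomic_load(&current_config);
        struct config *fresh = malloc(sizeof *fresh);
        *fresh = *old;                                /* copy ... */
        fresh->timeout_ms = new_timeout;              /* ... modify the copy ... */
        atomic_store(&current_config, fresh);         /* ... publish atomically */
        while (atomic_load(&active_readers) != 0)     /* wait out pre-existing readers */
            ;
        free(old);                                    /* now no reader can still see it */
    }
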
Fast mutual exclusion for uniprocessors
TLDR
It is shown that improving the performance of low-level atomic operations, and therefore mutual exclusion mechanisms, improves application performance.
...