SFMalloc: A Lock-Free and Mostly Synchronization-Free Dynamic Memory Allocator for Manycores

Authors:
Sangmin Seo;Junghyun Kim;Jaejin Lee
Affiliations:
-;-;-
Venue:
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Year:
2011

Citing 0
Cited 2

SSMalloc: a low-latency, locality-conscious memory allocator with stable performance scalability

Proceedings of the Asia-Pacific Workshop on Systems
SSMalloc: a low-latency, locality-conscious memory allocator with stable performance scalability

APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems


Abstract

As parallel programming becomes mainstream due to multicore processors, dynamic memory allocators used in C and C++ can limit the performance of multi-threaded applications if they are not scalable. In this paper, we present a new dynamic memory allocator for multi-threaded applications. The allocator never uses any synchronization for common cases. It uses only lock-free synchronization mechanisms for uncommon cases. Each thread owns a private heap and handles memory requests on that heap. Our allocator is completely synchronization-free when a thread allocates a memory block and deallocates it by itself. Synchronization-free means that threads do not communicate with each other at all. On the other hand, if a thread allocates a block and another thread frees it, we use a lock-free stack to atomically add it to the owner thread's heap to avoid the memory blowup problem. Furthermore, our allocator exploits various memory block caching mechanisms to reduce the latency of memory management. Freed blocks or intermediate memory chunks are cached hierarchically in each thread's heap and are reused for future memory allocation. We compare the performance and scalability of our allocator to those of well-known existing multi-threaded memory allocators using eight benchmarks. Experimental results on a 48-core AMD system show that our approach achieves better performance than other allocators for all benchmarks and is highly scalable with a large number of threads.
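
The split between the synchronization-free local path and the lock-free remote-free path described in the abstract can be illustrated with a minimal C++ sketch. This is not SFMalloc's code. The names `Block`, `ThreadHeap`, `alloc_local`, `free_local`, `free_remote`, and `drain_remote` are hypothetical, the heap holds blocks of a single size, and draining the remote stack with one atomic exchange is an assumption made here to keep the example short; the abstract states only that a lock-free stack returns remotely freed blocks to the owner's heap.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <new>

// A free block stores the link to the next free block in its own memory.
struct Block {
    Block* next;
};

// Hypothetical per-thread heap for blocks of one fixed size.
struct ThreadHeap {
    Block* local_free = nullptr;               // touched only by the owner thread
    std::atomic<Block*> remote_free{nullptr};  // pushed to by other threads

    // Owner-only free: plain loads and stores, no atomic operations.
    void free_local(void* p) {
        assert(p != nullptr);
        local_free = new (p) Block{local_free};
    }

    // Free issued by a non-owner thread: lock-free push via a CAS loop.
    void free_remote(void* p) {
        assert(p != nullptr);
        Block* b = new (p) Block{nullptr};
        Block* head = remote_free.load(std::memory_order_relaxed);
        do {
            b->next = head;  // on CAS failure, head holds the current top
        } while (!remote_free.compare_exchange_weak(
            head, b, std::memory_order_release, std::memory_order_relaxed));
    }

    // Owner-only: detach the whole remote stack with one exchange and
    // splice its blocks into the local list. Returns the number reclaimed.
    std::size_t drain_remote() {
        Block* b = remote_free.exchange(nullptr, std::memory_order_acquire);
        std::size_t n = 0;
        while (b != nullptr) {
            Block* next = b->next;
            b->next = local_free;
            local_free = b;
            b = next;
            ++n;
        }
        return n;
    }

    // Owner-only allocation: consults the remote stack only when the
    // local list is empty. Returns nullptr if no free block exists.
    void* alloc_local() {
        if (local_free == nullptr) {
            drain_remote();
        }
        Block* b = local_free;
        if (b != nullptr) {
            local_free = b->next;
        }
        return b;
    }
};
```

In this sketch, a thread that allocates and frees its own blocks executes no compare-and-swap at all; the single atomic exchange in `drain_remote` runs only when the local list is empty. Because the owner takes the entire remote stack at once rather than popping nodes one by one with CAS, the ABA problem that affects a general lock-free stack pop does not arise here. Returning remotely freed blocks to the owner's heap, instead of keeping them in the freeing thread's heap, is what bounds the memory blowup mentioned in the abstract.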