XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines
CIT '10 Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology
Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
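To make the idea of allocation coalescing concrete, the following is a minimal host-side C++ sketch, not the XMalloc implementation. It assumes that a batch of requests (standing in for the requests issued together by a group of GPU threads) is served by a single call to the underlying allocator, and that a shared reference count decides when the combined block can be returned. All names (`coalesced_alloc`, `coalesced_free`, `GroupHeader`, `SubHeader`, `underlying_malloc`, `underlying_free`) are hypothetical, and `std::malloc` stands in for the underlying allocator. Buffering with queues is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// All names here are hypothetical; this is not XMalloc's actual interface.

// Counters standing in for the cost of calling the underlying allocator.
std::atomic<int> underlying_mallocs{0};
std::atomic<int> underlying_frees{0};

void* underlying_malloc(std::size_t n) { ++underlying_mallocs; return std::malloc(n); }
void underlying_free(void* p) { ++underlying_frees; std::free(p); }

// One per coalesced block: counts the sub-blocks that are still in use.
struct GroupHeader {
    explicit GroupHeader(int n) : live(n) {}
    std::atomic<int> live;
};

// One per sub-block: lets coalesced_free() find the shared header.
struct SubHeader {
    GroupHeader* group;
};

constexpr std::size_t kAlign = alignof(std::max_align_t);
constexpr std::size_t round_up(std::size_t n) { return (n + kAlign - 1) / kAlign * kAlign; }

// Serve a whole batch of requests with one underlying allocation.
// Returns one pointer per request, or an empty vector on failure.
std::vector<void*> coalesced_alloc(const std::vector<std::size_t>& sizes) {
    if (sizes.empty()) return {};

    std::size_t total = round_up(sizeof(GroupHeader));
    for (std::size_t s : sizes)
        total += round_up(sizeof(SubHeader)) + round_up(s);

    char* base = static_cast<char*>(underlying_malloc(total));
    if (!base) return {};

    // The group header sits at the start of the combined block.
    auto* group = new (base) GroupHeader(static_cast<int>(sizes.size()));

    std::vector<void*> out;
    out.reserve(sizes.size());
    char* cursor = base + round_up(sizeof(GroupHeader));
    for (std::size_t s : sizes) {
        new (cursor) SubHeader{group};          // back-pointer before the payload
        cursor += round_up(sizeof(SubHeader));
        out.push_back(cursor);                  // payload handed to the requester
        cursor += round_up(s);
    }
    return out;
}

// Release one sub-block; the last release returns the combined block.
void coalesced_free(void* p) {
    if (!p) return;
    auto* sub = reinterpret_cast<SubHeader*>(
        static_cast<char*>(p) - round_up(sizeof(SubHeader)));
    GroupHeader* group = sub->group;
    int before = group->live.fetch_sub(1, std::memory_order_acq_rel);
    assert(before > 0);
    if (before == 1) {
        group->~GroupHeader();
        underlying_free(group);                 // header address == block base
    }
}
```

In this sketch, a batch of 32 requests costs one call to `underlying_malloc` instead of 32, and `underlying_free` is called only after the last of the 32 sub-blocks has been released. The per-sub-block back-pointer is one possible way to locate the shared count; it trades a small space overhead for a constant-time free.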