Scalable SIMD-parallel memory allocation for many-core machines

Authors:
Xiaohuang Huang; Christopher I. Rodrigues; Stephen Jones; Ian Buck; Wen-Mei Hwu
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, USA 61801; University of Illinois at Urbana-Champaign, Urbana, USA 61801; NVIDIA Corporation, Santa Clara, USA 95050; NVIDIA Corporation, Santa Clara, USA 95050; University of Illinois at Urbana-Champaign, Urbana, USA 61801
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

Dynamic memory allocation is an important feature of modern programming systems. However, the cost of memory allocation in massively parallel execution environments such as CUDA has been too high for many types of kernels. This paper presents XMalloc, a high-throughput memory allocation mechanism that dramatically magnifies the allocation throughput of an underlying memory allocator. XMalloc embodies two key techniques: allocation coalescing and buffering using efficient queues. This paper describes these two techniques and presents our implementation of XMalloc as a memory allocator library. The library is designed to be called from kernels executed by massive numbers of threads. Our experimental results based on the NVIDIA G480 GPU show that XMalloc magnifies the allocation throughput of the underlying memory allocator by a factor of 48.
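
The idea behind allocation coalescing can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not XMalloc's implementation or API: the names `CoalescedBlock`, `coalesced_alloc`, `coalesced_free`, and `align_up`, the 16-byte alignment, and the use of `std::malloc` as the "underlying allocator" are all hypothetical. The sketch takes the sizes requested by one group of SIMD lanes, issues a single request for their sum, and hands each lane a slice at its prefix-sum offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical types and names for illustration; not XMalloc's API.
struct CoalescedBlock {
    void* base = nullptr;       // the single allocation shared by the group
    std::vector<void*> slices;  // one pointer per requesting lane
};

// Round n up to a multiple of `align`, which must be a power of two.
inline std::size_t align_up(std::size_t n, std::size_t align = 16) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

// Serve every request of one SIMD group with ONE call to the underlying
// allocator (std::malloc stands in for it here, as an assumption).
CoalescedBlock coalesced_alloc(const std::vector<std::size_t>& sizes) {
    CoalescedBlock out;
    std::vector<std::size_t> offsets(sizes.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        offsets[i] = total;            // exclusive prefix sum of aligned sizes
        total += align_up(sizes[i]);
    }
    out.base = std::malloc(total);     // the only allocator call for the group
    if (out.base == nullptr) return out;  // on failure, slices stays empty
    char* p = static_cast<char*>(out.base);
    for (std::size_t off : offsets) out.slices.push_back(p + off);
    return out;
}

// Release the whole group at once. A real allocator would need per-slice
// frees, for example by counting live slices; that is omitted here.
void coalesced_free(CoalescedBlock& b) {
    std::free(b.base);
    b.base = nullptr;
    b.slices.clear();
}
```

With requests of 8, 24, and 40 bytes, this sketch makes one 96-byte allocation and returns slices at offsets 0, 16, and 48. On a GPU the per-lane sizes would be combined with warp-level primitives rather than a host loop, and the second technique named in the abstract, buffering in queues, is not modeled here.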