Fast dynamic memory allocator for massively parallel architectures

Authors:
Sven Widmer; Dominik Wodniok; Nicolas Weber; Michael Goesele
Affiliations:
Graduate School Computational Engineering, TU Darmstadt; Graduate School Computational Engineering, TU Darmstadt; TU Darmstadt; Graduate School Computational Engineering, TU Darmstadt
Venue:
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Year:
2013

Abstract

Dynamic memory allocation in massively parallel systems often suffers a drastic loss of performance because of the global synchronization it requires. This is especially true when many allocation or deallocation requests occur in parallel. We propose a method to alleviate this problem by exploiting the SIMD parallelism found in most current massively parallel hardware. More specifically, we propose a hybrid dynamic memory allocator that operates at the level of the SIMD-parallel warp. Using additional constraints that can be fulfilled for a large class of practically relevant algorithms and hardware systems, we are able to significantly speed up dynamic allocation. We present and evaluate a prototypical implementation for modern CUDA-enabled graphics cards, achieving an overall speedup of up to several orders of magnitude.
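The abstract does not spell out how warp-level allocation reduces global synchronization, so the following is only a minimal sketch of one common way to do it, not the authors' allocator. It simulates a single warp on the CPU in C++: the requests of all lanes are combined with an exclusive prefix sum, and one atomic update of a shared bump pointer then serves the whole warp. The names `Heap`, `warp_alloc`, `kWarpSize`, and `kFailed` are hypothetical, and the bump-pointer heap (which never frees) is an assumption made to keep the example short.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // lanes per warp (assumed)
constexpr std::size_t kFailed = static_cast<std::size_t>(-1);

// Hypothetical bump-pointer heap: 'top' is the only globally shared state.
struct Heap {
    std::atomic<std::size_t> top{0};
    std::size_t capacity;
    explicit Heap(std::size_t cap) : capacity(cap) {}
};

// Serves all requests of one warp with a single atomic fetch_add.
// sizes[i] is the byte count requested by lane i (0 means no request).
// Returns one heap offset per lane, or kFailed for every lane if the
// combined request does not fit.
std::array<std::size_t, kWarpSize>
warp_alloc(Heap& heap, const std::array<std::size_t, kWarpSize>& sizes) {
    assert(heap.capacity > 0);
    std::array<std::size_t, kWarpSize> offsets{};

    // Exclusive prefix sum: each lane's offset inside the warp's block.
    std::size_t total = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        offsets[lane] = total;
        total += sizes[lane];
    }

    // The "leader" reserves the whole block with one atomic operation
    // instead of one atomic operation per lane.
    const std::size_t base = heap.top.fetch_add(total);
    if (base + total > heap.capacity) {
        heap.top.fetch_sub(total);  // roll back the failed reservation
        offsets.fill(kFailed);
        return offsets;
    }

    // Every lane turns its warp-local offset into a heap offset.
    for (auto& off : offsets) {
        off += base;
    }
    return offsets;
}
```

With 32 lanes each requesting 8 bytes, the shared counter is touched once rather than 32 times, which is the effect the abstract attributes to working at warp granularity. On a GPU the prefix sum and the broadcast of the base address would presumably be done with warp-level primitives rather than a loop, but that mapping is an assumption and is not taken from the paper.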