Optimizing process creation and execution on multi-core architectures

Authors:
Abhishek Kulkarni;Latchesar Ionkov;Michael Lang;Andrew Lumsdaine
Affiliations:
Center for Research in Extreme Scale Technologies, Department of Computer Science, Indiana University, Bloomington, IN, USA, Ultrascale Systems Research Center, Los Alamos National Laboratory, Los ...;Ultrascale Systems Research Center, Los Alamos National Laboratory, Los Alamos, NM, USA;Ultrascale Systems Research Center, Los Alamos National Laboratory, Los Alamos, NM, USA;Center for Research in Extreme Scale Technologies, Department of Computer Science, Indiana University, Bloomington, IN, USA
Venue:
International Journal of High Performance Computing Applications
Year:
2013

Abstract

The execution of a single program, multiple data (SPMD) application involves running multiple instances of a process with possibly varying arguments. With the widespread adoption of massively multicore processors, there has been a focus on harnessing the abundant compute resources effectively and in a power-efficient manner. Although much work has been done on optimizing distributed process launch using hierarchical techniques, the performance of spawning processes within a single node has received little study. Reducing the latency of spawning a new process locally results in faster global job launch. Further, emerging dynamic and resilient execution models are designed on the premise of maintaining process pools for fault isolation and launching several processes in a relatively short period of time. Optimizing the latency and throughput of process spawning would help improve the overall performance of runtime systems, allow adaptive process-replication reliability, and motivate the design and implementation of process management interfaces in future manycore operating systems. In this paper, we study several factors that limit efficient spawning of processes on massively multicore architectures. We have developed a library to optimize launching multiple instances of the same executable. Our microbenchmarks show a 20-80% decrease in the process spawn time for multiple executables. We further discuss the effects of memory locality and propose NUMA-aware extensions to optimize launching processes with large memory-mapped segments, including dynamic shared libraries. Finally, we describe vector operating system interfaces for spawning a batch of processes from a given executable on specific cores. Our results show a speedup by a factor of 40-50 over the traditional method of launching new processes using the fork and exec system calls.
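
For orientation, the traditional baseline mentioned in the abstract (one fork followed by one exec per instance) can be sketched roughly as below. This is only an illustrative sketch for a POSIX system, not the library or the vector interfaces described in the paper; the helper name `spawn_batch` and the choice of a trivial Python child as the workload are assumptions made for the example.

```python
import os
import sys
import time


def spawn_batch(argv, count):
    """Launch `count` instances of `argv` with fork + exec and wait for all.

    Hypothetical helper for illustration only.
    Returns (exit_codes, elapsed_seconds).
    """
    start = time.perf_counter()
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            # Child: replace the process image; exit with 127 if exec fails.
            try:
                os.execvp(argv[0], argv)
            except OSError:
                os._exit(127)
        # Parent: remember the child so it can be reaped later.
        pids.append(pid)

    exit_codes = []
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        # Report -1 for children that did not exit normally (e.g. killed by a signal).
        exit_codes.append(os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1)
    elapsed = time.perf_counter() - start
    return exit_codes, elapsed


if __name__ == "__main__":
    # Assumed workload: a child that starts the interpreter and exits at once.
    codes, elapsed = spawn_batch([sys.executable, "-c", "pass"], 8)
    print(f"spawned {len(codes)} processes in {elapsed * 1e3:.1f} ms")
```

Each instance here pays for its own fork and exec, so the per-process cost is repeated for every instance; this repeated cost is what batch-oriented spawning interfaces aim to reduce.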