jitSim: a simulator for predicting scalability of parallel applications in presence of OS jitter

Authors:
Pradipta De;Vijay Mann
Affiliations:
IBM Research - India, New Delhi;IBM Research - India, New Delhi
Venue:
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Year:
2010



Abstract

Operating system (OS) jitter has traditionally been a source of performance degradation for parallel applications running on a large number of processors. While some large-scale HPC systems, such as Blue Gene/L and Cray XT4, mitigate jitter by using a specialized lightweight operating system on compute nodes, other clusters have attempted to use HPC-ready commodity operating systems such as ZeptoOS (based on Linux). However, as large systems continue to be designed to work with commodity OSes, OS jitter remains an active area of research within the HPC community. Although some of the specialized commodity OSes, like ZeptoOS, have relatively low jitter levels, there is still a need for a quick and easy set of tools that can predict the impact of OS jitter for a given configuration and processor count. Such tools are also required to validate and compare new techniques or OS enhancements that mitigate jitter. Emulating jitter on a large "jitter-free" platform, using either synthetic jitter or real traces from commodity OSes, has been proposed as one useful mechanism for studying scalability behavior in the presence of jitter. However, this requires access to large-scale jitter-free systems, which are few in number and not easily accessible. As new systems are built that should scale to a million tasks and more, the emulation approach remains limited by the largest jitter-free system available. In this paper we present jitSim, a simulation framework for predicting the scalability of parallel compute-intensive applications in the presence of OS jitter using trace-driven simulation. The framework can be used to quickly simulate the effects of the jitter that is characteristic of a given OS, using a given trace. Furthermore, it can be used to predict scalability up to arbitrarily large task counts.
Our methodology comprises the collection of real jitter traces and the measurement of network latency, message-passing stack latency, and shared-memory latency. The simulation framework takes these as inputs and then simulates multiple parallel tasks, each starting at a randomly chosen point in the jitter trace and executing a compute phase. We validate the simulation results by comparing them with real data and demonstrate the efficacy of the framework by evaluating various jitter mitigation techniques through simulation.
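
To make the trace-driven idea concrete, the following is a minimal Python sketch of one simulated compute phase. It is not jitSim's actual implementation. It rests on several assumptions that the abstract does not state: the trace is a sorted list of non-overlapping `(start, duration)` jitter events inside a window of length `trace_len`, the trace repeats cyclically, a jitter event fully preempts the task, the tasks synchronize at the end of the phase, and the measured network, message-passing stack, and shared-memory latencies are lumped into a single `comm_latency` value. The names `compute_phase_time` and `simulate_phase` are hypothetical.

```python
import random
from bisect import bisect_left


def compute_phase_time(events, trace_len, start, work):
    """Elapsed time for `work` units of compute begun at trace offset `start`.

    Assumes `events` is sorted, non-overlapping, lies within [0, trace_len),
    and leaves some jitter-free time in every pass over the trace.
    """
    if not events:
        return work
    starts = [s for s, _ in events]
    i = bisect_left(starts, start)  # first event starting at or after `start`
    # Step back if the previous event is still running when the task starts.
    if i > 0 and events[i - 1][0] + events[i - 1][1] > start:
        i -= 1
    t, remaining, lap = start, work, 0
    while True:
        if i == len(events):  # wrap around: replay the trace cyclically
            i, lap = 0, lap + 1
        ev_start = events[i][0] + lap * trace_len
        ev_end = ev_start + events[i][1]
        gap = max(0, ev_start - t)  # jitter-free time before this event
        if remaining <= gap:
            return t + remaining - start
        remaining -= gap
        t = max(t, ev_end)  # the task is preempted until the event ends
        i += 1


def simulate_phase(events, trace_len, num_tasks, work, comm_latency, rng):
    """One compute phase across `num_tasks` tasks, gated by the slowest task."""
    slowest = max(
        compute_phase_time(events, trace_len, rng.random() * trace_len, work)
        for _ in range(num_tasks)
    )
    return slowest + comm_latency


if __name__ == "__main__":
    trace = [(10, 2), (50, 5)]  # hypothetical jitter events, arbitrary time units
    rng = random.Random(42)
    for n in (1, 16, 256, 4096):
        print(n, simulate_phase(trace, 100, n, 20, 3, rng))
```

Under these assumptions, the phase time is set by the slowest task, so increasing `num_tasks` raises the chance that at least one task starts just before a long jitter event. This is the scaling effect that a simulator of this kind is meant to expose without access to a large jitter-free machine.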