Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System

Authors:
Terry Jones; Shawn Dawson; Rob Neely; William Tuel; Larry Brenner; Jeffrey Fier; Robert Blackmore; Patrick Caffrey; Brian Maskell; Paul Tomlinson; Mark Roberts
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA; Lawrence Livermore National Laboratory, Livermore, CA; Lawrence Livermore National Laboratory, Livermore, CA; International Business Machines Corporation, Armonk, NY; International Business Machines Corporation, Armonk, NY; International Business Machines Corporation, Armonk, NY; International Business Machines Corporation, Armonk, NY; International Business Machines Corporation, Armonk, NY; Atomic Weapons Establishment, Aldermaston Reading, UK; Atomic Weapons Establishment, Aldermaston Reading, UK; Atomic Weapons Establishment, Aldermaston Reading, UK
Venue:
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Year:
2003

Abstract

A parallel application benefits from scheduling policies that take a global view of the application's process working set. As interactions among cooperating processes increase, mechanisms that reduce waiting within one or more of the processes become more important. In particular, collective operations such as barriers and reductions are extremely sensitive to events that are usually harmless, such as context switches among members of the process working set. For the last 18 months, we have been researching the impact of random short-lived interruptions, such as timer-decrement processing and periodic daemon activity, and developing strategies to minimize their impact on SPMD bulk-synchronous programs running at large processor counts. We present a novel co-scheduling scheme for improving the performance of fine-grain collective activities such as barriers and reductions, describe an implementation consisting of operating system kernel modifications and a run-time system, and present a set of empirical results comparing the technique with traditional operating system scheduling. Our results indicate a speedup of over 300% on synchronizing collectives.
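The sensitivity of synchronizing collectives to short interruptions can be illustrated with a toy model. The sketch below is not the paper's implementation or measurement method; it assumes a bulk-synchronous loop in which every process does a fixed amount of work and then waits at a barrier, and in which each process is hit by an interruption with a fixed probability per step. The function name `mean_step_time` and all parameter values are hypothetical. In the uncoordinated case each process draws its interruption independently; in the co-scheduled case the interruptions are assumed to be aligned in time across all processes, so the barrier pays for at most one of them per step.

```python
import random


def mean_step_time(num_procs, steps, work, p_interrupt, interrupt_cost,
                   coscheduled, seed=0):
    """Average time of one compute-then-barrier step in a toy noise model.

    Assumption (not from the paper): an interruption adds a fixed delay to a
    process, and the barrier completes when the slowest process arrives.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        if coscheduled:
            # Interruptions are aligned across processes: one shared draw.
            hit = rng.random() < p_interrupt
            delays = [interrupt_cost if hit else 0.0] * num_procs
        else:
            # Each process is interrupted independently of the others.
            delays = [interrupt_cost if rng.random() < p_interrupt else 0.0
                      for _ in range(num_procs)]
        # The barrier waits for the most delayed process.
        total += work + max(delays)
    return total / steps


if __name__ == "__main__":
    for n in (1, 16, 256, 1024):
        uncoordinated = mean_step_time(n, 2000, 1.0, 0.01, 1.0, False)
        aligned = mean_step_time(n, 2000, 1.0, 0.01, 1.0, True)
        print(n, round(uncoordinated, 3), round(aligned, 3))
```

Under these assumptions, the chance that at least one process is interrupted in a given step grows with the process count in the uncoordinated case, so nearly every barrier is delayed at large scale, while the aligned case stays close to the single-process cost.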